[00:00:18] (03PS1) 10Dzahn: Revert "stewards: migrate stewards1001 to puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/973802 [00:00:29] (03CR) 10Dzahn: [C: 03+2] Revert "stewards: migrate stewards1001 to puppet7" [puppet] - 10https://gerrit.wikimedia.org/r/973802 (owner: 10Dzahn) [00:03:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards1001.eqiad.wmnet with OS bookworm [00:03:54] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet wi... [00:04:05] PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [00:14:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage [00:16:09] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [00:21:00] (03Abandoned) 10BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/967293 (https://phabricator.wikimedia.org/T345939) (owner: 10BCornwall) [00:23:22] (03PS1) 10Dzahn: Revert "Revert "stewards: migrate stewards1001 to puppet7"" [puppet] - 10https://gerrit.wikimedia.org/r/973803 [00:27:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stewards1001.eqiad.wmnet with OS bookworm [00:27:52] 10SRE, 
10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with O... [00:31:41] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host stewards1001.eqiad.wmnet [00:32:03] PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [00:32:15] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "stewards: migrate stewards1001 to puppet7"" [puppet] - 10https://gerrit.wikimedia.org/r/973803 (owner: 10Dzahn) [00:33:56] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stewards1001.eqiad.wmnet [00:38:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/973413 [00:39:03] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/973413 (owner: 10TrainBranchBot) [00:40:56] (03PS1) 10BCornwall: fifo-log-demux: Update project homepage [puppet] - 10https://gerrit.wikimedia.org/r/973887 (https://phabricator.wikimedia.org/T347623) [00:42:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:43:43] PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[00:58:56] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/973413 (owner: 10TrainBranchBot) [01:02:50] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351144 (10phaultfinder) [01:40:32] (03CR) 10Ryan Kemper: [C: 03+1] search-loader: remove references to search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) (owner: 10Bking) [01:57:33] RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [01:59:41] RECOVERY - ensure kvm processes are running on cloudvirt1037 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:26:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bookworm [02:38:54] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [02:46:25] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage [02:59:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bookworm [03:00:06] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0300) [03:07:38] (03PS1) 10TrainBranchBot: Branch commit 
for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081) [03:07:44] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [03:08:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bookworm [03:08:54] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [03:13:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [03:16:14] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage [03:22:51] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [03:22:58] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [03:25:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage [03:33:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm [03:43:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bookworm [03:46:06] !log andrew@cumin1001 END (FAIL) - 
Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1042.eqiad.wmnet with OS bookworm [03:48:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm [03:49:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bookworm [03:49:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [03:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0400) [04:01:27] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage [04:01:36] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081) [04:01:38] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [04:02:32] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [04:02:57] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.5 refs T350081 [04:03:01] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [04:04:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: 
host reimage [04:22:23] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm [04:22:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [04:31:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bookworm [04:42:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:54:13] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.5 refs T350081 (duration: 51m 15s) [04:54:17] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [04:58:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm [05:03:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bookworm [05:05:17] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm [05:17:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [05:18:18] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1044.eqiad.wmnet with OS bookworm [05:18:36] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm [05:20:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage [05:42:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudvirt1045.eqiad.wmnet with OS bookworm [05:45:31] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1044.eqiad.wmnet with OS bookworm [06:09:59] 10SRE-swift-storage, 10Move-Files-To-Commons, 10WMDE-TechWish-Maintenance, 10MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), 10Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (10Kizule) 05Open→03Invalid Not happening anymor... [06:13:43] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (10Kizule) 05Open→03Resolved Then let's close this in order to have less confusion. :) [06:50:44] (03PS1) 10Muehlenhoff: Add dedicated insetup role for Buster [puppet] - 10https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619) [07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0700) [07:00:04] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0700). 
nyaa~ [07:00:54] (03CR) 10Marostegui: "Thanks, normally this requires a restart on sanitarium, but given it is on x1, we don't have to do it now, and it can be done whenever the" [puppet] - 10https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [07:03:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch [07:03:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch [07:08:54] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:11:55] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1119 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui) [07:27:57] !log include golang-github-mmatczuk-anyflag_0.0~git20231026.5f42d2f in apt.wm.org (bookworm) [07:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:32] (03PS1) 10Slyngshede: Implement stricter permission checks [software/bitu] - 10https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143) [07:39:34] !log stop bacula dir (and puppet) at backup1001 T350022 [07:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:38] T350022: Switchover m1 master (db1164-> db1119) - https://phabricator.wikimedia.org/T350022 [07:40:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.9066639290823906s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:41:48] prometheus job for bacula will complain, as it only has one job, which I stopped [07:42:00] will ack when it complains [07:45:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.8177954390190108s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:48:20] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974071 (https://phabricator.wikimedia.org/T349090) [07:48:54] (JobUnavailable) firing: (3) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:15] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974071 (https://phabricator.wikimedia.org/T349090) (owner: 10Marostegui) [07:51:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [07:52:47] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Implement stricter permission checks [software/bitu] - 10https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [07:53:54] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:57:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10DMburugu) I've discussed this with @Urbanecm and I approve his access. [07:59:36] !log installing dbus security updates on bullseye [07:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0800). [08:00:05] apergos: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:16] o/ [08:02:42] who's running the window today? [08:04:25] Amir1 or urbanecm either of you around? [08:04:45] !log Failover m1 from db1164 to db1119 - T350022 [08:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:50] T350022: Switchover m1 master (db1164-> db1119) - https://phabricator.wikimedia.org/T350022 [08:05:03] all done [08:05:16] 👏 [08:05:24] should we merge the other patch? 
[08:05:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:05:28] yep [08:05:35] etherpad seems to be fine [08:05:39] no restart required [08:05:48] (03CR) 10Marostegui: [C: 03+2] dbbackups: Switchover master from db1164 to db1119 [puppet] - 10https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022) (owner: 10Jcrespo) [08:06:12] !log installing nghttp2 security updates [08:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:33] will run puppet on backupmon, backup1001 [08:06:41] when you tell me, marostegui [08:06:45] jynus: go for it [08:07:31] (03CR) 10Ayounsi: "Please make sure someone from Traffic (hello Sukhe :) ) had a look as well given how it's tied to critical parts of the infra (DNS)." [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [08:08:09] icinga checks should be updated now [08:08:15] bacula is still starting [08:08:19] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:34] moritzm: ^that is probably because of the m1 switchover [08:08:53] maybe try restarting it? [08:09:02] yeah [08:09:12] hrm with no backport deployment window runner, I feel uneasy just self-deploying anyways... guess I'll wait and see if one of them turns up [08:09:13] first time it happens [08:09:41] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:21] (03CR) 10Ayounsi: "Change overall lgtm but I don't know enough about nftables to properly review it." 
[puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [08:10:22] bacula should be backup up [08:10:33] running a backup to confirm [08:10:44] jynus: great, when done let me know, so I can reimage the old master [08:11:05] k, I'm restarting cfssl-ocsprefresh-debmonitor.service to be on the safe side [08:11:13] moritzm: I am doing it :) [08:11:21] And it is taking ages btw [08:12:29] (03PS1) 10Slyngshede: Stricter checking of user id when updating email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/974073 [08:12:44] 12.69 G OK 14-Nov-23 08:12 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data [08:12:53] ^ marostegui [08:13:27] everything looking good on my side [08:13:29] (03PS2) 10Slyngshede: Stricter checking of user id when updating email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/974073 [08:13:44] PROBLEM - MariaDB read only m1 #page on db1164 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [08:13:54] (JobUnavailable) firing: (3) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:13:59] oh [08:14:10] puppet didnt update? [08:14:36] yeah I guess [08:14:38] Sorry for the page [08:15:19] it is not that [08:15:25] it says "Could not connect to localhost:3306" [08:15:41] 0 processes with command name 'mysqld' did it crash? 
[08:15:44] No [08:15:47] I stopped mysql [08:15:48] or just downtime [08:15:57] but puppet didn't run on icinga yet, so notifications were enabled [08:16:00] ah, good, then the procedure itself worked [08:16:35] it was just the "maintenance" after the switch [08:16:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1164.eqiad.wmnet with OS bookworm [08:17:30] * Emperor arrives with first tea of the day [08:18:38] moritzm: did the debmonitor alert got fixed? [08:18:50] yes, check irc [08:18:56] it was fixed with the restart [08:19:04] [09:09:41] <+icinga-wm> RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:05] yeah, it recovered with the restart [08:19:19] it took a while to restart though, I was surprised [08:19:39] it usually takes >3 min [08:19:50] please add those to the docs: https://wikitech.wikimedia.org/wiki/MariaDB/misc#m1 [08:20:04] so next time it is not a surprise [08:20:39] the prometheus exporter for bacula didn't recover, so doing another manual restart [08:20:45] done [08:20:51] (added to the docs) [08:21:57] I will add that too, although I think it is a bug on daemon config for dependencies [08:26:07] actually, the exporter is ok, but I think there is some lag on the alerting [08:26:22] should recover after whatever is the window for checking [08:27:26] (03PS1) 10Marostegui: Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - 10https://gerrit.wikimedia.org/r/973804 [08:27:33] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/973804 (owner: 10Marostegui) [08:28:35] (03CR) 10Jcrespo: [C: 03+1] Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - 10https://gerrit.wikimedia.org/r/973804 (owner: 10Marostegui) [08:28:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage 
[08:29:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/974073 (owner: 10Slyngshede) [08:30:07] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Stricter checking of user id when updating email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/974073 (owner: 10Slyngshede) [08:32:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage [08:37:35] (03CR) 10Muehlenhoff: [C: 03+1] P:bird::anycast: migrate to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [08:41:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:42:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:27] (03PS3) 10Elukey: services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) [08:42:29] (03PS3) 10Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [08:42:31] (03PS3) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) [08:44:29] (03CR) 10Marostegui: [C: 03+1] mariadb: add db1238 and prepare db1138 retirement [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [08:46:12] (03PS1) 10Hashar: Archive repository 
[software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623) [08:46:24] (03CR) 10CI reject: [V: 04-1] Archive repository [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [08:46:58] (03CR) 10Hashar: [V: 03+2 C: 03+2] Archive repository [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623) (owner: 10Hashar) [08:52:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1164.eqiad.wmnet with OS bookworm [08:52:42] (03PS1) 10Slyngshede: Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) [08:53:42] (03CR) 10Slyngshede: "Version 0.0.3 have already been deployed in production and test." [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [08:55:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [08:56:17] (03PS2) 10Slyngshede: Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) [08:56:17] !log add 80g to prometheus/ops in eqiad [08:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:25] !log add 80g to prometheus/k8s-ml-serve in eqiad [08:56:26] jouncebot: now [08:56:26] For the next 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0800) [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:30] (03CR) 10Slyngshede: Version 0.0.3 - Block unauthorized access to keys. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [08:56:39] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - 10https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: 10Slyngshede) [08:57:42] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) [08:57:47] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974106 [08:59:23] (03PS1) 10Marostegui: pc1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974107 [08:59:32] (03CR) 10Brouberol: [V: 03+1] Generate the netboot.cfg file to avoid typos impacting everyone (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [08:59:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one optional nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [09:00:01] (03CR) 10Marostegui: [C: 03+2] pc1013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974107 (owner: 10Marostegui) [09:00:24] (03CR) 10Muehlenhoff: [C: 03+2] Add dedicated insetup role for Buster [puppet] - 10https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:02:26] (03CR) 10Jcrespo: [C: 03+1] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974106 (owner: 10Marostegui) [09:03:13] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974106 (owner: 10Marostegui) [09:03:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch [09:04:01] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974106 (owner: 10Marostegui) [09:04:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch [09:05:12] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]] [09:06:44] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:07:01] !log marostegui@deploy2002 marostegui: Continuing with sync [09:08:11] (03CR) 10Elukey: [C: 03+2] services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [09:11:55] (03PS1) 10Marostegui: Revert 
"ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805 [09:12:11] (03PS1) 10Marostegui: Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974126 [09:12:36] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 07m 24s) [09:13:07] (03PS1) 10Volans: sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) [09:13:34] 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) 05Open→03Resolved [09:15:48] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable initial-import job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973866 (owner: 10Kosta Harlan) [09:16:41] (03CR) 10Volans: sre.hosts.decommission: remove also from Puppet7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans) [09:16:50] (03Merged) 10jenkins-bot: ipoid: Disable initial-import job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973866 (owner: 10Kosta Harlan) [09:18:13] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805 (owner: 10Marostegui) [09:18:58] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805 (owner: 10Marostegui) [09:19:04] (03PS1) 10Elukey: services: add kafka base settings for cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974109 [09:19:25] (03CR) 10Marostegui: [C: 03+2] Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974126 
(owner: 10Marostegui) [09:19:34] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] [09:20:01] (03PS3) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 [09:20:34] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan) [09:20:38] (03CR) 10Elukey: [C: 03+2] services: add kafka base settings for cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974109 (owner: 10Elukey) [09:20:49] (03CR) 10CI reject: [V: 04-1] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan) [09:20:58] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:21:08] !log marostegui@deploy2002 marostegui: Continuing with sync [09:21:12] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10MatthewVernon) @Urbanecm_WMF I think this is awaiting confirmation from @KFrancis that an NDA has been signed (and that we have a legal name on file), per the comment from 1... 
[09:22:36] (03PS4) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 [09:23:15] (03CR) 10CI reject: [V: 04-1] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan) [09:25:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:25:06] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [09:25:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:26:18] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [09:26:36] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 07m 02s) [09:27:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:27:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:28:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:28:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:28:41] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging-etcd2003.codfw.wmnet [09:28:54] (03PS3) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) [09:29:42] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113 [09:30:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:30:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:30:37] (03PS1) 10Marostegui: pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974114 [09:30:46] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113 (owner: 10Marostegui) [09:31:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch [09:31:07] (03CR) 10Marostegui: [C: 03+2] pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974114 (owner: 10Marostegui) [09:31:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch [09:31:32] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113 (owner: 10Marostegui) [09:32:00] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]] [09:32:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [09:32:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [09:33:16] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:33:27] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:33:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:33:30] (03CR) 10Jbond: [C: 03+1] "lgtm feel free to merge or +1 and i will 😊" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [09:33:31] !log marostegui@deploy2002 marostegui: Continuing with sync [09:33:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:33:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:33:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53379 and previous config saved to /var/cache/conftool/dbconfig/20231114-093353-arnaudb.json [09:33:55] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye [09:34:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) [09:34:34] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - 
https://phabricator.wikimedia.org/T348183 [09:34:44] (03PS5) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 [09:35:35] (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [09:36:24] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan) [09:36:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53380 and previous config saved to /var/cache/conftool/dbconfig/20231114-093625-arnaudb.json [09:36:30] !log reimaging kubestage2002 to verify with puppet7 [09:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:43] (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) [09:37:14] (03Merged) 10jenkins-bot: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan) [09:38:00] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:38:29] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/449/con" [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [09:38:31] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:38:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974116 
(https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [09:39:01] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-staging-etcd2003 to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [09:39:11] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]] (duration: 07m 11s) [09:39:12] (03CR) 10Arnaudb: [C: 03+2] mariadb: add db1238 and prepare db1138 retirement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [09:39:34] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127 [09:40:06] (03PS1) 10Marostegui: Revert "pc2013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974128 [09:43:18] (03PS6) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) [09:43:20] (03PS6) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) [09:43:22] (03PS6) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) [09:43:24] (03PS6) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [09:43:46] (03CR) 10EoghanGaffney: [C: 03+1] gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [09:43:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:43:57] PROBLEM - mailman 
archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:30] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127 (owner: 10Marostegui) [09:44:45] (03CR) 10Marostegui: [C: 03+2] Revert "pc2013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974128 (owner: 10Marostegui) [09:45:10] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127 (owner: 10Marostegui) [09:45:23] (03CR) 10Muehlenhoff: [C: 03+1] P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [09:45:30] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging-etcd2003.codfw.wmnet [09:45:36] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] [09:45:45] (03CR) 10Muehlenhoff: [C: 03+1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [09:46:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:46:23] (03PS1) 10Kosta Harlan: ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119 [09:46:28] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119 (owner: 10Kosta Harlan) [09:47:03] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [09:47:13] (03CR) 10Majavah: [C: 03+2] cloudlb: haproxy: migrate to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [09:47:15] (03Merged) 10jenkins-bot: ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119 (owner: 10Kosta Harlan) [09:47:21] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:47:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 8.354 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:47:27] !log marostegui@deploy2002 marostegui: Continuing with sync [09:48:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:48:39] (03CR) 10Filippo Giunchedi: [C: 03+1] webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [09:49:39] (03PS1) 10Stevemunene: druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120 [09:50:35] (03CR) 10Filippo Giunchedi: [C: 03+1] Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [09:51:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P53383 and previous config saved to /var/cache/conftool/dbconfig/20231114-095132-arnaudb.json [09:51:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo jobs configuration comments in related task" [puppet] - 
10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [09:52:00] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [09:52:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/450/con" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [09:53:02] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] (duration: 07m 26s) [09:53:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: provisionning db1238.eqiad.wmnet - T344036 [09:53:27] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [09:53:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: provisionning db1238.eqiad.wmnet - T344036 [09:53:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: provisionning db1238.eqiad.wmnet - T344036 [09:53:54] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: provisionning db1238.eqiad.wmnet - T344036 [09:54:16] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [09:54:54] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [09:55:12] (03CR) 10Filippo Giunchedi: Send metrics from Airflow analytics test (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968285 
(https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [09:55:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:56:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:57:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:57:03] (03CR) 10Majavah: [C: 04-1] use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [09:57:12] (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951) [10:01:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:02:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:03:01] jouncebot: nowandnext [10:03:02] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [10:03:02] In 0 hour(s) and 56 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1100) [10:03:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:49] train presync failed last night, rerunning it now [10:03:56] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.5 refs T350081 [10:04:01] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [10:04:41] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/973418 (https://phabricator.wikimedia.org/T351184) [10:04:54] 10Puppet, 10Wikidata, 10Wikidata Analytics, 10wmde-wikidata-tech, 10Technical-Debt: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072 (10Lucas_Werkmeister_WMDE) [10:05:29] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::ml_etcd::staging [10:06:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P53384 and previous config saved to /var/cache/conftool/dbconfig/20231114-100638-arnaudb.json [10:07:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T351184 [10:07:55] T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184 [10:08:17] !log arnaudb@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T351184 [10:08:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T351184', diff saved to https://phabricator.wikimedia.org/P53385 and previous config saved to /var/cache/conftool/dbconfig/20231114-100843-arnaudb.json [10:10:31] (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:10:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) Some additional information * puppet7 agents can talk to both centrallog1002 and ce... [10:10:44] (03PS1) 10Hnowlan: page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) [10:11:23] (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:11:28] (03CR) 10Klausman: [C: 03+2] hiera: migrate ML staging etcd role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974121 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:15:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::ml_etcd::staging [10:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:41] (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) [10:17:07] (03CR) 10ArielGlenn: use virtual db domain for CentralAuth, 
GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [10:17:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:18:07] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:21:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) from a very simple test this appears to only affect buster ` # in the following eve... [10:21:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53386 and previous config saved to /var/cache/conftool/dbconfig/20231114-102145-arnaudb.json [10:21:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:21:49] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:22:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:22:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53387 and previous config saved to /var/cache/conftool/dbconfig/20231114-102206-arnaudb.json [10:24:16] !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.5 refs T350081 (duration: 20m 19s) 
[10:24:21] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [10:24:40] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) [10:25:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53388 and previous config saved to /var/cache/conftool/dbconfig/20231114-102517-arnaudb.json [10:25:42] (03PS1) 10MVernon: admin: ngkountas to have a shell account in the restricted group [puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779) [10:25:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [10:25:58] !log imported 5.1.19+4.0.11-3+wmf2+bullseye1 to component/php74 for bullseye-wikimedia [10:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:24] !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.3 (duration: 02m 06s) [10:26:45] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging-ctrl2002.codfw.wmnet [10:29:32] (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-staging-ctrl2002.codfw.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974146 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:33:43] (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/973418 (https://phabricator.wikimedia.org/T351184) (owner: 10Gerrit maintenance bot) [10:33:52] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging-ctrl2002.codfw.wmnet [10:34:30] !log Starting s4 eqiad failover from db1138 to db1160 - T351184 [10:34:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:34] T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184 [10:36:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db1160 to s4 primary T351184', diff saved to https://phabricator.wikimedia.org/P53389 and previous config saved to /var/cache/conftool/dbconfig/20231114-103601-arnaudb.json [10:38:24] !log imported php-redis 5.3.2+4.3.0-2+deb11u1+wmf2+bullseye1 to component/php74 for bullseye-wikimedia [10:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) Feels like this could be related to https://bugs.debian.org/cgi-bin/bugreport.cgi?bu... [10:39:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'T351184 - weight mirror', diff saved to https://phabricator.wikimedia.org/P53390 and previous config saved to /var/cache/conftool/dbconfig/20231114-103941-arnaudb.json [10:39:51] T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184 [10:40:12] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::staging::master [10:40:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P53391 and previous config saved to /var/cache/conftool/dbconfig/20231114-104024-arnaudb.json [10:41:40] (03PS1) 10Elukey: profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) [10:41:42] (03PS1) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) [10:42:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" 
[puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779) (owner: 10MVernon) [10:42:49] (03CR) 10EoghanGaffney: [C: 03+1] gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:43:08] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [10:46:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'migrate db1138 to db1238 - T344036', diff saved to https://phabricator.wikimedia.org/P53392 and previous config saved to /var/cache/conftool/dbconfig/20231114-104603-arnaudb.json [10:46:08] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:46:30] (03PS1) 10Kamila Součková: kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) [10:46:33] (03PS2) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) [10:48:00] (03PS1) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) [10:48:10] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::staging::master [10:48:29] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:49:13] (03PS2) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) [10:49:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/452/con" [puppet] - 
10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [10:49:28] (03CR) 10Marostegui: [C: 03+1] mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:49:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [10:49:36] (03PS3) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) [10:50:14] (03CR) 10Arnaudb: [C: 03+2] mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:50:45] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging2001.codfw.wmnet [10:51:33] (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951) [10:54:34] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1138.eqiad.wmnet onto db1238.eqiad.wmnet [10:54:36] (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-staging2001 to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974152 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:55:16] !log imported php-msgpack 2.1.2+0.5.7-2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [10:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P53393 and previous config saved to /var/cache/conftool/dbconfig/20231114-105530-arnaudb.json [10:56:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post 
puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) > > edit: or possibly this one https://github.com/rsyslog/rsyslog/issues/4035 ok i... [10:57:21] (03CR) 10MVernon: [C: 03+2] admin: ngkountas to have a shell account in the restricted group [puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779) (owner: 10MVernon) [10:57:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-presto1001.eqiad.wmnet [10:58:24] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging2001.codfw.wmnet [10:58:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This is now done (modulo time for... [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1100) [11:01:11] (03PS1) 10Muehlenhoff: Switch an-presto1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974153 (https://phabricator.wikimedia.org/T349619) [11:01:53] (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [11:02:16] (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [11:04:42] (03PS1) 10MVernon: admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834) [11:05:41] (03CR) 10CI reject: [V: 04-1] admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834) 
(owner: 10MVernon) [11:06:59] Emperor: fyi, there's https://gerrit.wikimedia.org/r/c/operations/puppet/+/972911 by Daniel already ready :) [11:07:33] doh [11:07:41] (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runners in codfw [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951) [11:08:13] (03Abandoned) 10MVernon: admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834) (owner: 10MVernon) [11:08:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch an-presto1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974153 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:09:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [11:09:18] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans) [11:09:39] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::staging::worker [11:09:42] (03CR) 10MVernon: [C: 03+2] admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [11:10:04] (03PS6) 10MVernon: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [11:10:29] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T348183)', diff saved 
to https://phabricator.wikimedia.org/P53394 and previous config saved to /var/cache/conftool/dbconfig/20231114-111037-arnaudb.json [11:10:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:10:49] I'll rebase the CR and then merge it [11:10:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:10:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:11:01] (assuming CI still content) [11:11:39] (03CR) 10Klausman: [C: 03+2] hiera: migrate ML staging worker role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974156 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [11:12:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:13:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [11:13:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53395 and previous config saved to /var/cache/conftool/dbconfig/20231114-111316-arnaudb.json [11:15:40] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::staging::worker [11:15:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53396 and previous config saved to /var/cache/conftool/dbconfig/20231114-111549-arnaudb.json [11:15:54] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:16:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 
stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10MatthewVernon) 05Open→03Resolved a:05DMburugu→03MatthewVernon Done (once puppet has done its magic). [11:17:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) I can confirm that e.g. bookworm hosts are sending syslog fine, e.g. titan1002:... [11:17:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [11:18:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-presto1001.eqiad.wmnet [11:19:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) Ditto bullseye: ` centrallog2002:~$ tail -5 /srv/syslog/thanos-fe1001/syslog.l... [11:19:55] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) I think this request needs management approval? Which would be @OSefu-WMF for @Hghani and @kzimmerman for @OSefu-WMF. Can you both approve the relevant request, please? [11:21:34] (03CR) 10Hnowlan: [C: 03+2] page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [11:21:38] (03PS1) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) [11:22:14] (03CR) 10Santiago Faci: [C: 03+1] "It looks good! Thanks!!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [11:22:17] (03Merged) 10jenkins-bot: page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [11:23:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Jelto) [11:24:16] 10SRE, 10Infrastructure-Foundations, 10Stewards-Onboarding-Tool, 10Stewards-and-global-tools, and 2 others: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) [11:25:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host gitlab1003.wikimedia.org [11:25:50] (03PS1) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) [11:26:37] (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:26:55] (03CR) 10Jcrespo: "This is my second attempt, and thanks to Riccardo's help, it looks much cleaner now!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:28:11] (03PS1) 10Muehlenhoff: Switch gitlab1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974160 (https://phabricator.wikimedia.org/T349619) [11:28:51] (03CR) 10Jcrespo: "Unit test works for me locally, Could I be missing a dependency for CI?" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:29:17] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:29:52] (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974160 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:30:40] (03CR) 10MVernon: RemoteExecution: Add comments and fix a few lint errors (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:30:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P53397 and previous config saved to /var/cache/conftool/dbconfig/20231114-113055-arnaudb.json [11:31:11] (03PS2) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) [11:32:56] (03CR) 10Jcrespo: "Let's focus on the real fix first (the next patch), then the things we find along the way, otherwise we will never finish :-D" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:34:00] (03PS2) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) [11:34:44] (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:34:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gitlab1003.wikimedia.org [11:35:50] (03CR) 10Jcrespo: 
"that ain't it" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:36:46] (03PS3) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) [11:37:27] (03CR) 10Filippo Giunchedi: "LGTM, I'll let Keith vote though" [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [11:37:30] (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:38:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [11:38:43] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10LSobanski) The alert has since recovered but looking at the names in the linked change I'm adding Data Platform SRE to rev... [11:40:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::presto::server [11:40:47] (03CR) 10Jcrespo: "Ah, it is the cumin version, it is hardcoded." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:41:35] (03PS4) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) [11:42:29] (03PS1) 10Muehlenhoff: Switch analytics_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974162 (https://phabricator.wikimedia.org/T349619) [11:44:11] (03CR) 10Jcrespo: "I will add a "cumin>=4.2.0" I guess?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:45:13] !log imported xdebug 3.0.3+2.9.8+2.8.1+2.5.5-0+deb11u1+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [11:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:56] (03PS5) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) [11:46:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P53398 and previous config saved to /var/cache/conftool/dbconfig/20231114-114602-arnaudb.json [11:46:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:51:08] (03PS4) 10Volans: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [11:51:20] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [11:53:55] (PuppetFailure) firing: 
Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:56:17] (03CR) 10Jcrespo: "This should be ready now for review- I don't expect you to ok'ed the merge as it is, just to sanity check and confirm this is the right ap" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo) [11:56:18] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:59:49] PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53399 and previous config saved to /var/cache/conftool/dbconfig/20231114-120108-arnaudb.json [12:01:10] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::presto::server [12:01:23] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [12:01:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53400 and previous config saved to /var/cache/conftool/dbconfig/20231114-120129-arnaudb.json [12:01:33] T348183: Apply schema 
change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:02:04] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53401 and previous config saved to /var/cache/conftool/dbconfig/20231114-120401-arnaudb.json [12:04:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:05:32] 10SRE, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (10Urbanecm) [12:05:35] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [12:06:03] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [12:06:36] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [12:06:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gitlab [12:06:50] 10SRE, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (10Urbanecm) FTR, I'm currently working on automating the various MediaWiki accesses (group membership, accounts on private wikis, etc.), but I... 
[12:07:03] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [12:07:04] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:08:05] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [12:08:31] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [12:08:33] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:59] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10OSefu-WMF) Approved! [12:09:17] (03PS1) 10Muehlenhoff: Switch gitlab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619) [12:11:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:11:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) >>! In T351181#9329892, @jbond wrote: >> >> edit: or possibly this one https://gith... 
[12:11:25] (03CR) 10Jelto: [C: 03+1] "lgtm, tests on gitlab1003 were ok" [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:13:05] (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:13:55] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:15:13] (03CR) 10JMeybohm: [C: 04-1] "I would propose to create a calico networkpolicy instead to not have to not introduce another use of kubernetesMasters.cidrs (ideally that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:16:18] (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:17:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gitlab [12:19:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P53402 and previous config saved to /var/cache/conftool/dbconfig/20231114-121908-arnaudb.json [12:19:29] (03CR) 10Jbond: [C: 04-1] "thanks for all the work CR looks good but some minor things around style guide issues and ode placement" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [12:20:44] !log jayme@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye [12:22:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans) [12:22:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:22:47] (03CR) 10Jbond: [C: 03+2] puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:23:10] (03CR) 10Jbond: [C: 03+2] webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:29:39] (03Merged) 10jenkins-bot: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [12:32:22] (03PS1) 10Btullis: Increase the size of the innodb pool on analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) [12:33:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::analytics::backup [12:33:44] (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:34:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P53403 and previous config saved to /var/cache/conftool/dbconfig/20231114-123414-arnaudb.json [12:35:14] (03PS1) 10Btullis: Enable notifications for new analytics_meta hosts [puppet] - 10https://gerrit.wikimedia.org/r/974165 
(https://phabricator.wikimedia.org/T284150) [12:35:25] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:35:46] (03PS1) 10Hashar: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) [12:36:00] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974165 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:36:16] (03Merged) 10jenkins-bot: kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:36:28] (03CR) 10Hashar: [C: 04-1] "-1 since the link to the google form is a placeholder." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: 10Hashar) [12:36:55] (03CR) 10Hashar: [C: 03+2] Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [12:37:15] !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:37:28] (03Merged) 10jenkins-bot: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [12:37:51] !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:38:55] (03CR) 10Klausman: [C: 03+1] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [12:39:39] (03PS1) 10Btullis: Promote an-mariadb1001 to be the new primary for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974167 (https://phabricator.wikimedia.org/T284150) [12:41:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::analytics::backup [12:42:31] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:42:59] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:45:22] (03CR) 10Jbond: peopleweb: migrate role to puppet 7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973855 (owner: 10Dzahn) [12:46:00] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142) [12:46:13] !log hashar@deploy2002 Started deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 [12:46:13] (03PS1) 10Muehlenhoff: Switch mariadb::misc::analytics::backup to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974170 (https://phabricator.wikimedia.org/T349619) [12:46:17] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 (duration: 00m 04s) [12:47:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:48:21] !log hashar@deploy2002 Started deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 [12:48:28] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@a087269]: Plugin to process Puppet 
Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 (duration: 00m 07s) [12:48:35] (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::misc::analytics::backup to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974170 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:49:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53404 and previous config saved to /var/cache/conftool/dbconfig/20231114-124921-arnaudb.json [12:49:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:49:26] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:49:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:49:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53405 and previous config saved to /var/cache/conftool/dbconfig/20231114-124942-arnaudb.json [12:51:05] (03PS1) 10Kamila Součková: kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) [12:51:25] (03PS1) 10Btullis: WIP - Temporarily disable the production jobs that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974172 (https://phabricator.wikimedia.org/T284150) [12:51:27] (03PS1) 10Btullis: WIP Re-enable the production pipelines that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974173 (https://phabricator.wikimedia.org/T284150) [12:52:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53406 and previous config saved to 
/var/cache/conftool/dbconfig/20231114-125214-arnaudb.json [12:52:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::analytics::backup [12:55:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:55:09] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Marostegui) I think we are gong to need to tweak this a bit more: ` -rw-rw---- 1 mysql mysql 61G Nov 14 12:44 syslog.ibd ` 61GB is quite large for what this is, t... [12:55:50] (03PS1) 10Majavah: P:openstack: galera: fix firewall port [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061) [12:56:19] RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:39] (03PS2) 10Hnowlan: rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 [12:56:47] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [12:57:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/454/con" [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061) (owner: 10Majavah) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1300) [13:00:40] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans) [13:00:42] (03CR) 10Clément Goubert: [C: 03+1] rest-gateway: 
increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 (owner: 10Hnowlan) [13:02:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::mariadb [13:03:54] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: galera: fix firewall port [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061) (owner: 10Majavah) [13:04:56] (03PS1) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) [13:04:59] (03Merged) 10jenkins-bot: sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans) [13:05:16] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [13:05:27] (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:06:00] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [13:06:11] (03PS1) 10Muehlenhoff: Switch analytics_cluster::mariadb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974178 (https://phabricator.wikimedia.org/T349619) [13:07:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P53407 and previous config saved to /var/cache/conftool/dbconfig/20231114-130721-arnaudb.json [13:09:15] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::mariadb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974178 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:10:02] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet [13:10:09] (03CR) 
10Sergio Gimeno: [C: 03+1] IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [13:11:00] (03CR) 10Majavah: "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/973782?" [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond) [13:11:05] (03PS2) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) [13:12:01] (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:14:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:14:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::mariadb [13:17:07] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runners in codfw [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [13:17:23] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:17:58] (03PS4) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) [13:19:05] (03PS3) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) [13:19:06] !log jmm@cumin2002 START - Cookbook 
sre.puppet.migrate-role for role: releases [13:19:24] (CR) Elukey: [C: +2] changeprop: set num_workers to zero [deployment-charts] - https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: Elukey) [13:19:57] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet [13:20:00] (CR) CI reject: [V: -1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: Jbond) [13:20:02] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [13:20:21] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-cache2003.codfw.wmnet [13:21:24] (PS1) Muehlenhoff: Switch releases to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974179 (https://phabricator.wikimedia.org/T349619) [13:21:46] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (Volans) [13:22:06] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (Volans) In progress→Resolved This is now done. 
[13:22:20] (CR) Volans: [C: +1] "LGTM" [software/transferpy] - https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: Jcrespo) [13:22:24] (PS2) Btullis: Switch datahub to use the new an-mariadb servers instead of an-coord [deployment-charts] - https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150) [13:22:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P53408 and previous config saved to /var/cache/conftool/dbconfig/20231114-132227-arnaudb.json [13:22:44] (PS1) Jelto: Revert "gitlab_runner: unregister gitlab-runners in codfw" [puppet] - https://gerrit.wikimedia.org/r/974135 (https://phabricator.wikimedia.org/T344951) [13:22:52] (CR) Klausman: [C: +2] hiera: Migrate ml-cache2003.codfw.wmnet to Puppet v7 [puppet] - https://gerrit.wikimedia.org/r/974180 (https://phabricator.wikimedia.org/T349619) (owner: Klausman) [13:24:02] (CR) Muehlenhoff: [C: +2] Switch releases to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974179 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [13:24:28] (PS4) Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) [13:25:27] (CR) CI reject: [V: -1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: Jbond) [13:26:09] (CR) Jelto: [C: +2] Revert "gitlab_runner: unregister gitlab-runners in codfw" [puppet] - https://gerrit.wikimedia.org/r/974135 (https://phabricator.wikimedia.org/T344951) (owner: Jelto) [13:26:33] (PS5) Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - https://gerrit.wikimedia.org/r/974176 
(https://phabricator.wikimedia.org/T351094) [13:26:43] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-cache2003.codfw.wmnet [13:26:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:26:53] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:27:32] (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:28:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/459/console" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [13:29:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: releases [13:30:19] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:30:48] !log taavi@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcontrol2005-dev.codfw.wmnet [13:32:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:33:46] (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-cache2003.codfw.wmnet to Puppet v7 [puppet] - 
https://gerrit.wikimedia.org/r/974182 (https://phabricator.wikimedia.org/T349619) (owner: Klausman) [13:34:18] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-cache1003.eqiad.wmnet [13:37:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53409 and previous config saved to /var/cache/conftool/dbconfig/20231114-133734-arnaudb.json [13:37:36] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [13:37:39] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:37:50] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [13:37:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53410 and previous config saved to /var/cache/conftool/dbconfig/20231114-133755-arnaudb.json [13:38:21] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-cache1003.eqiad.wmnet [13:39:33] (PS6) Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) [13:40:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53411 and previous config saved to /var/cache/conftool/dbconfig/20231114-134028-arnaudb.json [13:41:28] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_cache::storage [13:42:11] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [13:42:26] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [13:42:59] 
(03Abandoned) 10Jbond: bird::anycast: move firewall rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond) [13:43:07] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [13:43:22] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [13:43:26] (03CR) 10Jbond: bird::anycast: move firewall rules to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond) [13:43:49] (03PS1) 10Muehlenhoff: Switch phab-test1001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) [13:43:56] (03CR) 10Majavah: [V: 03+1] P:bird::anycast: migrate to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [13:44:05] (03CR) 10Klausman: [C: 03+2] hiera: migrate ML cache/cassandara role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974183 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [13:44:15] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [13:45:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1138.eqiad.wmnet onto db1238.eqiad.wmnet [13:47:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [13:47:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/460/con" [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:48:25] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role 
(exit_code=0) for role: ml_cache::storage [13:50:43] (03CR) 10Majavah: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:51:13] (03CR) 10Brouberol: "I have reverted the recent changes on the subnet files, that I will get to in another CR. This one was getting out of hand." [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:51:23] (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [13:53:52] (03CR) 10Majavah: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973841 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:55:20] (03CR) 10Majavah: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973842 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:55:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P53412 and previous config saved to /var/cache/conftool/dbconfig/20231114-135534-arnaudb.json [13:55:43] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Cleanup of temporary overrides for Puppet v7 migration [puppet] - 10https://gerrit.wikimedia.org/r/974185 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [13:57:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman) [13:57:27] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS13030/IPv4: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:34] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/463/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:59:26] (03CR) 10Klausman: [C: 03+1] profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1400) [14:00:05] apergos: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] (03PS65) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:00:24] I can’t deploy, sorry [14:00:26] no it's not, I think I removed it from the calendar [14:00:38] I still see it there [14:00:50] ok seriously? 
every single thing I touch these days I do wrong [14:00:56] trying again to remove it [14:02:01] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/464/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:03:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 15%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53413 and previous config saved to /var/cache/conftool/dbconfig/20231114-140325-arnaudb.json [14:04:03] all right Lucas_WMDE it is now gone, I am sure of it [14:04:12] yay ^^ [14:04:14] nothing to deploy then [14:04:37] apergos: and i was thinking "why are we cancelling the window" :)) [14:04:46] someone mind me stealing it? [14:04:51] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1004.eqiad.wmnet with OS bullseye [14:05:02] I sure don't mind :-D [14:05:32] (03PS4) 10Urbanecm: IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) [14:05:40] (03CR) 10Urbanecm: [C: 03+2] IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:06:34] (03Merged) 10jenkins-bot: IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:06:39] (03PS1) 10Urbanecm: IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 (https://phabricator.wikimedia.org/T344695) [14:06:46] (03CR) 10Urbanecm: [C: 03+2] IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 
(https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:07:28] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) Thanks. I just need @kzimmerman to approve your access and then I can proceed. [14:08:48] (03PS1) 10Urbanecm: TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191 [14:08:58] (03CR) 10Urbanecm: [C: 03+2] TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191 (owner: 10Urbanecm) [14:09:20] (03PS1) 10Urbanecm: IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695) [14:10:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P53414 and previous config saved to /var/cache/conftool/dbconfig/20231114-141041-arnaudb.json [14:10:56] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:29] (03CR) 10Urbanecm: [C: 03+2] IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:15:43] (03PS66) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:16:50] (03CR) 10Jbond: "LGTM just some changes on the rspec" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:17:15] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/465/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:18:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 30%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53415 and previous config saved to /var/cache/conftool/dbconfig/20231114-141830-arnaudb.json [14:18:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi) [14:18:44] (03CR) 10Ssingh: [C: 03+1] Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [14:19:09] (03PS67) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:19:32] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF) [14:20:22] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF) [14:20:33] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF) a:05Jdforrester-WMF→03None [14:20:51] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet [14:20:55] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet 
[14:20:56] !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet [14:21:13] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/466/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:22:11] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [14:23:11] (03Merged) 10jenkins-bot: IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:23:58] (03CR) 10Brouberol: [V: 03+1] "Final diff for netboot.cfg: https://phabricator.wikimedia.org/P53293" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:24:40] (03Merged) 10jenkins-bot: TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191 (owner: 10Urbanecm) [14:24:41] (03Merged) 10jenkins-bot: IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:24:52] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage [14:25:14] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]] [14:25:47] urbanecm@deploy2002: Failed to log message to wiki. Somebody should check the error logs. 
[14:25:47] T344695: [IP Masking] Expire temporary accounts in 1 year - https://phabricator.wikimedia.org/T344695 [14:25:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53416 and previous config saved to /var/cache/conftool/dbconfig/20231114-142547-arnaudb.json [14:25:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [14:25:54] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:26:02] okay... [14:26:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [14:26:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53417 and previous config saved to /var/cache/conftool/dbconfig/20231114-142608-arnaudb.json [14:26:29] seems transient (and thank you, TheresNoTime, for https://bash.toolforge.org/quip/CGD9XYIBa_6PSCT9HbBu :D) [14:26:39] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:26:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1104.eqiad.wmnet [14:26:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1104.eqiad.wmnet [14:26:44] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:26:54] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:27:33] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1046.eqiad.wmnet with OS 
bookworm [14:28:03] (03CR) 10Jbond: [C: 03+2] openstack: update to use multiroot CA [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:28:32] !log swapped cp1104 <-> cp1079 (T349244) [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [14:28:55] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:56] (03CR) 10JMeybohm: "@Bking: This is what caused the diff you saw yesterday. Would you be so kind to rebase, merge and deploy?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking) [14:29:00] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:29:03] (03CR) 10Urbanecm: [C: 03+1] "should be ok to do any time, i ended up backporting the relevant code" [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:29:26] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:34] (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [14:30:05] !log 
bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [14:30:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:30:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet [14:30:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53418 and previous config saved to /var/cache/conftool/dbconfig/20231114-143021-arnaudb.json [14:30:38] (03CR) 10Jbond: [C: 03+2] toolforge: update to use trsuted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973841 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:30:41] (03CR) 10Jbond: [C: 03+2] wmcs::kubeadm: migrate to trusted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973842 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:31:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1105.eqiad.wmnet [14:31:18] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1105.eqiad.wmnet [14:31:41] (03CR) 10Volans: "Makes sense to me but I'll leave it to the experts ;)" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:32:17] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]] (duration: 07m 03s) [14:32:21] T344695: [IP Masking] Expire temporary accounts in 1 year - https://phabricator.wikimedia.org/T344695 
[14:32:29] !log swapped cp1105 <-> cp1080 (T349244) [14:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:14] (CR) Hnowlan: [C: +2] rest-gateway: increase resource limits [deployment-charts] - https://gerrit.wikimedia.org/r/972746 (owner: Hnowlan) [14:33:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) [14:33:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 45%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53420 and previous config saved to /var/cache/conftool/dbconfig/20231114-143335-arnaudb.json [14:34:03] (Merged) jenkins-bot: rest-gateway: increase resource limits [deployment-charts] - https://gerrit.wikimedia.org/r/972746 (owner: Hnowlan) [14:34:54] (CR) Kamila Součková: [C: +2] kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: Kamila Součková) [14:37:38] (Merged) jenkins-bot: kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: Kamila Součková) [14:38:29] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:38:38] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:38:55] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:43] (CR) Filippo Giunchedi: [C: +2] "Ack, thanks John" [puppet] - https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) (owner: Filippo 
Giunchedi) [14:41:45] (03PS68) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:42:26] (03CR) 10CI reject: [V: 04-1] Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:42:38] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1004.eqiad.wmnet with OS bullseye [14:42:46] (03PS1) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) [14:43:24] (03PS69) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:44:35] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:44:48] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:44:58] (03CR) 10Filippo Giunchedi: NTP: alert on ntp/time errors (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede) [14:45:18] (03PS1) 10Eevans: install_server: configure aqs1011 for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738) [14:45:27] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:45:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P53421 and previous config saved to /var/cache/conftool/dbconfig/20231114-144528-arnaudb.json [14:45:39] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:46:19] 10SRE, 10Infrastructure-Foundations, 10Observability-Logging, 
10Patch-For-Review, 10Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565 (10jbond) [14:46:47] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:46:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [14:46:58] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:47:34] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jbond) 05In progress→03Resolved a:03jbond volatile is now synced to all pupp... [14:48:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53423 and previous config saved to /var/cache/conftool/dbconfig/20231114-144840-arnaudb.json [14:48:55] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:04] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:49:26] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:29] (03CR) 10Herron: "this is great! 
please see a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:49:32] (03CR) 10Ssingh: [C: 03+1] "LGTM. We should be careful rolling this out even if it should be atomic as the nameservers are on bird as well. If you want someone to rol" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [14:50:08] (03CR) 10Filippo Giunchedi: prometheus-puppet-agent-stats: this timer sometime fails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:50:11] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::backups [14:50:11] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [14:51:32] (03CR) 10Btullis: [V: 03+1 C: 03+2] "The privacy team has given us the go-ahead for this change: https://phabricator.wikimedia.org/T349910#9325309" [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:51:45] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:52:13] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974202 (https://phabricator.wikimedia.org/T349619) [14:52:24] (03CR) 10Elukey: [V: 03+1] profile::pyrra::filesystem: add Lift Wing pilot (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:52:28] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:52:41] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[14:53:02] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974202 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:53:09] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:53:19] (03CR) 10Herron: [C: 03+1] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:53:55] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:17] (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing pilot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:55:24] (03CR) 10EoghanGaffney: [C: 03+1] Switch phab-test1001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:56:06] (03CR) 10EoghanGaffney: [C: 03+1] "It's possible we can decommission this host, but let's merge this for now and we'll work on clarifying what to do with it." 
[puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:56:16] (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing pilot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [14:57:18] (03PS1) 10Elukey: services: remove num_workers from cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974204 [14:57:59] (03PS70) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [14:58:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::backups [15:00:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P53425 and previous config saved to /var/cache/conftool/dbconfig/20231114-150034-arnaudb.json [15:02:46] (03CR) 10Elukey: [C: 03+2] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [15:03:40] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [15:03:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53426 and previous config saved to /var/cache/conftool/dbconfig/20231114-150345-arnaudb.json [15:05:34] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:10:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::analytics_replica [15:10:51] (03CR) 10Elukey: [C: 03+2] services: remove num_workers from cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974204 (owner: 10Elukey) [15:13:30] (03PS2) 10Giuseppe Lavagetto: mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) [15:13:39] (03PS1) 10Muehlenhoff: Switch mariadb::analytics_replica to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619) [15:13:49] (03PS2) 10Giuseppe Lavagetto: mw-api-int: double the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846) [15:15:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53427 and previous config saved to /var/cache/conftool/dbconfig/20231114-151541-arnaudb.json [15:15:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [15:15:46] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:15:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::analytics_replica to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:15:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [15:16:03] (03PS2) 10Muehlenhoff: Switch mariadb::analytics_replica to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619) [15:16:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1236 (T348183)', diff saved to https://phabricator.wikimedia.org/P53428 and previous config saved to /var/cache/conftool/dbconfig/20231114-151602-arnaudb.json [15:16:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003'] [15:16:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003'] [15:16:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003'] [15:17:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003'] [15:17:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003'] [15:17:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003'] [15:18:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044'] [15:18:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 90%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53430 and previous config saved to /var/cache/conftool/dbconfig/20231114-151850-arnaudb.json [15:19:57] (03PS1) 10Hnowlan: trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708) [15:20:17] (03CR) 10Muehlenhoff: [C: 03+2] Switch phab-test1001 to insetup::buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:20:24] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade 
firmware for hosts ['cloudvirt1043'] [15:21:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [15:22:50] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1046'] [15:22:56] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:23:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [15:23:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::analytics_replica [15:23:32] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:25:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044'] [15:26:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:26:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044'] [15:27:01] (03Merged) 10jenkins-bot: mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:28:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [15:28:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043'] [15:29:12] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:29:32] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: 
apply [15:29:44] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [15:30:23] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:30:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [15:32:07] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:32:18] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:33:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P53431 and previous config saved to /var/cache/conftool/dbconfig/20231114-153344-arnaudb.json [15:33:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53432 and previous config saved to /var/cache/conftool/dbconfig/20231114-153355-arnaudb.json [15:33:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host vrts1002.eqiad.wmnet [15:34:44] (03PS3) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) [15:34:54] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:55] (03CR) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974149 
(https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [15:35:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044'] [15:35:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044'] [15:36:48] (03CR) 10Bking: [C: 03+2] search-loader: remove references to search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) (owner: 10Bking) [15:37:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [15:37:28] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043'] [15:38:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044'] [15:38:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [15:38:35] (03PS3) 10Hnowlan: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) [15:39:08] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044'] [15:39:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1046'] [15:39:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043'] [15:39:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044'] [15:39:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [15:39:47] !log 
jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1046'] [15:39:59] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044'] [15:40:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043'] [15:40:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [15:40:29] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1046'] [15:40:31] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044'] [15:40:37] (03CR) 10Herron: [C: 03+1] "LGTM thanks for piloting this! 🚀" [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [15:41:19] (03CR) 10Hnowlan: [C: 03+2] api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [15:41:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm) [15:42:11] (03Merged) 10jenkins-bot: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [15:42:30] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:42] (03PS1) 10Muehlenhoff: Switch vrts1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974210 (https://phabricator.wikimedia.org/T349619) [15:43:44] 10SRE, 10Infrastructure-Foundations, 
10vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (10bking) 05Open→03Resolved a:03bking This is done...closing out ticket. [15:44:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch vrts1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974210 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:44:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) [15:46:49] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:47:05] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:48:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm [15:48:29] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [15:48:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P53433 and previous config saved to /var/cache/conftool/dbconfig/20231114-154850-arnaudb.json [15:49:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host vrts1002.eqiad.wmnet [15:49:39] (03CR) 10Eevans: [C: 03+2] install_server: configure aqs1011 for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [15:50:14] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:50:33] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:51:06] (03PS1) 10Muehlenhoff: Apply Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/974211 (https://phabricator.wikimedia.org/T346039) [15:51:09] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 
10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [15:53:07] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:53:23] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:53:55] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:59:53] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1044.eqiad.wmnet with OS bookworm [15:59:53] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm [16:00:05] eoghan, jelto, and arnoldokoth: gettimeofday() says it's time for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1600) [16:00:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm [16:00:36] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [16:01:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::serviceops_collab [16:02:38] (03PS1) 10Muehlenhoff: Switch insetup::serviceops_collab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974213 (https://phabricator.wikimedia.org/T349619) [16:03:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T348183)', diff saved to https://phabricator.wikimedia.org/P53434 and previous config saved to /var/cache/conftool/dbconfig/20231114-160356-arnaudb.json [16:03:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:04:02] T348183: Apply schema change for changing img_size, oi_size, 
us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:04:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:04:53] (03CR) 10Muehlenhoff: [C: 03+2] Switch insetup::serviceops_collab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974213 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:06:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [16:06:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [16:07:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:55] !log brennen@deploy2002 Started deploy [phabricator/deployment@0b76984]: test deploy to phab2002 for T350876 [16:08:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [16:08:59] T350876: Deploy Phabricator/Phorge 2023-11-14 - https://phabricator.wikimedia.org/T350876 [16:09:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [16:09:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:09:27] !log brennen@deploy2002 Finished deploy [phabricator/deployment@0b76984]: test deploy to phab2002 for T350876 (duration: 00m 32s) [16:09:53] (03PS1) 10Elukey: profile::pyrra::filesystem: remove grouping for lift wing [puppet] - 
10https://gerrit.wikimedia.org/r/974214 [16:09:56] !log brennen@deploy2002 Started deploy [phabricator/deployment@0b76984]: deploy to phab1004 for T350876 [16:09:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:10:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:11:00] !log brennen@deploy2002 Finished deploy [phabricator/deployment@0b76984]: deploy to phab1004 for T350876 (duration: 01m 04s) [16:11:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::serviceops_collab [16:11:37] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:11:51] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:11:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53435 and previous config saved to /var/cache/conftool/dbconfig/20231114-161157-arnaudb.json [16:12:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:12:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Jhancock.wm) [16:13:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm) [16:13:10] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Issues which should be fixed by puppet7 upgrade - https://phabricator.wikimedia.org/T351104 (10jbond) p:05Triage→03Medium [16:13:18] 10SRE, 
10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) p:05Triage→03High [16:14:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [16:14:23] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [16:15:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm) [16:16:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53436 and previous config saved to /var/cache/conftool/dbconfig/20231114-161617-arnaudb.json [16:16:34] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Jhancock.wm) [16:17:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10Jhancock.wm) [16:17:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [16:19:09] (03PS1) 10MVernon: swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) [16:21:54] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [16:25:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (Not sure if you're aware but I added a script to the puppetdb hosts to check whether a server is compatible with nftable" [puppet] - 10https://gerrit.wikimedia.org/r/973785 
(https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [16:25:20] 10SRE, 10Infrastructure-Foundations, 10Maps, 10Puppet-Infrastructure, and 2 others: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) 05Open→03Resolved a:03jbond going to close this as i think its resolved but please reopen if not [16:26:27] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1002.eqiad.wmnet [16:26:38] (03CR) 10Majavah: [C: 03+2] hieradata: migrate codfw cloudlb to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah) [16:28:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [16:29:49] (03CR) 10Majavah: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [16:30:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) We received the servers and need racking details please. @Clement_Goubert or @Joe Thank you! 
[16:30:21] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1002.eqiad.wmnet [16:30:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:31:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P53437 and previous config saved to /var/cache/conftool/dbconfig/20231114-163123-arnaudb.json [16:34:45] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@017fbf1]: search: clean wcqs revision map [16:35:14] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@017fbf1]: search: clean wcqs revision map (duration: 00m 29s) [16:35:37] thanks! ^ [16:37:18] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [16:42:00] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Ladsgroup) FWIW, the rows are almost all like this: ` +-----------+----------+----------+-------+------------+---------------------+---------+-----------------------... 
[16:44:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bookworm [16:46:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P53438 and previous config saved to /var/cache/conftool/dbconfig/20231114-164630-arnaudb.json [16:47:43] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [16:47:56] (03PS1) 10Ladsgroup: beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237) [16:47:56] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [16:50:02] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@0ae1184]: make cirrus index imports world readable in hdfs [16:50:30] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@0ae1184]: make cirrus index imports world readable in hdfs (duration: 00m 28s) [16:53:40] (03CR) 10Andrew Bogott: [C: 03+1] "seems to help!" [puppet] - 10https://gerrit.wikimedia.org/r/973847 (https://phabricator.wikimedia.org/T349695) (owner: 10FNegri) [16:55:10] (03Abandoned) 10Elukey: profile::pyrra::filesystem: remove grouping for lift wing [puppet] - 10https://gerrit.wikimedia.org/r/974214 (owner: 10Elukey) [16:58:20] (03CR) 10Fabfur: [C: 03+1] trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [17:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1700). [17:00:05] urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:12] here! [17:00:29] urbanecm: give me a sec [17:00:33] sure [17:01:21] (03CR) 10Hnowlan: [C: 03+2] trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan) [17:01:33] (03CR) 10Jbond: [C: 03+2] mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [17:01:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53440 and previous config saved to /var/cache/conftool/dbconfig/20231114-170136-arnaudb.json [17:01:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [17:01:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [17:01:56] urbanecm: do you want me to deploy it anywhere specific so you can test? [17:01:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53441 and previous config saved to /var/cache/conftool/dbconfig/20231114-170158-arnaudb.json [17:01:59] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:02:08] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [17:02:29] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [17:02:38] jbond: it'd run tomorrow anyway, so i don't think that's needed :). let's wait for puppet. [17:02:52] urbanecm: ack sgtm, then all donw [17:02:54] urbanecm: ack sgtm, then all done [17:02:58] thanks [17:03:01] np [17:03:25] jbond: did your merge happen to pick up my changes to profile::trafficserver? 
[17:04:00] hnowlan: yes i just noticed it was your cr not mine i merged [17:04:03] sorry about that [17:04:09] no worries, was just about to merge it [17:04:15] (03PS1) 10Majavah: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 [17:04:15] ok cool :) [17:04:56] (03CR) 10Andrew Bogott: [C: 03+1] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah) [17:05:09] (03CR) 10Majavah: [C: 03+2] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah) [17:06:04] (03CR) 10CI reject: [V: 04-1] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah) [17:06:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53442 and previous config saved to /var/cache/conftool/dbconfig/20231114-170621-arnaudb.json [17:09:21] (03PS2) 10Majavah: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 [17:09:35] (03CR) 10Majavah: [C: 03+2] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah) [17:11:24] (03Merged) 10jenkins-bot: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah) [17:12:01] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1046.eqiad.wmnet with OS bookworm [17:12:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [17:14:05] jbond: actually... i tried `run-puppet-agent` at `deployment-mwmaint02` (as beta's where we'd like to QA the job first), and i don't see the timer added there. is that beta being broken, or puppet code not done correctly? 
[17:14:47] urbanecm: just about to jump on a call, will check in ~30 mins if that's ok [17:14:53] absolutely. [17:16:03] (03CR) 10Krinkle: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [17:16:34] (03CR) 10Krinkle: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [17:18:09] urbanecm: why is that timer running once for every wiki? are centralauth temporary accounts not global? [17:19:37] it's complicated... At least for a transitionary period they need to be local as wikis don't want surprises [17:20:07] taavi: original reason was that a temp account doesn't need to exist everywhere. but...not sure if we actually need to run it everywhere. [17:20:19] Amir1: i think technically, they'd be in `globaluser` no matter what? [17:20:49] yeah, they'll be but they don't exist in every wiki [17:20:58] even if they visit them [17:21:15] but indeed it doesn't need to be on every wiki [17:21:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P53444 and previous config saved to /var/cache/conftool/dbconfig/20231114-172127-arnaudb.json [17:21:48] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1043.eqiad.wmnet with OS bookworm [17:23:25] taavi: it calls `AuthManager::revokeAccessForUser( UserIdentity $tempAcc )`, and i don't think i can construct a user identity for a temp account that doesn't exist locally. so, i think it needs to run everywhere. [17:24:40] urbanecm: ok, should expireTemporaryAccounts.php have some filters to only process accounts attached to that wiki in that case? 
[17:25:15] possibly yes [17:27:28] (03CR) 10FNegri: [C: 03+2] [toolsdb] Lower innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/973847 (https://phabricator.wikimedia.org/T349695) (owner: 10FNegri) [17:27:32] the job seems to have made it to beta in the meantime, you probably ran puppet before the git-sync-upstream timer had run on deployment-puppetmaster [17:27:58] gotcha [17:29:21] can I deploy a patch? [17:31:12] no objection from me [17:32:55] coooolio [17:33:06] (03CR) 10Ladsgroup: [C: 03+2] beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [17:33:49] (03Merged) 10jenkins-bot: beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [17:36:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P53445 and previous config saved to /var/cache/conftool/dbconfig/20231114-173634-arnaudb.json [17:42:21] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:43:01] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::control [17:43:59] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10hnowlan) [17:45:11] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [17:45:47] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::control: migrate to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974224 [17:46:20] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::control: migrate to puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/974224 (owner: 10Jbond) [17:47:06] 10SRE, 10Phabricator maintenance bot, 10collaboration-services, 10Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (10Aklapper) [17:48:56] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10hnowlan) [17:51:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53446 and previous config saved to /var/cache/conftool/dbconfig/20231114-175140-arnaudb.json [17:51:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:51:45] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:51:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [17:52:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53447 and previous config saved to /var/cache/conftool/dbconfig/20231114-175202-arnaudb.json [17:53:59] (03PS1) 10Jbond: Revert "wmcs::openstack::codfw1dev::control: migrate to puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/974226 [17:54:11] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: wmcs::openstack::codfw1dev::control [17:54:20] (03CR) 10Jbond: [C: 03+2] Revert "wmcs::openstack::codfw1dev::control: migrate to puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/974226 (owner: 10Jbond) [17:55:23] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [17:55:24] 10SRE, 10SRE-tools, 
10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [17:56:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53448 and previous config saved to /var/cache/conftool/dbconfig/20231114-175623-arnaudb.json [17:59:19] (03CR) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [17:59:22] (03CR) 10Jbond: [C: 03+2] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1800) [18:01:24] splunk told me that I'm not oncall anymore, going to party! 
Nothing to report [18:04:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bookworm [18:06:23] (03PS7) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) [18:06:25] (03PS4) 10Majavah: Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) [18:06:27] (03PS4) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [18:06:30] (03PS10) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [18:09:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [18:11:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bookworm [18:11:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P53449 and previous config saved to /var/cache/conftool/dbconfig/20231114-181130-arnaudb.json [18:14:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/468/con" [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [18:19:54] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage [18:22:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt1047.eqiad.wmnet with reason: host reimage [18:23:00] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10Ladsgroup) The biggest problem for that is the reorgs, a lot of teams we set to own something might not exist in a couple of years, generally I think it's better to keep at the discretion of the DBA wh... [18:26:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P53450 and previous config saved to /var/cache/conftool/dbconfig/20231114-182636-arnaudb.json [18:27:50] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [18:32:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [18:33:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1046.eqiad.wmnet with OS bookworm [18:36:31] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1011.eqiad.wmnet with OS bullseye [18:41:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53451 and previous config saved to /var/cache/conftool/dbconfig/20231114-184142-arnaudb.json [18:41:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:41:48] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:41:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [18:42:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53452 and 
previous config saved to /var/cache/conftool/dbconfig/20231114-184204-arnaudb.json [18:45:53] RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53453 and previous config saved to /var/cache/conftool/dbconfig/20231114-184637-arnaudb.json [18:50:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:50:51] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: host reimage [18:50:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1047.eqiad.wmnet with OS bookworm [18:53:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: host reimage [18:53:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bookworm [18:53:55] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:54:51] PROBLEM - ensure kvm processes are running on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:09] (PHPFPMTooBusy) resolved: Not enough 
idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:56:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bookworm [18:57:33] RECOVERY - ensure kvm processes are running on cloudvirt1048 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:19] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: stewards [19:00:05] jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1900). [19:01:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P53454 and previous config saved to /var/cache/conftool/dbconfig/20231114-190143-arnaudb.json [19:04:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: stewards [19:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:09:31] The train is rolling [19:09:45] despite lack of bot messages [19:12:43] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [19:14:13] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: 
group0 wikis to 1.42.0-wmf.5 refs T350081 [19:14:18] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [19:15:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [19:16:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1011.eqiad.wmnet with OS bullseye [19:16:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P53455 and previous config saved to /var/cache/conftool/dbconfig/20231114-191649-arnaudb.json [19:18:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bookworm [19:22:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on moscovium.eqiad.wmnet with reason: maintenance [19:22:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moscovium.eqiad.wmnet with reason: maintenance [19:25:07] !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe]: Regular analytics weekly train [analytics/refinery@2f94afe0] [19:31:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53456 and previous config saved to /var/cache/conftool/dbconfig/20231114-193156-arnaudb.json [19:31:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:32:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:32:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [19:32:12] !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe]: Regular analytics weekly train 
[analytics/refinery@2f94afe0] (duration: 07m 04s) [19:32:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53457 and previous config saved to /var/cache/conftool/dbconfig/20231114-193217-arnaudb.json [19:33:45] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [19:34:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [19:35:48] !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe] (thin): Regular analytics weekly train THIN [analytics/refinery@2f94afe0] [19:35:54] !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe] (thin): Regular analytics weekly train THIN [analytics/refinery@2f94afe0] (duration: 00m 06s) [19:36:03] !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f94afe0] [19:36:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53458 and previous config saved to /var/cache/conftool/dbconfig/20231114-193635-arnaudb.json [19:36:53] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [19:39:18] !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f94afe0] (duration: 03m 14s) [19:40:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bookworm [19:41:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS bookworm [19:51:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2150', diff saved to https://phabricator.wikimedia.org/P53459 and previous config saved to /var/cache/conftool/dbconfig/20231114-195141-arnaudb.json [19:52:12] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etherpad [19:53:55] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:56:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Dzahn) [19:57:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Dzahn) stewards: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973863 peopleweb: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973855 etherp... [19:57:08] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [19:57:10] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:57:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etherpad [19:59:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bookworm [20:01:06] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage [20:02:11] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host 
for host doc2002.codfw.wmnet [20:03:51] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043'] [20:04:42] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043'] [20:06:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P53460 and previous config saved to /var/cache/conftool/dbconfig/20231114-200648-arnaudb.json [20:07:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host doc2002.codfw.wmnet [20:09:23] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye [20:11:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bookworm [20:17:05] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: doc [20:19:26] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:01] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:21:11] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:21:38] PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:38] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: doc [20:21:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53461 and previous config saved to /var/cache/conftool/dbconfig/20231114-202154-arnaudb.json [20:21:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:22:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:22:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:22:12] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:22:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:22:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53462 and previous config saved to /var/cache/conftool/dbconfig/20231114-202232-arnaudb.json [20:24:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on people1004.eqiad.wmnet with reason: maintenance [20:24:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people1004.eqiad.wmnet with reason: maintenance [20:24:48] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bookworm [20:25:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on 
people2003.codfw.wmnet with reason: maintenance [20:25:40] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [20:25:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people2003.codfw.wmnet with reason: maintenance [20:25:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bookworm [20:26:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53463 and previous config saved to /var/cache/conftool/dbconfig/20231114-202650-arnaudb.json [20:27:06] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [20:29:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on doc2002.codfw.wmnet with reason: maintenance [20:30:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc2002.codfw.wmnet with reason: maintenance [20:30:27] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bookworm [20:31:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on doc1003.eqiad.wmnet with reason: maintenance [20:31:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc1003.eqiad.wmnet with reason: maintenance [20:32:03] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1043.eqiad.wmnet with OS bullseye [20:33:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619 (10Dzahn) [20:33:32] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye [20:39:33] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [20:40:14] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:22] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:33] !log doc2002 - systemctl start rsync-doc-host-data-sync - failed unit after maintenance reboot [20:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P53464 and previous config saved to /var/cache/conftool/dbconfig/20231114-204156-arnaudb.json [20:42:09] !log destroying phab-test1001.eqiad.wmnet - T351115 [20:42:09] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage [20:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:13] T351115: decom phab-test1001 - https://phabricator.wikimedia.org/T351115 [20:43:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts phab-test1001.eqiad.wmnet [20:44:08] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [20:44:51] (03PS1) 10Dzahn: site/hiera: remove decom'ed phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/974266 (https://phabricator.wikimedia.org/T351115) [20:46:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 
on cloudvirt1043.eqiad.wmnet with reason: host reimage [20:47:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [20:47:30] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:47:34] (03CR) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [20:49:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [20:49:50] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab-test1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:51:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab-test1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001" [20:51:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:51:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab-test1001.eqiad.wmnet [20:51:36] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:10] (03CR) 10Dzahn: [C: 03+2] site/hiera: remove decom'ed phab-test1001 [puppet] - 10https://gerrit.wikimedia.org/r/974266 (https://phabricator.wikimedia.org/T351115) (owner: 10Dzahn) [20:54:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bookworm [20:55:40] (03CR) 10Dzahn: "thanks for merging this after manager approval :)" [puppet] - 
10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [20:55:54] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10ATsay-WMF) I approve this as Grace's manager. Thanks! [20:56:48] 10SRE, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (10Peachey88) {T279023} [20:57:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P53465 and previous config saved to /var/cache/conftool/dbconfig/20231114-205703-arnaudb.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T2100) [21:00:04] Kizule, danisztls, ebernhardson, and jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:40] \o [21:00:42] o/ [21:00:52] o/ [21:01:27] (03PS2) 10Jdrewniak: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) [21:03:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bookworm [21:07:35] so, who's running the deploy window?
[21:09:20] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bookworm [21:09:26] (03CR) 10Muehlenhoff: [C: 03+1] install_server: configure reuse for all aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/974259 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [21:09:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bookworm [21:11:05] ebernhardson: i prefer not to, but since there's no one else, let's do that [21:11:36] urbanecm: I appreciate it, thanks [21:11:39] (03CR) 10Urbanecm: [C: 03+2] [Zebra] Remove underline from pages with blank title [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974227 (https://phabricator.wikimedia.org/T351119) (owner: 10Jdrewniak) [21:12:01] jan_drewniak: can you advise whether deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/974264/ before the backport would be a good or bad idea? 
[21:12:07] (i assume bad, since they seem to be touching the same area) [21:12:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53466 and previous config saved to /var/cache/conftool/dbconfig/20231114-211209-arnaudb.json [21:12:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bullseye [21:12:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:12:16] (03CR) 10Eevans: [C: 03+2] install_server: configure reuse for all aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/974259 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [21:12:21] (03PS2) 10Urbanecm: Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:12:25] (03CR) 10Urbanecm: [C: 03+2] Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:12:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:12:27] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:12:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53467 and previous config saved to /var/cache/conftool/dbconfig/20231114-211231-arnaudb.json [21:12:52] urbanecm: It'd be better if the vector patch goes first, then the config [21:12:59] okay, noted.
[21:13:09] (03Merged) 10jenkins-bot: Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:13:16] (03PS2) 10Urbanecm: throttle.php: Cleanup old rules, add new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: 10Zoranzoki21) [21:13:19] (03CR) 10Urbanecm: [C: 03+2] throttle.php: Cleanup old rules, add new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: 10Zoranzoki21) [21:13:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: 10Zoranzoki21) [21:14:04] (03Merged) 10jenkins-bot: throttle.php: Cleanup old rules, add new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: 10Zoranzoki21) [21:14:24] (03CR) 10Urbanecm: [C: 03+2] PageRerenderSerializer: Match stream name with conventions [extensions/CirrusSearch] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974228 (owner: 10Ebernhardson) [21:14:29] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], [[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]] [21:14:45] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:14:45] T351002: Lift IP cap on 2023-11-23 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T351002 [21:15:34] urbanecm: regarding mine, there's nothing to test as it just increases coverage [21:15:47] ack [21:15:50] !log urbanecm@deploy2002 dani and urbanecm and zoranzoki21: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], 
[[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:52] !log urbanecm@deploy2002 dani and urbanecm and zoranzoki21: Continuing with sync [21:15:55] proceeding then [21:16:20] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm [21:17:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53468 and previous config saved to /var/cache/conftool/dbconfig/20231114-211700-arnaudb.json [21:21:18] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], [[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]] (duration: 06m 49s) [21:21:24] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:21:25] T351002: Lift IP cap on 2023-11-23 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T351002 [21:21:35] danisztls: should be live [21:23:16] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [21:23:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm [21:25:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [21:26:48] (03CR) 10Dzahn: "This VM has now been deleted" [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [21:28:33] (03Merged) 10jenkins-bot: [Zebra] Remove underline from pages with blank title [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974227 (https://phabricator.wikimedia.org/T351119) (owner: 10Jdrewniak) [21:29:01] (03CR) 10Urbanecm: [C: 
03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142) (owner: 10Sergio Gimeno) [21:29:47] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]] [21:29:52] T351119: Zebra - Pages with blank titles shouldn't have underlines - https://phabricator.wikimedia.org/T351119 [21:30:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [21:31:11] !log urbanecm@deploy2002 urbanecm and jdrewniak: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:31:29] jan_drewniak: can you test the backport please? [21:32:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P53469 and previous config saved to /var/cache/conftool/dbconfig/20231114-213207-arnaudb.json [21:32:08] Wybór Łysek i na sposób i bólu niż sekrecie sposób wiele innego liczba Chonan [21:32:23] (03Merged) 10jenkins-bot: PageRerenderSerializer: Match stream name with conventions [extensions/CirrusSearch] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974228 (owner: 10Ebernhardson) [21:32:28] Wow autocorrect dictation in Polish... 
[21:32:43] :D [21:33:11] Let me tell you that the bride has no secrets for the bride as soon as possible Chonan, according to translator [21:33:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage [21:34:08] urbanecm: in other words, patch looks good to sync :P [21:34:19] * urbanecm adds that to my dictionary [21:34:21] !log urbanecm@deploy2002 urbanecm and jdrewniak: Continuing with sync [21:35:07] urbanecm: mine isn't testable, it changes a string which is only used in jobs related to page updates [21:35:13] ack [21:35:18] will deploy once it merges [21:35:28] oh, it merged [21:35:43] (03PS3) 10Urbanecm: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak) [21:35:50] (03CR) 10Urbanecm: [C: 03+2] [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak) [21:36:39] (03Merged) 10jenkins-bot: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak) [21:39:47] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]] (duration: 09m 59s) [21:39:52] T351119: Zebra - Pages with blank titles shouldn't have underlines - https://phabricator.wikimedia.org/T351119 [21:40:42] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]] [21:40:46] T347711: [Zebra] Enable refactored Zebra on certain wikis for testing purposes - https://phabricator.wikimedia.org/T347711 [21:42:17] !log urbanecm@deploy2002 urbanecm and jdrewniak
and ebernhardson: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:42:25] jan_drewniak: can you test please? :) [21:42:45] urbanecm: yup, I see it, looks good to sync :) [21:42:51] good, syncing [21:42:53] !log urbanecm@deploy2002 urbanecm and jdrewniak and ebernhardson: Continuing with sync [21:46:15] (03PS1) 10Fabfur: haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) [21:47:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P53470 and previous config saved to /var/cache/conftool/dbconfig/20231114-214713-arnaudb.json [21:48:18] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]] (duration: 07m 36s) [21:48:23] T347711: [Zebra] Enable refactored Zebra on certain wikis for testing purposes - https://phabricator.wikimedia.org/T347711 [21:49:10] should be all done! 
:) [21:49:41] urbanecm: thanks again [21:49:53] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 10 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [21:49:55] np [21:52:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS bookworm [22:00:23] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) @Dwisehaupt I think we have all data now except the hostname, see my earlier comment. crm1001 or something else? [22:00:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bookworm [22:02:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53471 and previous config saved to /var/cache/conftool/dbconfig/20231114-220220-arnaudb.json [22:02:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [22:02:26] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:02:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [22:02:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53472 and previous config saved to /var/cache/conftool/dbconfig/20231114-220241-arnaudb.json [22:05:38] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1046.eqiad.wmnet with OS bookworm 
[22:07:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53473 and previous config saved to /var/cache/conftool/dbconfig/20231114-220717-arnaudb.json [22:07:21] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [22:19:30] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye [22:22:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P53474 and previous config saved to /var/cache/conftool/dbconfig/20231114-222224-arnaudb.json [22:23:24] (03PS1) 10Eevans: install_server: actually use the aqs reuse config (breakfix) [puppet] - 10https://gerrit.wikimedia.org/r/974274 (https://phabricator.wikimedia.org/T347738) [22:24:07] (03CR) 10Eevans: [C: 03+2] install_server: actually use the aqs reuse config (breakfix) [puppet] - 10https://gerrit.wikimedia.org/r/974274 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [22:30:48] (03PS7) 10Krinkle: Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:30:52] (03CR) 10Krinkle: [C: 03+1] Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:31:10] (03CR) 10Krinkle: [C: 03+1] Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:32:43] (03PS8) 10Krinkle: Enable $wgStatsTarget for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 
(https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [22:33:22] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [22:37:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P53476 and previous config saved to /var/cache/conftool/dbconfig/20231114-223730-arnaudb.json [22:47:31] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/973213/471/" [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [22:50:27] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [22:52:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53477 and previous config saved to /var/cache/conftool/dbconfig/20231114-225236-arnaudb.json [22:52:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:52:41] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:52:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:52:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T348183)', diff saved to https://phabricator.wikimedia.org/P53478 and previous config saved to /var/cache/conftool/dbconfig/20231114-225258-arnaudb.json [22:53:08] 10SRE, 10Data Pipelines, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac) Realizing I never linked any code for this in case folks wanted to work with the data but 
here's an example where I'm trying to grab both sources:... [22:53:55] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:00] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10Dwisehaupt) Sorry, I forgot to respond to that. crm1001 is good. [22:56:53] PROBLEM - Disk space on druid1009 is CRITICAL: DISK CRITICAL - free space: /srv 47486 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops [22:57:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:58:19] PROBLEM - Disk space on druid1011 is CRITICAL: DISK CRITICAL - free space: /srv 51183 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops [23:01:11] PROBLEM - Disk space on druid1010 is CRITICAL: DISK CRITICAL - free space: /srv 48820 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops [23:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:08:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:11:17] (03PS1) 10Dzahn: phabricator::main: add support for PHP versions other than 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068) [23:12:04] (03PS1) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [23:14:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [23:15:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [23:17:10] (03PS1) 10JHathaway: puppetserver: change ssldir to a concat fragment [puppet] - 10https://gerrit.wikimedia.org/r/974282 [23:17:12] (03PS1) 10JHathaway: puppetserver: cache code [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) [23:17:44] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/974282 (owner: 10JHathaway) [23:17:55] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [23:20:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T348183)', diff saved to https://phabricator.wikimedia.org/P53479 and previous config saved to /var/cache/conftool/dbconfig/20231114-232026-arnaudb.json [23:20:33] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size 
to BIGINT - https://phabricator.wikimedia.org/T348183 [23:21:27] (03CR) 10Cwhite: "Thank you for having a look and for the clarification!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [23:23:43] (03PS2) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [23:23:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [23:26:30] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye [23:28:37] (03PS1) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [23:29:09] (03CR) 10CI reject: [V: 04-1] wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [23:33:07] (03PS2) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [23:33:43] PROBLEM - ensure kvm processes are running on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:34:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/974280/472/" [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [23:35:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P53480 and previous config saved to /var/cache/conftool/dbconfig/20231114-233532-arnaudb.json [23:37:17] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [23:37:49] 
RECOVERY - ensure kvm processes are running on cloudvirt1043 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:38:07] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop in prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [23:47:43] (03PS1) 10Dzahn: php: add templates to support php8.2 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/974286 (https://phabricator.wikimedia.org/T327068) [23:50:39] (03PS3) 10MVernon: swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) [23:50:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P53481 and previous config saved to /var/cache/conftool/dbconfig/20231114-235039-arnaudb.json [23:51:40] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [23:53:55] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure