[00:16:15] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [00:29:52] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9882730 (10bd808) >>! In T366763#9866184, @thcipriani wrote: > - Should probably use [[ https://gerrit.wikimedia.org/r/Documentation/config-robot-comments.html | robot... [00:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:39:25] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9882743 (10bd808) `lang=irc [23:50] < bd808> Feedback on what style of comment schedule-deployment should leave on a Gerrit change to signify that the change is sch... 
[00:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [01:57:08] 10wikitech.wikimedia.org, 10Wikimedia-Site-requests: Remove namespace 666 from Wikitech - https://phabricator.wikimedia.org/T367254 (10Bugreporter) 03NEW [02:03:18] 10wikitech.wikimedia.org, 10Wikimedia-Site-requests: Remove namespace 666 from Wikitech - https://phabricator.wikimedia.org/T367254#9882860 (10Bugreporter) [02:37:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:42:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:47:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [02:52:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:15:56] FIRING: SystemdUnitDown: The service unit 
opentofu-infra-diff.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:16:15] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [05:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [05:10:56] FIRING: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:11:01] 06cloud-services-team: SystemdUnitDown Unit opentofu-infra-diff.service on node cloudcontrol1007 has been down for long. 
- https://phabricator.wikimedia.org/T367263 (10phaultfinder) 03NEW [05:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [05:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [05:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [07:01:59] 10Cloud-VPS (Quota-requests): Add one floating ip to webperformancetest - https://phabricator.wikimedia.org/T367266 (10Peter) 03NEW [07:12:19] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883216 (10dcaro) Timing out waiting for the dns entry: ` root@cloudcontrol1006:~# echo "$(sudo journalctl -n 5000 -u nova-... [07:16:49] 06cloud-services-team: SystemdUnitDown Unit opentofu-infra-diff.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T367263#9883228 (10dcaro) It's complaining that the new flavors have not been applied: ` root@cloudcontrol1007:/srv/tofu-infra# /usr/local/bin/tofu plan -d... 
[07:19:25] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883233 (10dcaro) there's many more errors around though, it seems it's also reaching the limit of VMs, will cleanup to let... [07:25:24] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883237 (10dcaro) Seems related to T309929 [07:30:54] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883242 (10dcaro) Designate now is running on cloudcontrols though (updated the docs). It seems that there it's having issue... [07:34:16] (03update) 10sstefanova: [envvars-cli] remove unused code [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/41 (owner: 10raymond-ndibe) [07:37:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [07:41:48] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883274 (10dcaro) For now I have moved the known hosts key to start with a fresh new, and restarted th... 
[07:42:42] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883275 (10dcaro) it timed out again though :/, looking ` Jun 12 07:41:40 cloudcontrol1006 nova-fulls... [07:44:42] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883277 (10dcaro) I do find the log now on the designate-sink side: ` Jun 12 07:40:36 cloudcontrol1005... [07:47:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [07:49:44] (03open) 10sstefanova: d/changelog: bump to 0.0.16 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/75 (https://phabricator.wikimedia.org/T363808) [07:50:09] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883280 (10dcaro) There are errors on all the designate-mdns services (one for each cloudcontrol), two... [07:51:41] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883282 (10dcaro) I did it one-by-one just in case: ` root@cumin1002:~# cumin cloudcontrol1007* 'syste... 
[08:04:37] (03update) 10dcaro: api: auth and proxy requests to the backend APIs [repos/cloud/toolforge/api-gateway] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/23 (https://phabricator.wikimedia.org/T363983) [08:09:10] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883302 (10dcaro) Looking into the pdns database, the last fullstack record is: ` mysql:root@localhos... [08:10:56] RESOLVED: SystemdUnitDown: The service unit opentofu-infra-diff.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:13:56] 06cloud-services-team: wmcs-dns-floating-ip-updater: failing to find project webperformancetest - https://phabricator.wikimedia.org/T367268 (10dcaro) 03NEW [08:15:26] RESOLVED: SystemdUnitDown: The systemd unit opentofu-infra-diff.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:16:01] 06cloud-services-team: wmcs-dns-floating-ip-updater: failing to find project webperformancetest - https://phabricator.wikimedia.org/T367268#9883325 (10taavi) a:03taavi This seems like a project_id/project_name mismatch. I'll have a look. 
[08:16:25] (03update) 10sstefanova: d/changelog: bump to 0.0.16 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/75 (https://phabricator.wikimedia.org/T363808) [08:16:47] (03update) 10sstefanova: d/changelog: bump to 0.0.16 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/75 (https://phabricator.wikimedia.org/T363808) [08:16:51] (03approved) 10sstefanova: d/changelog: bump to 0.0.16 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/75 (https://phabricator.wikimedia.org/T363808) [08:16:57] (03merge) 10sstefanova: d/changelog: bump to 0.0.16 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/75 (https://phabricator.wikimedia.org/T363808) [08:17:19] 06cloud-services-team: wmcs-dns-floating-ip-updater: failing to find project webperformancetest - https://phabricator.wikimedia.org/T367268#9883330 (10dcaro) yep, the project exists: ` root@cloudcontrol1006:~# openstack project list | grep -i webperformance | 933ad3ff1e264aada56e6bc3ed9e08f3 | webpe... [08:24:48] 10Toolforge (Toolforge iteration 11): [envvars-api, envvars-cli] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363809#9883338 (10Slst2020) [08:46:02] (03open) 10sstefanova: api: remove unprefixed endpoints [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/33 [08:49:48] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883394 (10dcaro) Started a new run tailing logs on cloudcontrols, but the only logs I see are: ` Jun... 
[08:56:39] 06cloud-services-team, 13Patch-For-Review: wmcs-dns-floating-ip-updater: failing to find project webperformancetest - https://phabricator.wikimedia.org/T367268#9883398 (10taavi) 05Open→03Resolved [08:56:54] (03update) 10sstefanova: api: remove unprefixed endpoints [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/33 [09:07:08] 06cloud-services-team, 10Toolforge: Update maintain_kubeusers to use the toolstate database - https://phabricator.wikimedia.org/T334629#9883435 (10taavi) a:05taavi→03None [09:11:12] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883456 (10dcaro) Okok, checking that the zone transfer from pdns does not actually have the fullstac... [09:23:18] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883472 (10dcaro) So the domain to transfer is `eqiad1.wmcloud.org` that's owned by the cloudinfra pro... 
[09:24:45] 10Toolforge (Toolforge iteration 11), 13Patch-For-Review: [builds-api, builds-cli] Prefix all endpoints with `/tool/` - https://phabricator.wikimedia.org/T363808#9883473 (10Slst2020) [09:26:02] (03open) 10sstefanova: api: remove unprefixed endpoints [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/97 [09:27:41] (03CR) 10Majavah: [C:03+2] toolforge: k8s: reboot: Add nfs only option [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1041696 (owner: 10Majavah) [09:28:32] 10Toolforge (Toolforge iteration 11), 13Patch-For-Review: [jobs-api] move jobs load feature to the backend - https://phabricator.wikimedia.org/T366209#9883478 (10Slst2020) 05Open→03In progress [09:30:47] (03Merged) 10jenkins-bot: toolforge: k8s: reboot: Add nfs only option [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1041696 (owner: 10Majavah) [09:30:53] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883488 (10dcaro) I was able to force pdns to re-retrieve the zone: ` root@cloudservices1006:/# pdns_c... [09:31:24] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9883491 (10taavi) [09:39:01] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883512 (10dcaro) I see this on the logs: ` Jun 12 09:32:09 cloudservices1006 pdns_server[1430]: Domai... 
[09:45:10] (03update) 10sstefanova: api: remove unprefixed endpoints [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/97 [09:50:02] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1031.eqiad.wmnet' [09:53:49] (03PS1) 10Majavah: openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) [09:55:47] (03update) 10aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110) [09:56:46] (03CR) 10CI reject: [V:04-1] openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) (owner: 10Majavah) [09:57:32] (03PS2) 10Majavah: openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) [10:02:01] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883576 (10dcaro) So it seems designate-worker is having issues trying to connect to the designate dat... [10:02:56] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:07:56] RESOLVED: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:08:22] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1031.eqiad.wmnet' [10:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:12:53] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883607 (10dcaro) found errors on designate-central service too, rabbitmq ones yesterday and database... 
[10:15:10] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883624 (10dcaro) Just checked on the 3 galera nodes, just in case, same serial. [10:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [10:23:14] (03update) 10aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110) [10:23:54] PROBLEM - mysqld processes on clouddb1019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:35:54] RECOVERY - mysqld processes on clouddb1019 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:46:02] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9883704 (10kostajh) >>! In T366763#9882611, @bd808 wrote: > My personal preference is for the plain comment version. I find the robot comment to be very nice as an inli... 
[11:14:34] (03update) 10aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 (https://phabricator.wikimedia.org/T279110) [11:24:50] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883889 (10dcaro) I ended up restarting all the designate processes in all the nodes: ` root@cumin1002... [11:29:14] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1032.eqiad.wmnet' [11:30:33] 10wikitech.wikimedia.org, 07LDAP: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287 (10taavi) 03NEW [11:30:53] 06cloud-services-team, 13Patch-For-Review: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883907 (10dcaro) And novafullstack is now passing :) [11:30:57] 10wikitech.wikimedia.org, 07LDAP: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9883908 (10taavi) [11:31:08] 06cloud-services-team, 10MediaWiki-extensions-OpenStackManager, 10wikitech.wikimedia.org, 13Patch-For-Review: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553#9883909 (10taavi) [11:33:11] 06cloud-services-team, 10wikitech.wikimedia.org, 06Infrastructure-Foundations, 07LDAP: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9883915 (10taavi) [11:34:52] (03merge) 10aborrero: maintain_kubeusers: add support for kyverno policies [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18 
(https://phabricator.wikimedia.org/T279110) [11:35:20] 06cloud-services-team, 10Toolforge: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bullseye - https://phabricator.wikimedia.org/T311905#9883933 (10taavi) a:05taavi→03None [11:35:25] 06cloud-services-team, 10Toolforge: Upgrade Toolforge apt repository (tools-services hosts) to Debian Bullseye or later - https://phabricator.wikimedia.org/T311914#9883934 (10taavi) a:05taavi→03None [11:37:16] (03open) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.148-20240612113501-fa8bd88a [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/326 (https://phabricator.wikimedia.org/T279110) [11:37:56] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1032.eqiad.wmnet' [11:38:30] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [11:39:28] FIRING: [2x] InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-control-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:41:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-52 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:45:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-control-9 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [11:45:49] (03update) 
10aborrero: kubernetes: introduce securityContext in the pod template [repos/cloud/toolforge/tools-webservice] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/37 (https://phabricator.wikimedia.org/T362050) [11:47:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-52 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [11:47:32] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1031.eqiad.wmnet' [11:47:54] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Also remove dummy keytabs for decommed stat servers [labs/private] - 10https://gerrit.wikimedia.org/r/1041686 (owner: 10Muehlenhoff) [11:49:44] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1031.eqiad.wmnet' [11:49:54] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [11:50:00] (03open) 10sstefanova: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 [11:50:04] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [11:50:23] FIRING: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-control-9 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [11:52:02] 06cloud-services-team, 13Patch-For-Review: 
NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T367235#9883986 (10dcaro) 05Open→03Resolved a:03dcaro Deployed [11:52:37] (03update) 10dcaro: cli: centralize context management [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/44 (owner: 10sstefanova) [11:56:17] (03update) 10dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29 [11:58:09] (03approved) 10dcaro: cli: centralize context management [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/44 (owner: 10sstefanova) [11:58:11] (03update) 10dcaro: cli: centralize context management [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/44 (owner: 10sstefanova) [11:58:56] (03update) 10dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29 [12:00:30] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers [12:00:43] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers [12:01:26] (03update) 10sstefanova: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 [12:01:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-52 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:01:58] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-52 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:02:56] (03merge) 10sstefanova: cli: centralize context management [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/44 [12:06:43] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-52 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:27:15] (03update) 10sstefanova: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 [12:28:44] (03update) 10sstefanova: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 [12:28:55] 10Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9884091 (10IKhitron) Well, for the last 21 hours no query could run. So it's pretty bad. 
[12:31:43] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-52 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:32:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-52 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[12:34:28] RESOLVED: [2x] InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-control-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:35:12] (approved) dcaro: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 (owner: sstefanova)
[12:35:15] (update) dcaro: cli: use prefixed endpoints [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/38 (owner: sstefanova)
[12:35:23] RESOLVED: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node toolsbeta-test-k8s-control-9 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[12:45:28] (CR) FNegri: [C:+1] openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[12:46:16] (CR) Majavah: [C:+2] openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[12:48:55] (Merged) jenkins-bot: openstack: Drop add_flavor cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1042190 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[12:49:21] FIRING: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.64M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[12:51:40] Toolforge (Toolforge iteration 11): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9884134 (Slst2020) Open→In progress
[12:54:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 28.64M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[12:57:49] FIRING: NeutronAgentDownForLong: Neutron neutron-linuxbridge-agent on cloudvirt1031 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong
[12:57:56] cloud-services-team: NeutronAgentDownForLong A Neutron agent has been down for more than 2h, VMs will have connectivity issues - https://phabricator.wikimedia.org/T365461#9884158 (phaultfinder)
[12:58:12] (update) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[12:58:50] FIRING: NeutronAgentDown: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[12:59:30] Toolforge (Toolforge iteration 11): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9884164 (Slst2020)
[12:59:38] (update) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:04:54] (open) aborrero: maintain-kubeusers: adjust k8s probes [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/41
[13:05:03] (update) sstefanova: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29 (owner: dcaro)
[13:05:52] (approved) sstefanova: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29 (owner: dcaro)
[13:06:05] (update) aborrero: maintain-kubeusers: adjust k8s probes [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/41
[13:07:15] (update) aborrero: maintain-kubeusers: adjust k8s probes [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/41
[13:07:54] Toolforge (Toolforge iteration 11): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9884182 (Slst2020) Flask blueprints default to introducing a trailing slash at the "root" URL for each blueprint in jobs-api. Is this something we want to remove too?
[13:08:46] (update) sstefanova: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29 (owner: dcaro)
[13:13:48] (update) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:14:30] (update) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:20:38] Toolforge (Toolforge iteration 11): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9884258 (dcaro) >>! In T365014#9884182, @Slst2020 wrote: > Flask blueprints default to introducing a trailing slash at the "root" URL for each blueprint in jobs-api. Is this...
[13:22:10] (approved) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:22:12] (update) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:22:14] (merge) dcaro: create: add nicer error when quota is reached [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/29
[13:22:49] RESOLVED: NeutronAgentDownForLong: Neutron neutron-linuxbridge-agent on cloudvirt1031 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong
[13:23:50] RESOLVED: NeutronAgentDown: Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[13:24:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:26:31] dhinus opened https://github.com/toolforge/quarry/pull/46
[13:29:28] FIRING: InstanceDown: Project tools instance tools-k8s-control-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:29:51] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:31:03] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component envvars-api
[13:31:14] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component envvars-api
[13:31:51] Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9884291 (fnegri) > Ah well, looks like the deployment of https://github.com/toolforge/quarry/pull/40 didn't fully go through. The keys TOOLS_DB_USER and TOOLS_DB_PASSWORD are missing i...
[13:34:28] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-control-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:36:21] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:42:48] (update) dcaro: vm: set fs.inotify.max_user_instances=1024 [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/141
[13:46:21] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:48:38] (approved) dcaro: [jobs-api] move simple job validations to pydantic [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/89 (https://phabricator.wikimedia.org/T366209) (owner: raymond-ndibe)
[13:48:41] (update) dcaro: [jobs-api] move simple job validations to pydantic [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/89 (https://phabricator.wikimedia.org/T366209) (owner: raymond-ndibe)
[13:49:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-control-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:52:16] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component envvars-api
[13:52:28] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component envvars-api
[14:00:17] (approved) dcaro: envvars-api: bump to 0.0.49-20240612132227-4cbdd42c [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/327 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[14:00:20] (update) dcaro: envvars-api: bump to 0.0.49-20240612132227-4cbdd42c [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/327 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[14:00:20] (merge) dcaro: envvars-api: bump to 0.0.49-20240612132227-4cbdd42c [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/327 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[14:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:08:03] Toolforge (Toolforge iteration 11): [builds-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367182#9884525 (dcaro) p:Triage→High
[14:08:04] Toolforge (Toolforge iteration 11): [envvars-api] Remove authentication and use api-gateway provided headers - https://phabricator.wikimedia.org/T367181#9884526 (dcaro) p:Triage→High
[14:08:09] Toolforge (Toolforge iteration 11): [jobs-api] Remove authentication and use the api-gateway provided headers - https://phabricator.wikimedia.org/T367180#9884529 (dcaro) p:Triage→High
[14:10:13] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9884532 (dcaro) (note that the quotas have been bumped to 64 secrets, and now the errors says that the quota is exceeded clearly, all mitigations though not full...
[14:10:36] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9884531 (dcaro) The issue is that in tools/toolsbeta there's a few envvars that we don't want to delete like toolsdb pass and such, so we can't just delete everyt...
[14:11:34] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, LDAP, Patch-For-Review: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9884539 (taavi) a:taavi
[14:12:01] cloud-services-team, Toolforge (Toolforge iteration 11): toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition) - https://phabricator.wikimedia.org/T309782#9884534 (dcaro) In progress→Stalled
[14:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:38:05] Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9884857 (Xaosflux) p:Triage→High Problem still occurring today and all recent requests have failed (see https://quarry.wmcloud.org/query/runs/all)
[14:38:37] Toolforge (Toolforge iteration 11): [jobs-api,builds-api,envvars-api] consolidate api paths - https://phabricator.wikimedia.org/T365014#9884872 (Slst2020) Shouldn't be too complicated hopefully, I think there's an option when creating the Blueprint.
[14:46:45] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9884949 (fnegri) Open→In progress a:GTrang→fnegri
[14:47:46] dhinus closed https://github.com/toolforge/quarry/pull/46
[14:50:23] Toolforge (Toolforge iteration 11): [toolforge-deploy] envvars functional tests fail when out of quota - https://phabricator.wikimedia.org/T367169#9884987 (Slst2020) >>! In T367169#9884531, @dcaro wrote: > The issue is that in tools/toolsbeta there's a few envvars that we don't want to delete like toolsdb pa...
[14:56:54] Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9885001 (bd808) >>! In T366763#9883704, @kostajh wrote: > I think a plain comment is better. However, I think it would make sense to add `autogenerated:scheduledeploy...
[14:59:38] (merge) aborrero: maintain-kubeusers: adjust k8s probes [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/41
[15:00:38] (update) aborrero: maintain-kubeusers: bump to 0.0.148-20240612113501-fa8bd88a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/326 (https://phabricator.wikimedia.org/T279110) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:01:06] (merge) aborrero: maintain-kubeusers: bump to 0.0.148-20240612113501-fa8bd88a [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/326 (https://phabricator.wikimedia.org/T279110) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:01:52] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[15:02:03] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[15:03:30] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[15:03:44] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[15:08:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:10:36] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9885050 (github-toolforge-bot) siddharthvp opened https://github.com/toolforge/quarry/pull/47
[15:10:36] siddharthvp opened https://github.com/toolforge/quarry/pull/47
[15:11:06] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9885051 (SD0001) >>! In T365374#9880226, @SD0001 wrote: > But this doesn't explain why the issue occurs when querying the wiki replicas. F...
[15:11:55] (approved) aborrero: maintain-kubeusers: bump to 0.0.149-20240612145951-b637ff58 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/328 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:11:58] (merge) aborrero: maintain-kubeusers: bump to 0.0.149-20240612145951-b637ff58 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/328 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[15:12:56] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9885069 (fnegri) https://github.com/toolforge/quarry/pull/46 and https://github.com/toolforge/quarry/pull/47 should probably fix the issue,...
[15:16:32] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9885089 (SD0001) I'm not sure either about how to deploy in the new k8s-based setup. However, `/home/rook/quarry` seems to be the git check...
[15:17:51] (open) aborrero: kyverno: increase memory and CPU limits in general [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/329
[15:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:18:49] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9885105 (fnegri) I'm confused because that directory has a checkout of a non-main branch: ` root@quarry-bastion:/home/rook/quarry# sudo -u...
[15:19:11] (update) aborrero: kyverno: increase memory and CPU limits in general [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/329
[15:20:54] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[15:21:02] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[15:22:43] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component kyverno
[15:23:04] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component kyverno
[15:24:05] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component kyverno
[15:24:29] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component kyverno
[15:28:01] (merge) aborrero: kyverno: increase memory and CPU limits in general [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/329
[15:42:51] (open) aborrero: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42
[15:47:51] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:52:51] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:52:53] cloud-services-team, Toolforge: toolforge maintain-kubeusers backtrace - https://phabricator.wikimedia.org/T367332 (aborrero) NEW
[15:53:01] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, LDAP, Patch-For-Review: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9885310 (taavi) Oh. Right. Good point, I forgot about that.
[15:53:05] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, LDAP, Patch-For-Review: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9885311 (taavi) Open→Stalled
[15:53:48] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, LDAP, Patch-For-Review: Update Wikitech's LDAP credentials to be read-only - https://phabricator.wikimedia.org/T367287#9885313 (taavi)
[15:55:28] siddharthvp closed https://github.com/toolforge/quarry/pull/47
[15:57:28] FIRING: InstanceDown: Project tools instance tools-k8s-control-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:59:54] (open) aborrero: resources: better handle state configmap read failures [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43 (https://phabricator.wikimedia.org/T367332)
[16:03:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-control-9 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[16:13:08] (update) aborrero: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42
[16:13:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-control-9 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[16:16:37] (update) aborrero: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42
[16:17:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-control-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:21:35] (approved) dcaro: resources: better handle state configmap read failures [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43 (https://phabricator.wikimedia.org/T367332) (owner: aborrero)
[16:23:17] (approved) dcaro: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42 (owner: aborrero)
[16:23:19] (update) dcaro: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42 (owner: aborrero)
[16:23:23] (merge) aborrero: kyverno: reduce length of some fields [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42
[16:24:26] (update) aborrero: resources: better handle state configmap read failures [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43 (https://phabricator.wikimedia.org/T367332)
[16:26:00] cloud-services-team: [grafana,ceph] Add both ends of switch links to the error/discard dashboards and include them also in the health section - https://phabricator.wikimedia.org/T367336#9885500 (dcaro) p:Triage→High
[16:26:57] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[16:27:08] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[16:28:33] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[16:28:46] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[16:29:41] (merge) aborrero: maintain-kubeusers: bump to 0.0.150-20240612162335-3208f6fa [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/330 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[16:29:51] (merge) aborrero: resources: better handle state configmap read failures [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43 (https://phabricator.wikimedia.org/T367332)
[16:32:53] vivian-rook opened https://github.com/toolforge/paws/pull/432
[16:33:51] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:36:28] FIRING: InstanceDown: Project tools instance tools-k8s-control-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:38:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:41:28] FIRING: [3x] InstanceDown: Project tools instance tools-k8s-control-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:42:16] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[16:42:18] !log dcaro@urcuchillay tools END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[16:42:19] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[16:42:23] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[16:42:37] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[16:42:42] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[16:42:53] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[16:42:59] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[16:43:05] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[16:43:51] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:47:19] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[16:47:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[16:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:52:55] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[16:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:53:51] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:56:23] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[16:56:28] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-control-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:57:19] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[16:57:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[16:58:51] FIRING: [4x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:01:28] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-control-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:08:51] RESOLVED: [4x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:11:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[17:12:19] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown
[17:12:21] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[17:13:23] (open) taavi: templates: Fix webhook namespace label selector [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/5
[17:13:27] (update) taavi: templates: Fix webhook namespace label selector [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/5
[17:13:52] (approved) dcaro: templates: Fix webhook namespace label selector [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/5 (owner: taavi)
[17:16:30] (merge) taavi: templates: Fix webhook namespace label selector [repos/cloud/toolforge/registry-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/5
[17:17:13] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component registry-admission
[17:17:21] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component registry-admission
[17:18:51] FIRING: [2x] MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[17:18:55] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: registry-admission: bump to 0.0.42-20240612171641-a8eaf992 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/332
[17:19:47] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component registry-admission
[17:19:58] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component registry-admission
[17:20:29] (merge) taavi: registry-admission: bump to 0.0.42-20240612171641-a8eaf992 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/332 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[17:20:48] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component registry-admission
[17:20:57] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component registry-admission
[17:21:08] !log taavi@cloudcumin1001
tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component registry-admission [17:21:18] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component registry-admission [17:28:51] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [17:33:56] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9885899 (10bd808) >>! In T366763#9882742, @bd808 wrote: > `lang=irc > [00:05] < MatmaRex> bd808: i think it would be nice to use a normal comment style that allows rep... [17:35:43] 10Toolforge (Toolforge iteration 11), 13Patch-For-Review: [toolforge] Investigate authentication - https://phabricator.wikimedia.org/T363983#9885907 (10dcaro) [17:38:38] 10Tool-schedule-deployment: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9885916 (10bd808) >>! In T366763#9883704, @kostajh wrote: > On a related note, I wonder if the ScheduleDeploymentBot should also comment on the patch on master that had... [18:12:45] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9886036 (10Liz) Some of my queries are running okay now, others are just spinning their wheels, like they are running but they never conclude... 
[18:14:41] Toolforge: webservice-runner: Also read uwsgi.ini from ~/www/python/src/ - https://phabricator.wikimedia.org/T367345 (LucasWerkmeister) NEW
[18:17:30] (PS1) Lucas Werkmeister: python: Also look for ~/www/python/src/uwsgi.ini [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345)
[18:28:28] cloud-services-team (FY2023/2024-Q3-Q4), Quarry: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374#9886138 (KylieTastic) >>! In T365374#9886036, @Liz wrote: > Some of my queries are running okay now, others are just spinning their wheels,...
[18:51:50] Cloud-Services, cloud-services-team, Sustainability (Incident Followup): Incident: 2024-06-12 toolforge k8s control plane - https://phabricator.wikimedia.org/T367348 (Andrew) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikime...
[18:52:30] Cloud-Services, cloud-services-team, Sustainability (Incident Followup): Incident: 2024-06-12 toolforge k8s control plane - https://phabricator.wikimedia.org/T367348#9886209 (Andrew)
[18:52:31] cloud-services-team, Toolforge, Sustainability (Incident Followup): toolforge: scale up coredns replicas - https://phabricator.wikimedia.org/T333934#9886210 (Andrew)
[18:53:25] Cloud-Services, cloud-services-team, Sustainability (Incident Followup): Fix HA proxy load-balancer health check monitor to not poll nodes where the API is not responding - https://phabricator.wikimedia.org/T367349 (Andrew) NEW The #Cloud-Services project tag is not intended to have any tasks. Pl...
[18:54:23] cloud-services-team, Toolforge, Sustainability (Incident Followup): Incident: 2024-06-12 toolforge k8s control plane - https://phabricator.wikimedia.org/T367348#9886231 (Andrew)
[18:55:10] cloud-services-team, Toolforge, Sustainability (Incident Followup): Verify that kyverno policies match our namespace - https://phabricator.wikimedia.org/T367350 (Andrew) NEW
[18:56:27] siddharthvp closed https://github.com/toolforge/quarry/pull/42
[19:23:57] Quarry: query runs forever - https://phabricator.wikimedia.org/T366909#9886318 (SD0001) This could be because of a worker going down.
[19:24:18] Quarry: query runs forever - https://phabricator.wikimedia.org/T366909#9886320 (SD0001) →Duplicate dup:T278583
[19:25:04] cloud-services-team, Quarry: Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583#9886322 (SD0001)
[19:32:58] Quarry, patch-welcome: Timer that counts up as the query is running - https://phabricator.wikimedia.org/T353690#9886357 (SD0001) To implement this, we can record the start time as another attribute in the extra_info JSON blob in query_run table when the query status is set to running, and use JS client-s...
[19:38:09] (CR) BryanDavis: [C:+2] "LGTM." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345) (owner: Lucas Werkmeister)
[19:38:41] (Merged) jenkins-bot: python: Also look for ~/www/python/src/uwsgi.ini [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345) (owner: Lucas Werkmeister)
[19:44:45] Toolforge, Patch-For-Review: webservice-runner: Also read uwsgi.ini from ~/www/python/src/ - https://phabricator.wikimedia.org/T367345#9886390 (bd808) Open→In progress a:LucasWerkmeister
[20:32:00] Toolforge: webservice-runner: Also read uwsgi.ini from ~/www/python/src/ - https://phabricator.wikimedia.org/T367345#9886622 (LucasWerkmeister) In progress→Resolved Seems to work fine as far as I can tell \o/
[20:32:37] Cloud-VPS, Quarry: [bug] Lot of queries stuck in queued state for hours and days - https://phabricator.wikimedia.org/T365136#9886632 (SD0001) No, but the PR fixes the UI so that the actual error shows up.
[20:39:07] Quarry: Unbreak Quarry query killer - https://phabricator.wikimedia.org/T367363 (SD0001) NEW
[20:48:24] siddharthvp opened https://github.com/toolforge/quarry/pull/48
[20:52:12] Quarry: Unbreak Quarry query killer - https://phabricator.wikimedia.org/T367363#9886698 (SD0001) https://github.com/toolforge/quarry/pull/48
[21:10:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:20:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:07:41] FIRING: [2x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:08:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudgw1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:12:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:17:41] FIRING: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:22:41] RESOLVED: [3x] CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:29:59] (open) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:32:17] (update) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:36:16] vivian-rook opened https://github.com/toolforge/quarry/pull/49
[22:42:27] (update) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:46:15] vivian-rook closed https://github.com/toolforge/quarry/pull/49
[22:46:57] (update) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:47:03] (approved) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:47:10] (merge) bd808: gerrit: Leave a comment on the Gerrit change after scheduling [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/9 (https://phabricator.wikimedia.org/T366763)
[22:48:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudgw1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:49:35] (update) bd808: Add note about using local time zone [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/8 (owner: matmarex)
[22:52:45] (approved) bd808: Add note about using local time zone [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/8 (owner: matmarex)
[22:53:37] (update) bd808: Add note about using local time zone [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/8 (owner: matmarex)
[22:53:44] (merge) bd808: Add note about using local time zone [toolforge-repos/schedule-deployment] - https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/merge_requests/8 (owner: matmarex)
[23:02:44] Tool-schedule-deployment, Patch-For-Review: Leave a comment on the Gerrit change when it is scheduled for a backport - https://phabricator.wikimedia.org/T366763#9886918 (bd808) In progress→Resolved
[23:05:48] Tool-schedule-deployment: Add comment on original change when a cherry-pick is scheduled for backport deployment - https://phabricator.wikimedia.org/T367368 (bd808) NEW
[23:28:35] Tool-schedule-deployment: ScheduleDeploymentBot should refuse to add more than 6 patches to a backport window - https://phabricator.wikimedia.org/T367229#9887013 (bd808) The implementation for this may end up being fragile, but I think it is worth attempting. The fragility I'm worried about is due to the way...