[00:16:28] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:28] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:13:00] (PS1) Andrew Bogott: Add another profile::openstack::eqiad1::nova::db_pass [labs/private] - https://gerrit.wikimedia.org/r/1016053
[01:13:22] (CR) Andrew Bogott: [V:+2 C:+2] Add another profile::openstack::eqiad1::nova::db_pass [labs/private] - https://gerrit.wikimedia.org/r/1016053 (owner: Andrew Bogott)
[01:18:56] (CloudVPSDesignateLeaks) firing: (5) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:24:03] (PS1) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:28:48] (PS2) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:36:04] Cloud-Services, Beta-Cluster-Infrastructure: Launching new bullseye deployment-prep instances fails, no sudo access - https://phabricator.wikimedia.org/T361536 (thcipriani) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.or...
[01:36:34] Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure: Launching new bullseye deployment-prep instances fails, no sudo access - https://phabricator.wikimedia.org/T361536#9678194 (thcipriani)
[01:42:10] (PS3) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:45:17] (PS4) Andrew Bogott: openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055
[01:48:52] (CR) Andrew Bogott: [V:+2 C:+2] openstack: move some passwords from eqiad to common [labs/private] - https://gerrit.wikimedia.org/r/1016055 (owner: Andrew Bogott)
[02:12:28] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/labs/private on instance cloudinfra-internal-puppetserver-1 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[02:31:34] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1015966 (owner: L10n-bot)
[03:12:21] cloud-services-team, DC-Ops, ops-codfw, SRE: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537 (Andrew) NEW
[03:16:00] cloud-services-team, DC-Ops, ops-codfw, SRE: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678267 (Andrew)
[03:17:52] cloud-services-team, DC-Ops, ops-codfw, SRE: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678269 (Andrew) @cmooney do you recall if we have special secret routing set up someplace to make this work for the old cloudbackup hosts?
[03:55:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:00:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:47:38] (PS1) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075
[05:05:18] (PS2) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075
[05:18:56] (CloudVPSDesignateLeaks) firing: (5) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:32:50] VPS-project-Codesearch: Sort repo search results alphabetically - https://phabricator.wikimedia.org/T339191#9678545 (Sebastian_Berlin-WMSE) Open→Invalid That makes sense. Thanks for the explanation.
[07:38:31] cloud-services-team, DC-Ops, ops-codfw, SRE: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678727 (taavi) Open→Resolved
[07:39:35] cloud-services-team, DC-Ops, ops-codfw, SRE: connectivity from cloudbackup200[34] and eqiad ceph - https://phabricator.wikimedia.org/T361537#9678725 (taavi) a:Andrew→taavi I ran the Capirca netbox script and that updated the firewall policy on `cr*-eqiad`: `lang=diff [edit fire...
[07:42:59] (CR) Majavah: "I don't have concerns with this specific extension, but unless there's a specific reason not to I'd like to only add new functionality to " [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: Krinkle)
[07:51:22] Cloud-VPS, Infrastructure-Foundations, Spicerack, SRE-tools: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; e... - https://phabricator.wikimedia.org/T361218#9678784
[07:53:28] Toolforge (Toolforge iteration 07): Upgrade Toolforge front proxies to Bookworm - https://phabricator.wikimedia.org/T361223#9678787 (taavi)
[07:54:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-docker-registry-05
[07:54:13] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-docker-registry-05
[07:54:34] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-docker-registry-02
[07:54:41] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-docker-registry-02
[07:55:32] Toolforge (Toolforge iteration 07): Upgrade Toolforge Docker registry to bookworm - https://phabricator.wikimedia.org/T361030#9678792 (taavi) In progress→Resolved
[07:56:02] (ProbeDown) firing: Service tools-proxy-06:443 has failed probes (http_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-proxy-06:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:01:02] (ProbeDown) resolved: Service tools-proxy-06:443 has failed probes (http_toolforge_org_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#tools-proxy-06:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:01:28] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 1 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[08:30:09] Tool-kuwikibot: Invalid source code and issues URL on https://toolsadmin.wikimedia.org/tools/id/kuwikibot - https://phabricator.wikimedia.org/T361553 (Aklapper) NEW
[08:30:36] (PS1) Muehlenhoff: Remove stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413)
[08:32:20] ToolforgeBundle: Upgrade to Symfony 7 - https://phabricator.wikimedia.org/T361554 (Samwilson) NEW
[08:34:17] ToolforgeBundle: Upgrade to Symfony 7 - https://phabricator.wikimedia.org/T361554#9678919 (Samwilson)
[08:52:08] Tool-cycling-init-bot: Cycling-init-bot has two source code locations which are out of sync - https://phabricator.wikimedia.org/T361562 (Aklapper) NEW
[08:55:28] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563 (dcaro) NEW p:Triage→High
[09:04:27] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9679086 (dcaro) The expired cert is: ` root@enc-2:~# openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -text...
[09:11:11] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9679116 (dcaro) It seems that the problem is with the cached certificates, this forced the host to get the newer o...
[09:18:57] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:21:36] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9679150 (dcaro) After re-arming the keyholder, ran the command and puppet is back running as usual: ` root@cloud-c...
[09:22:48] Cloud-VPS (Project-requests): Request creation of o11y VPS project to replace monitoring - https://phabricator.wikimedia.org/T361566 (fgiunchedi) NEW
[09:47:58] (PuppetAgentStaleLastRun) resolved: (6) Last Puppet run was over 24 hours ago on instance cloud-cumin-04 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:00:51] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563#9679317 (dcaro) Open→Resolved a:dcaro
[10:04:58] (PuppetSyncFailure) resolved: Failed to update Puppet repository /srv/git/labs/private on instance cloudinfra-internal-puppetserver-1 in project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[10:16:11] cloud-services-team, Toolforge, Kubernetes, Patch-For-Review: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye -
https://phabricator.wikimedia.org/T349207#9679368 (dcaro) @Andrew the last patch was not enough it seems, the certificate files need to be sorted out too: ` root@toolsbeta-test-k8s...
[10:28:40] cloud-services-team, Cloud-VPS: Request to add catalyst-qte.wmcloud.org webproxy subdomain for the catalyst-qte CloudVPS project - https://phabricator.wikimedia.org/T361517#9679407 (taavi) Documentation for setting this up is at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable_...
[11:32:41] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1035.eqiad.wmnet' (T319184)
[11:32:46] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[11:45:34] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1035.eqiad.wmnet' (T319184)
[11:45:39] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[12:23:42] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:28:42] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:36:29] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[12:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:36:46] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[12:36:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:37:01] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[12:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:37:27] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[12:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:38:44] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-db-2 (T344717)
[12:38:47] T344717: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717
[12:39:34] !log fnegri@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-2 (T344717)
[12:44:09] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[12:46:45] (PS1) Muehlenhoff: Remove obsolete stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016312 (https://phabricator.wikimedia.org/T360412)
[12:48:09] (PS1) Muehlenhoff: schema: Remove dummy cert [labs/private] - https://gerrit.wikimedia.org/r/1016316 (https://phabricator.wikimedia.org/T360412)
[12:51:53] Quarry, Toolforge, ChangeProp, collaboration-services, and 9 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#9679718 (jijiki) >>! In T360596#9676049, @akosiaris wrote: > > My 2, operationally minded, cents says to wait for...
[12:55:47] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Goal, Patch-For-Review: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717#9679961 (fnegri) The new replica `tools-db-3` is now in sync with the primary. I deleted the old replica `tools-db-2`.
[12:57:06] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services: [toolsdb] set gtid_domain_id to 0 - https://phabricator.wikimedia.org/T357341#9679972 (fnegri)
[12:57:07] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Goal, Patch-For-Review: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717#9679973 (fnegri)
[12:58:05] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Goal, Patch-For-Review: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717#9679963 (fnegri) In progress→Resolved
[12:59:56] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1035']
[13:00:03] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1035']
[13:00:19] PAWS: Remove paws-123-12 cluster - https://phabricator.wikimedia.org/T360916#9679987 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/396
[13:00:44] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance
[13:00:50] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0)
[13:00:52] vivian-rook opened https://github.com/toolforge/paws/pull/396
[13:05:41] PAWS: Remove paws-123-12 cluster - https://phabricator.wikimedia.org/T360916#9680007 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/396
[13:05:49]
vivian-rook closed https://github.com/toolforge/paws/pull/396
[13:05:53] PAWS: Remove paws-123-12 cluster - https://phabricator.wikimedia.org/T360916#9680008 (rook) Open→Resolved
[13:07:55] superset.wmcloud.org: Remove superset-123-3 cluster - https://phabricator.wikimedia.org/T355707#9680011 (rook) a:rook
[13:09:02] superset.wmcloud.org: Remove superset-123-3 cluster - https://phabricator.wikimedia.org/T355707#9680014 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/superset-deploy/pull/19
[13:09:11] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/19
[13:11:39] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[13:13:42] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:17:25] Cloud-VPS (Project-requests): Request creation of o11y VPS project to replace monitoring - https://phabricator.wikimedia.org/T361566#9680047 (fgiunchedi) Thank you folks for the quick action on this! Appreciate it >>! In T361566#9679784, @dcaro wrote: > +1, please make sure report back when the...
[13:18:42] (CloudVPSDesignateLeaks) firing: (5) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:19:42] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure: Replace deployment-ores02 - https://phabricator.wikimedia.org/T361385#9680074 (Andrew) Open→Resolved Yep, it's gone now.
[13:19:48] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up - https://phabricator.wikimedia.org/T358774#9680078 (fnegri)
[13:19:50] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Data-Services, Patch-For-Review: [cinder] [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904#9680079 (fnegri)
[13:20:02] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: [wmcs-backup] Race condition between backup and cleanup timers - https://phabricator.wikimedia.org/T358780#9680082 (fnegri)
[13:20:05] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Data-Services, Patch-For-Review: [cinder] [toolsdb] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904#9680083 (fnegri)
[13:21:00] superset.wmcloud.org: Remove superset-123-3 cluster - https://phabricator.wikimedia.org/T355707#9680090 (rook) Open→Resolved
[13:21:02] Cloud-VPS (Project-requests): Request creation of o11y VPS project to replace monitoring - https://phabricator.wikimedia.org/T361566#9680089 (taavi) >>! In T361566#9680047, @fgiunchedi wrote: >> Just a note that we now have https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_...
[13:21:05] superset.wmcloud.org: Remove superset-123-3 cluster - https://phabricator.wikimedia.org/T355707#9680091 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/superset-deploy/pull/19
[13:21:19] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/19
[13:30:58] cloud-services-team, Toolforge, Kubernetes, Patch-For-Review: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207#9680115 (Andrew) Yeah, there's a lot of self-contradictory explicit ordering in this code. I wish I knew if it was put there to solve anyt...
[13:44:40] cloud-services-team, Toolforge (Toolforge iteration 07), Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#9680166 (taavi) a:taavi→dcaro
[13:46:21] Toolforge (Toolforge iteration 07): Rust image build on toolforge fails - https://phabricator.wikimedia.org/T358552#9680179 (dcaro) @Magnus we have done a few changes in the proxy config to alleviate this issue, are you still seeing the errors?
[13:49:41] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update pki project puppetmaster - https://phabricator.wikimedia.org/T361591 (Andrew) NEW
[13:49:42] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update monitoring project puppetmaster - https://phabricator.wikimedia.org/T361592 (Andrew) NEW
[13:49:44] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update puppet-dev project puppetmaster - https://phabricator.wikimedia.org/T361593 (Andrew) NEW
[13:50:56] cloud-services-team, VPS-Projects, Patch-For-Review, Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452#9680246 (Andrew)
[13:52:30] Toolforge (Toolforge iteration 07): [builds-api,envvars-api] bump the version in the openapi definition when bumping the package version - https://phabricator.wikimedia.org/T356972#9680264 (dcaro) from @aborrero: we can write a script to detect changes in the openapi definition and complain if there's no ver...
[13:52:40] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update mariadbtest project puppetmaster - https://phabricator.wikimedia.org/T361594 (Andrew) NEW
[13:52:50] Cloud-VPS (Project-requests): Request creation of o11y VPS project to replace monitoring - https://phabricator.wikimedia.org/T361566#9680283 (fgiunchedi) >>! In T361566#9680089, @taavi wrote: >>>! In T361566#9680047, @fgiunchedi wrote: >>> Just a note that we now have https://wikitech.wikimedia.or...
[13:52:58] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update puppet civicrm-prototype puppetmaster - https://phabricator.wikimedia.org/T361595 (Andrew) NEW
[13:53:18] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update puppet wikidata-query puppetmaster - https://phabricator.wikimedia.org/T361596 (Andrew) NEW
[13:55:07] cloud-services-team, VPS-Projects, Patch-For-Review, Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452#9680315 (Andrew)
[14:01:22] cloud-services-team, Toolforge (Toolforge iteration 07): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9680341 (dcaro) Waiting for user input to see if this happens again (might go and check the logs too)
[14:03:44] Toolforge (Toolforge iteration 07), Software-Licensing: [builds-api] builds-api is missing a software license - https://phabricator.wikimedia.org/T361007#9680345 (dcaro)
[14:05:32] cloud-services-team, Toolforge: Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110#9680368 (dcaro) p:Medium→High
[14:08:53] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#9680382 (aborrero) I'm surprised (in a...
[14:20:03] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[14:26:15] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0)
[14:26:23] Quarry: Move quarry to magnum - https://phabricator.wikimedia.org/T349029#9680448 (rook) Open→Resolved
[14:27:53] superset.wmcloud.org: Upgrade to 3.1.1 - https://phabricator.wikimedia.org/T361601 (rook) NEW
[14:31:51] cloud-services-team, VPS-Projects, collaboration-services, Puppet (Puppet 7.0): Update gitlab-runners project puppetmaster - https://phabricator.wikimedia.org/T360459#9680485 (Jelto) Open→Resolved p:Triage→Medium a:Jelto Thanks @Andrew ! The new puppetserver looks...
[14:31:53] superset.wmcloud.org: Upgrade to 3.1.1 - https://phabricator.wikimedia.org/T361601#9680499 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/superset-deploy/pull/20
[14:32:10] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/20
[14:32:28] (InstanceDown) firing: Project toolsbeta instance toolsbeta-test-k8s-etcd-23 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:32:58] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance toolsbeta-test-k8s-etcd-23 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:33:57] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[14:34:01] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[14:37:28] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-test-k8s-etcd-23 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:37:41] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Data-Platform-SRE (2024.03.25 - 2024.04.14): create and deploy new Elastic
Curator deb package - https://phabricator.wikimedia.org/T361105#9680514 (bking) a:RKemper→bking
[14:43:30] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1036.eqiad.wmnet' (T319184)
[14:43:35] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[14:44:16] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: [wmcs-cookbook] increase_quota cookbook fails - https://phabricator.wikimedia.org/T352840#9680553 (fnegri) a:fnegri→dcaro
[14:44:40] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[14:50:06] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[14:55:17] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99)
[14:59:53] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[15:00:13] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1036.eqiad.wmnet' (T319184)
[15:00:17] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184
[15:00:25] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:00:33] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[15:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:00:57] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[15:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:01:33] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0)
[15:01:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:06:28] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0)
[15:06:39] (ProbeDown) firing: (2) Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:06:42] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207)
[15:06:45] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207
[15:08:42] (PS1) Elukey: Remove profile::pki::client::auth_key from common.yaml [labs/private] - https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595)
[15:08:57] (CR) Elukey: [V:+2 C:+2] Remove profile::pki::client::auth_key from common.yaml [labs/private] - https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[15:11:39] (ProbeDown) resolved: (2) Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:16:11] (CR) Majavah: "Cloud VPS instances don't read profile hiera so removing this means that provisioning any new instances sing profile::pki::client will be " [labs/private] - https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[15:16:14] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99)
[15:16:44] (PS1)
Elukey: Revert "Remove profile::pki::client::auth_key from common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/1016044
[15:16:49] (CR) Elukey: [V:+2 C:+2] Revert "Remove profile::pki::client::auth_key from common.yaml" [labs/private] - https://gerrit.wikimedia.org/r/1016044 (owner: Elukey)
[15:20:08] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/20
[15:20:52] (CR) Elukey: [V:+2 C:+2] "Duly noted, reverted :)" [labs/private] - https://gerrit.wikimedia.org/r/1016364 (https://phabricator.wikimedia.org/T360595) (owner: Elukey)
[15:21:13] superset.wmcloud.org: Upgrade to 3.1.1 - https://phabricator.wikimedia.org/T361601#9680775 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/superset-deploy/pull/20
[15:22:45] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node
[15:24:49] superset.wmcloud.org: Remove superset-123-4 cluster - https://phabricator.wikimedia.org/T361606 (rook) NEW
[15:28:27] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99)
[15:30:52] (PS1) Elukey: Remove profile::pki::client's specific hiera config [labs/private] - https://gerrit.wikimedia.org/r/1016386 (https://phabricator.wikimedia.org/T360595)
[15:31:53] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-test-localdisk
[15:31:59] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-test-localdisk
[15:36:24] (PuppetAgentNoResources) resolved: No Puppet resources found on instance toolsbeta-test-localdisk on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:44:57] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681003
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1036.eqiad.wmnet... [15:46:56] (CR) Dzahn: [C:+2] Remove stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413) (owner: Muehlenhoff) [15:46:57] (CR) Dzahn: [V:+2 C:+2] Remove stub secret [labs/private] - https://gerrit.wikimedia.org/r/1016295 (https://phabricator.wikimedia.org/T360413) (owner: Muehlenhoff) [15:52:19] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681068 (aborrero) [16:08:36] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [16:15:41] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [16:22:27] cloud-services-team, Infrastructure-Foundations, netops, SRE, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9681261 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1036.eqiad.wmnet with... 
[16:22:28] (InstanceDown) firing: Project toolsbeta instance toolsbeta-test-k8s-etcd-23 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:22:28] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207) [16:22:32] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207 [16:23:42] (CloudVPSDesignateLeaks) firing: (5) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:27:28] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-test-k8s-etcd-23 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:28:42] (CloudVPSDesignateLeaks) resolved: (5) Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:33:55] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=99) [16:41:55] Toolforge: [buildservice] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519#9681361 (bd808) `lang=shell-session tools.wikibugs-testing@tools-sgebastion-10:~$ toolforge... 
[16:54:33] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [17:00:39] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99) [17:01:43] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [17:08:14] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [17:12:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:14:08] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [17:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:19:22] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [17:22:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:25:42] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207) [17:25:46] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207 [17:27:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 1 stray dns records - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:33:30] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#9681677 (Dreamy_Jazz) [17:33:43] cloud-services-team: update labtestwiki user and password - https://phabricator.wikimedia.org/T328289#9681681 (Dreamy_Jazz) [17:39:16] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0) [17:51:58] (PS1) Andrew Bogott: k8s/etcd/add_node_to_cluster: increase etcd sleeps [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016409 [17:53:13] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207) [17:53:17] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207 [17:58:09] (PS1) Dzahn: delete webserver-misc-* dummy keys [labs/private] - https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) [17:58:42] (CR) Dzahn: [V:+2 C:+2] delete webserver-misc-* dummy keys [labs/private] - https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [17:58:47] (PS2) Dzahn: delete webserver-misc-* dummy keys [labs/private] - https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) [17:59:07] (CR) Dzahn: [V:+2 C:+2] delete webserver-misc-* dummy keys [labs/private] - https://gerrit.wikimedia.org/r/1016412 (https://phabricator.wikimedia.org/T360413) (owner: Dzahn) [18:08:55] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0) [18:12:41] Toolforge (Toolforge iteration 07), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to 
do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9681841 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforg... [18:13:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:20:07] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.add_k8s_etcd_node (T349207) [18:20:11] T349207: [infra] Upgrade Toolforge K8s etcd nodes to Bullseye - https://phabricator.wikimedia.org/T349207 [18:23:36] (CR) Andrew Bogott: [C:+2] k8s/etcd/add_node_to_cluster: increase etcd sleeps [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016409 (owner: Andrew Bogott) [18:23:41] (CloudVPSDesignateLeaks) resolved: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:26:56] (Merged) jenkins-bot: k8s/etcd/add_node_to_cluster: increase etcd sleeps [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1016409 (owner: Andrew Bogott) [18:30:02] Toolforge (Toolforge iteration 07), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] FIgure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9681918 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforg... 
[18:36:32] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_etcd_node (exit_code=0) [18:37:00] Toolforge (Toolforge iteration 07), Patch-For-Review: [jobs-api,jobs-cli] Support services in jobs - https://phabricator.wikimedia.org/T348758#9681926 (Raymond_Ndibe) [18:37:26] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [18:44:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:46:29] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99) [18:49:02] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [18:54:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:57:59] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [19:04:27] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [19:13:01] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=99) [19:13:20] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [19:22:20] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [19:22:58] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.remove_k8s_etcd_node [19:23:05] Toolforge (Toolforge iteration 07), Patch-For-Review: 
[toolforge-cd] remove duplicated run on tag and push to master (just do one if possible) - https://phabricator.wikimedia.org/T353563#9682039 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/... [19:30:58] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.remove_k8s_etcd_node (exit_code=0) [19:32:33] PROBLEM - ensure kvm processes are running on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:34:40] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott drained for T319184 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:35:37] I did my best to silence the noise about cloudvirt1036 [19:37:28] (InstanceDown) firing: Project toolsbeta instance toolsbeta-test-k8s-etcd-22 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:40:41] (CloudVPSDesignateLeaks) firing: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:41:37] (PS3) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [19:42:04] (PS4) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [19:42:28] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-test-k8s-etcd-22 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:45:41] (CloudVPSDesignateLeaks) firing: (5) Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - 
https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:55:13] (PS2) Krinkle: php82-sssd: add php-yaml [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) [19:55:26] (CR) Krinkle: "OK. I went for consistency instead, but either works for me." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1015690 (https://phabricator.wikimedia.org/T361457) (owner: Krinkle) [20:02:03] (PS5) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [20:11:47] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Data-Platform-SRE (2024.03.25 - 2024.04.14): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105#9682101 (bking) Open→Resolved While working... [20:12:34] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682106 (bking) Per subtask, we no longer need to cut a custom package... [20:17:56] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682112 (Volans) @bking I'm not sure what do you mean. As mentioned ear... 
[20:33:47] (PS6) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [20:40:32] (PS7) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [20:46:47] (PS8) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [21:13:06] cloud-services-team, Cloud-VPS: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698#9682226 (Andrew) is this now resolved? [21:14:04] (PS9) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [21:14:16] cloud-services-team, wikitech.wikimedia.org, Epic: Set up a bitu instance for codfw1dev - https://phabricator.wikimedia.org/T360795#9682227 (Andrew) @SLyngshede-WMF, how is bitu currently deployed? Ganeti, k8s, somewhere else? [21:17:14] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools: Remove elasticsearch-curator dependency from Elastic cookbooks - https://phabricator.wikimedia.org/T361647 (bking) NEW [21:19:16] (PS10) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [21:19:19] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9682261 (bking) a:RKemper→None [21:19:57] cloud-services-team (FY2023/2024-Q3-Q4), Infrastructure-Foundations, Spicerack, SRE-tools, Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9682264 (bking) Oops, thank you for pointing that out. I'm discussing... 
[21:20:32] cloud-services-team, Cloud-VPS: Use cloudbackup100[12]-dev for cinder backup test/dev - https://phabricator.wikimedia.org/T358855#9682269 (Andrew) [21:20:41] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:25:41] (CloudVPSDesignateLeaks) firing: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:30:41] (CloudVPSDesignateLeaks) resolved: (5) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:35:19] (PS11) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [21:50:34] (PS12) Krinkle: frontend: Server-side rendering [labs/codesearch] - https://gerrit.wikimedia.org/r/1016075 [21:54:10] Toolforge: "ftl" tool's perl5.32 webservice pod being frequently killed due to liveness probe failures - https://phabricator.wikimedia.org/T361652 (bd808) NEW [22:16:50] (PS1) Catrope: releases: Bump Codex to 1.3.6 [labs/libraryupgrader/config] - https://gerrit.wikimedia.org/r/1016454 (https://phabricator.wikimedia.org/T361472) [22:19:05] Toolforge (Toolforge iteration 07): [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9682441 (Raymond_Ndibe) a:Raymond_Ndibe [22:28:16] Cloud-VPS (Debian Buster 
Deprecation), Beta-Cluster-Infrastructure: Launching new bullseye deployment-prep instances fails, no sudo access - https://phabricator.wikimedia.org/T361536#9682456 (thcipriani) [22:37:38] Toolforge (Toolforge iteration 07), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9682466 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-... [22:52:45] (PS1) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [23:06:17] (PS2) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [23:11:35] (PS3) Andrew Bogott: Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 [23:14:06] (CR) Andrew Bogott: [V:+2 C:+2] Move some codfw1dev passwords from 'codfw' site to 'common' [labs/private] - https://gerrit.wikimedia.org/r/1016460 (owner: Andrew Bogott) [23:40:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:45:41] (CloudVPSDesignateLeaks) firing: (4) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:55:54] Toolforge: "ftl" tool's perl5.32 webservice pod being frequently killed due to liveness probe failures - https://phabricator.wikimedia.org/T361652#9682576 (bd808) My initial hunch was 
that the [[https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Health_checks|liveness probe]] was interacting badly with t...