[01:00:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance clouddb-wikireplicas-query-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[01:20:28] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[02:58:04] Toolforge, cloud-services-team: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479 (Soda) I'm having/had similar issues with Redis on `qpqtool`; it seems like provisioning a fresh connection per request helps, but that does not seem to be the recommended method...
[02:59:40] Toolforge, cloud-services-team: Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479 (Soda) https://phabricator.wikimedia.org/P56878 is the timeout error in case that is interesting :)
[03:28:00] Grid-Engine-to-K8s-Migration: Migrate legobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319854 (Legoktm) ` tools.legobot@tools-sgebastion-10:~$ toolforge jobs list Job name: Job type: Status: ---------------- -------------------- ------------...
[04:00:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance clouddb-wikireplicas-query-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[04:12:04] Grid-Engine-to-K8s-Migration: Migrate legobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319854 (Legoktm) Open→Resolved ` tools.legobot@tools-sgebastion-10:~$ crontab -r tools.legobot@tools-sgebastion-10:~$ crontab -l no crontab for tools.legobot `
[04:20:28] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[04:36:28] Grid-Engine-to-K8s-Migration: Migrate apersonbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319561 (Legoktm) I moved over the Rust-looking jobs just now: ` enterpriseybot-afc-backlog-graphs schedule: 0 */12 * * * Waiting for scheduled time enterpriseybot-cat-tr...
[04:49:33] Grid-Engine-to-K8s-Migration: Migrate apersonbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319561 (Legoktm) Ok, he said: botreq status had someone else take over, I recall verify-login was for pywikibot afc-cat-track we don't need wir-report is low priority pendi...
[05:05:14] ToolforgeBundle, SVG Translate Tool, Community-Tech (CommTech-Kanban), Patch-Needs-Improvement: Git tag/version fetching times out - https://phabricator.wikimedia.org/T334454 (Samwilson) The patches have been failing due to a peculiar issue with PHP changing how it handled XML prefixes. For examp...
[07:05:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance clouddb-wikireplicas-query-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[07:20:28] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[08:54:25] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-77
[08:55:06] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-77
[08:55:22] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[08:57:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:01:15] Cloud-VPS (Quota-requests): request temporary quota increase for project iiab - https://phabricator.wikimedia.org/T357694 (dcaro) +1
[09:05:11] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-33.tools.eqiad1.wikimedia.cloud to the cluster
[09:05:11] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:05:36] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-78
[09:06:16] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-78
[09:12:51] Cloud-VPS (Quota-requests): request temporary quota increase for project iiab - https://phabricator.wikimedia.org/T357694 (Slst2020) a:Slst2020
[09:13:21] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:24:31] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-34.tools.eqiad1.wikimedia.cloud to the cluster
[09:24:31] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:25:28] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:30:28] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:34:53] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-79
[09:35:28] (PuppetAgentNoResources) resolved: (3) No Puppet resources found on instance bastion on project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[09:35:35] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-79
[09:35:59] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:41:41] (CloudVPSDesignateLeaks) firing: Detected 7 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:45:40] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-35.tools.eqiad1.wikimedia.cloud to the cluster
[09:45:40] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[09:46:41] (CloudVPSDesignateLeaks) firing: (2) Detected 7 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:47:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:49:09] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-80
[09:49:50] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-80
[09:49:55] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[09:49:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:54:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:59:25] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-36.tools.eqiad1.wikimedia.cloud to the cluster
[09:59:25] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster
[10:05:28] (PuppetAgentNoResources) firing: No Puppet resources found on instance clouddb-wikireplicas-query-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:14:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:25:34] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.729% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[10:31:21] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:29] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[10:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:54] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:59] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255)
[10:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:32:12] !log dcaro@urcuchillay tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[10:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:37:17] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[10:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:37:56] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:42:56] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:48:26] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:51:41] (CloudVPSDesignateLeaks) firing: (2) Detected 14 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:53:26] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:56:41] (CloudVPSDesignateLeaks) resolved: (2) Detected 14 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:58:26] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:00:34] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.896% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:08:26] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:50:58] (PuppetAgentNoResources) resolved: No Puppet resources found on instance clouddb-wikireplicas-query-1 on project clouddb-services - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[12:12:20] (CR) CI reject: [V: -1] openstack: cloudvirt: add pre-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[12:14:07] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the tools cluster
[12:21:44] !log taavi@cloudcumin1001 tools Added a new k8s ingress tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud to the cluster
[12:21:44] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a ingress role in the tools cluster
[12:24:18] Grid-Engine-to-K8s-Migration: Migrate commons-android-app from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319638 (whym) Open→Resolved No subsequent breakage as far as I know, so I guess we are done!
[12:48:04] Cloud-VPS (Quota-requests): request temporary quota increase for project iiab - https://phabricator.wikimedia.org/T357694 (Slst2020) Open→Resolved Done! ` sstefanova@cloudcontrol1005:~$ sudo wmcs-openstack quota show iiab | grep ram | ram | 32768 | sstefanova@cloudcontrol1005:~$ su...
[12:54:16] !log iiab increase quota to 16 cores and 32768 ram T357694
[12:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Iiab/SAL
[12:54:21] T357694: request temporary quota increase for project iiab - https://phabricator.wikimedia.org/T357694
[13:14:09] (PS3) Arturo Borrero Gonzalez: openstack: cloudvirt: add pre-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765)
[13:34:05] (PS1) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[13:37:01] (CR) CI reject: [V: -1] openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[13:38:07] (PS2) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[13:46:52] (CR) Majavah: [C: -1] openstack: cloudvirt: add pre-reimage cookbook (7 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[13:48:02] (CR) Majavah: openstack: cloudvirt: add post-reimage cookbook (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[14:19:29] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) The database is back up with about 10G of empty space.
[14:21:40] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) Starting a `VACUUM` everywhere to hopefully free up some space: ` ubuntu@dbapp:~$ docker exec -it database vacuumdb -U postgres --all --echo `
[14:24:28] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) Postgres seems somewhat unhappy with persisting data to disk. ` 2024-02-16 14:22:55.253 UTC [18] WARNING: archiving write-ahead log file "000000010000065D00000033"...
[14:36:07] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) That WAL file exists both in `pg_wal` and `wal_archive`: ` ubuntu@dbapp:/var/lib/postgresql/data$ sudo ls -la pgdata/pg_wal/000000010000065D00000033 wal_archive/000...
[14:36:25] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) That did something, and now the disk space is going down rapidly.
[14:44:56] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) ` I have no name!@550490a61acf:/var/lib/postgresql/data$ pg_controldata | grep "Latest checkpoint's REDO WAL file" Latest checkpoint's REDO WAL file: 00000001000...
[14:48:31] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) Ok, the database is back up now with some breathing room. ` /dev/sdb 501G 348G 131G 73% /var/lib/postgresql ` wal_archive itself is tiny, but seems to be...
[15:01:07] Toolforge, cloud-services-team, Patch-For-Review: Toolforge: Introduce grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665 (dcaro) p:Triage→Medium
[15:12:00] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) The archive folder is growing at a very worrying rate: ` root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mount...
[15:15:01] Toolforge (Quota-requests), Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/197 maintain-kubeusers: increase quo...
[15:21:12] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) I repeated T355138#9550687. ` root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G...
[15:27:31] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[15:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:28:01] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[15:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:34:38] Toolforge (Quota-requests), Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209 (dcaro) Open→Resolved a:dcaro Done: ` root@tools-k8s-control-6:~# kubectl -n tool-wd-shex-infer get resourcequotas tool-wd-shex-infer -o js...
[15:34:40] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (dcaro)
[15:34:42] PAWS: Increase prometheus retention time - https://phabricator.wikimedia.org/T357786 (rook)
[15:44:53] cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1044:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357604 (dcaro) Open→Resolved a:dcaro This is fixed now, probably an artifact of the rebuild: {F41930577}
[15:47:31] cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1067:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357579 (dcaro) Open→Resolved a:dcaro This was probably an artifact of the rebuild: {F41930608}
[15:54:48] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) Disk usage seems to finally have stabilized at this: `lang=shell-session root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used...
[15:56:11] (PS1) CDanis: Add faux secret for jaeger in idp [labs/private] - https://gerrit.wikimedia.org/r/1004164
[15:58:02] (CR) Filippo Giunchedi: [C: +1] Add faux secret for jaeger in idp [labs/private] - https://gerrit.wikimedia.org/r/1004164 (owner: CDanis)
[15:59:18] (CR) CDanis: [V: +2 C: +2] Add faux secret for jaeger in idp [labs/private] - https://gerrit.wikimedia.org/r/1004164 (owner: CDanis)
[16:10:16] (CR) JHathaway: [C: +1] Add faux secret for jaeger in idp [labs/private] - https://gerrit.wikimedia.org/r/1004164 (owner: CDanis)
[16:20:49] (PS3) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:21:42] (CR) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[16:24:27] (CR) CI reject: [V: -1] openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[16:29:14] (PS4) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:30:42] (PS5) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:32:16] (PS6) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:45:46] (PS4) Arturo Borrero Gonzalez: openstack: cloudvirt: add pre-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765)
[16:45:48] (PS7) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:46:48] (PS5) Arturo Borrero Gonzalez: openstack: cloudvirt: add pre-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765)
[16:46:50] (PS8) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:52:35] (PS6) Arturo Borrero Gonzalez: openstack: cloudvirt: add pre-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765)
[16:52:37] (PS9) Arturo Borrero Gonzalez: openstack: cloudvirt: add post-reimage cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004116 (https://phabricator.wikimedia.org/T357765)
[16:53:04] (CR) Arturo Borrero Gonzalez: openstack: cloudvirt: add pre-reimage cookbook (7 comments) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1004088 (https://phabricator.wikimedia.org/T357765) (owner: Arturo Borrero Gonzalez)
[17:47:42] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi) Something has happened and now most of the disk is free: `lang=shell-session root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size U...
[18:09:48] cloud-services-team, Observability-Alerting: Alertmanager Phabricator integration for WMCS alerts is too spammy - https://phabricator.wikimedia.org/T352059 (taavi) Open→Resolved a:taavi
[18:10:26] Cloud-Services: Wikimedia Israel GLAM Wiki Dashboard needs more storage space - https://phabricator.wikimedia.org/T357773 (thcipriani) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more...
[18:12:13] Cloud-VPS (Quota-requests): Wikimedia Israel GLAM Wiki Dashboard needs more storage space - https://phabricator.wikimedia.org/T357773 (JJMC89)
[18:14:00] Cloud-VPS (Quota-requests): Wikimedia Israel GLAM Wiki Dashboard needs more storage space - https://phabricator.wikimedia.org/T357773 (taavi)
[18:14:10] Cloud-VPS (Quota-requests): Wikimedia Israel GLAM Wiki Dashboard needs more storage space - https://phabricator.wikimedia.org/T357773 (JJMC89)
[18:14:16] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (JJMC89)
[18:15:11] Cloud-VPS, cloud-services-team: Rescue DBapp trove instance in glamwikidashboard project - https://phabricator.wikimedia.org/T355138 (taavi)
[18:15:46] Wikibugs: wikibugs having a hard time staying connected to libera.chat IRC network - https://phabricator.wikimedia.org/T357729 (valhallasw) Is anything visible from other IRC clients? i.e. do they see a connection reset by peer, a forced close due to X, ...? There could also be a server-specific component;...
[19:12:07] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (LucasWerkmeister)
[19:12:36] Toolforge (Quota-requests), Patch-For-Review: Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209 (LucasWerkmeister) Resolved→Open `requests.memory` is now set to 5 Gi, rather than 8 Gi as I requested. Is this intentional?
[19:15:36] Cloud-VPS, Data-Services, cloud-services-team (FY2023/2024-Q3-Q4), Patch-For-Review: [toolsdb] [cinder] [ceph] Deleting snapshot does not work - https://phabricator.wikimedia.org/T356904 (fnegri) I reproduced the problem again and I understood better what's happening: * I created a snapshot from...
[19:31:18] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:26:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:27:21] Wikibugs: wikibugs having a hard time staying connected to libera.chat IRC network - https://phabricator.wikimedia.org/T357729 (AntiCompositeNumber) `lang=irc Feb 06 07:06:09 ◀━━ Quits: wikibugs (~wikibugs2@wikimedia/bot/pywikibugs) (Remote host closed the connection) Feb 06 08:00:59 ◀━━ Quits: wikibugs (~wi...
[21:31:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:36:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:37:48] Wikibugs: wikibugs having a hard time staying connected to libera.chat IRC network - https://phabricator.wikimedia.org/T357729 (bd808) p:Triage→High >>! In T357729#9551508, @valhallasw wrote: > There could also be a server-specific component; from the SGE era I vaguely remember that wikibugs had trou...
[21:41:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:43:14] Wikibugs, User-bd808: wikibugs having a hard time staying connected to libera.chat IRC network - https://phabricator.wikimedia.org/T357729 (bd808) a:bd808 Claiming this as a signal that I'm actively looking at the code and the logs to see if there is anything that I can do to make things more stable....
[21:48:44] Wikibugs: host "tools-sgebastion-07.tools.eqiad.wmflabs" is not an admin host - https://phabricator.wikimedia.org/T262268 (bd808) Open→Declined The bot's processes run on Kubernetes via the #toolforge_jobs_framework these days which makes the old grid engine management issues moot.
[21:49:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:50:29] Wikibugs: Improve tags used in IRC messages - https://phabricator.wikimedia.org/T161249 (bd808)
[21:54:56] (SystemdUnitDown) resolved: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:56:59] Wikibugs: Changes to wikibugs' IRC channel configuration for Babel, Bluespice, Connected-Open-Heritage - https://phabricator.wikimedia.org/T155165 (bd808) Open→Resolved a:Paladox Let's resolve this very old request. @Paladox the the constructive work asked for. The `#wikmedia-dev` libera.chat cha...
[22:03:45] Wikibugs: All phabricator tags emitted with blue color - https://phabricator.wikimedia.org/T357828 (bd808)
[22:06:57] Wikibugs: Tag detection is broken - https://phabricator.wikimedia.org/T166951 (bd808) Open→Resolved a:valhallasw I split {T357828} into its own bug report. That regression may have started with the fix @valhallasw made for the parse error, but it is very much its own problem at this point.
[22:11:50] Wikibugs: All phabricator tags emitted with blue color - https://phabricator.wikimedia.org/T357828 (bd808) p:Triage→Low Implementing {T1176} could be one way to fix this "regression" that I believe has now been the functional output longer than distinct tag colors were used. Tags have been hard coded...
[22:16:22] Wikibugs: All phabricator tags emitted with blue color - https://phabricator.wikimedia.org/T357828 (bd808)
[22:16:24] Wikibugs: Get icon and color from API instead of screen scraping - https://phabricator.wikimedia.org/T1176 (bd808)
[23:47:54] (CR) BryanDavis: [C: +2] wikibugs: Extract XACT to page anchor mappings from data-javelin-init-data [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007) (owner: BryanDavis)
[23:48:35] (Merged) jenkins-bot: wikibugs: Extract XACT to page anchor mappings from data-javelin-init-data [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/1003127 (https://phabricator.wikimedia.org/T199007) (owner: BryanDavis)
[23:52:23] Wikibugs: wb2-irc: reconnect to redis after errors - https://phabricator.wikimedia.org/T89480 (bd808) Open→Resolved a:Legoktm {40da2458da0424251533ab37b166d7062bf60d72}
[23:56:28] Wikibugs: get redis2irc to show formatting errors - https://phabricator.wikimedia.org/T89674 (bd808) @valhallasw I bet you knew exactly what this meant in 2015, but I'm wondering if you can expand on the issue now or if we should just decline it?