[02:51:10] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:52:25] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[04:53:25] (CR) Lokal Profil: [C:+2] "recheck" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1262320 (owner: Akoopal)
[04:55:12] (Merged) jenkins-bot: docker setup for importing .ssh dir for proxy [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1262320 (owner: Akoopal)
[05:06:46] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11812621 (Marostegui) Thank you - let me know if I can help
[06:12:47] (open) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[06:16:43] (update) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[06:23:52] (open) vriaa: fix: improve image URL tooltip with clearer instructions [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/51
[06:24:37] (open) vriaa: fix: use :deep() to style CdxMenu elements after Codex update
[toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/52
[06:30:38] (open) vriaa: feat: add descriptions to viewport menu breakpoint items [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/53
[06:39:46] (update) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[06:44:47] (update) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[06:49:39] (update) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[06:50:37] (merge) r4356th: Preserve content inside tt tags [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/15
[07:17:46] Tools, PHP 8.4 support, PHP 8.5 support: labs/tools/blankpages fails its CI tests in PHP 8.4 and 8.5 - https://phabricator.wikimedia.org/T419076#11812762 (A_smart_kitten) cc @krinkle as maintainer (FWICS from )
[07:19:15] Tools, PHP 8.4 support, PHP 8.5 support: labs/tools/blankpages fails its CI tests in PHP 8.4 and 8.5 - https://phabricator.wikimedia.org/T419076#11812768 (Krinkle) p:Triage→Low
[07:33:17] cloud-services-team, Cloud-VPS: Keystone logs no longer appearing in logstash - https://phabricator.wikimedia.org/T421911#11812813 (fgiunchedi) There isn't very much more info though I opened {T422830}
[07:55:30] cloud-services-team (FY2025/2026-Q3-Q4), Toolforge: cadvisor-reported Istio network usage is way too high - https://phabricator.wikimedia.org/T421386#11812851 (taavi) Open→Resolved
[08:53:13] Tool-curator: Curator: Select every Xth file of Mapillary sequences -
https://phabricator.wikimedia.org/T423095 (PantheraLeo1359531) NEW
[08:56:53] Tool-curator: Curator: Select every Xth file of Mapillary sequences - https://phabricator.wikimedia.org/T423095#11813336 (PantheraLeo1359531) A good example is sequence uf1zk5b946n7ixsj1bp604 (https://www.mapillary.com/app/user/sermersooq?pKey=2929205344063427&lat=64.18562599999998&lng=-51.721243000000015&z=...
[09:41:27] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Move trove DB instances to rabbitmq transient quorum queues - https://phabricator.wikimedia.org/T421857#11813527 (fgiunchedi) I have finished applying the configuration change to all eqiad1 trove instances, this time around more or less manually. Next u...
[09:59:32] cloud-services-team (FY2025/2026-Q3-Q4), Cloud-VPS: Move trove DB instances to rabbitmq transient quorum queues - https://phabricator.wikimedia.org/T421857#11813558 (fgiunchedi) This is completed in eqiad1, codfw next
[10:37:25] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[12:10:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[12:26:56] FIRING: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:27:18] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/68
[12:27:19] (open) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/39
[12:28:52] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1270427 (owner: L10n-bot)
[12:28:55] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/watch-translations] - https://gerrit.wikimedia.org/r/1270431 (owner: L10n-bot)
[12:38:54] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814093 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[12:48:12] Tool-schedule-deployment: ScheduleDeploymentBot should escape wikitext in commit message ({{deploy}} |title= parameter) - https://phabricator.wikimedia.org/T423124 (Lucas_Werkmeister_WMDE) NEW
[13:15:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[13:16:25] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster
k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[13:24:13] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814236 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: - cloudd...
[13:24:49] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[13:26:56] RESOLVED: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:35:05] Tool-humaniki-2: Setup push to deploy - https://phabricator.wikimedia.org/T422347#11814287 (Raymond_Ndibe) Yes Danya, push-to-deploy doesn't support webservice yet. We are working on that, though it won't be out like say, tomorrow
[13:35:38] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814289 (Jclark-ctr) Found a fried circuit on the board. Replaced the board and moved the CPUs over since the new ones did not match. The fault still continued on the new b...
[13:41:14] (update) raymond-ndibe: common.yaml: set max_query_length [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1211 (https://phabricator.wikimedia.org/T422453)
[13:43:05] (merge) raymond-ndibe: common.yaml: set max_query_length [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1211 (https://phabricator.wikimedia.org/T422453)
[13:47:03] Toolforge, tools-platform-team, Patch-For-Review: logs-api: handle exception raised when query range exceeds max_query_length - https://phabricator.wikimedia.org/T422454#11814353 (Raymond_Ndibe) In progress→Resolved
[13:48:45] (update) fnegri: Replace only views that need updating [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/9 (https://phabricator.wikimedia.org/T351637)
[13:48:45] (update) fnegri: Add --diff-mode and remove --dry-run [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/10 (https://phabricator.wikimedia.org/T351637)
[13:48:46] (update) fnegri: Add summary with counts [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/11 (https://phabricator.wikimedia.org/T351637)
[13:48:46] (open) fnegri: Catch SQL errors [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/12 (https://phabricator.wikimedia.org/T351637)
[13:48:47] (update) fnegri: Catch SQL errors [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/12 (https://phabricator.wikimedia.org/T351637)
[13:49:02] (update) fnegri: Catch SQL errors [repos/cloud/wikireplicas-utils] -
https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/12 (https://phabricator.wikimedia.org/T351637)
[13:49:07] (update) fnegri: Add summary with counts [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/11 (https://phabricator.wikimedia.org/T351637)
[13:49:09] (update) fnegri: Add --diff-mode and remove --dry-run [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/10 (https://phabricator.wikimedia.org/T351637)
[13:49:10] (update) fnegri: Replace only views that need updating [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/9 (https://phabricator.wikimedia.org/T351637)
[13:52:08] (update) fnegri: Replace only views that need updating [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/9 (https://phabricator.wikimedia.org/T351637)
[14:01:10] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[14:01:24] (update) fnegri: Catch SQL errors [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/12 (https://phabricator.wikimedia.org/T351637)
[14:01:25] (update) fnegri: Add --diff-mode and remove --dry-run [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/10 (https://phabricator.wikimedia.org/T351637)
[14:01:28] (update) fnegri: Add summary with counts [repos/cloud/wikireplicas-utils] -
https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/11 (https://phabricator.wikimedia.org/T351637)
[14:05:05] Toolforge, tools-platform-team: clis: only create tag on merge of the release patch - https://phabricator.wikimedia.org/T422452#11814492 (Raymond_Ndibe) ` Oppose. Pushing a tag should be the action that triggers the release pipeline. ` The tagging is still the action triggering the release. This task is...
[14:09:22] (open) r4356th: Correctly handle relative font sizes [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/16
[14:09:36] (merge) r4356th: Correctly handle relative font sizes [toolforge-repos/delintbot] - https://gitlab.wikimedia.org/toolforge-repos/delintbot/-/merge_requests/16
[14:10:56] Toolforge, tools-platform-team, Patch-For-Review: logs-api fails with cryptic error if query range is too far in the past e.g. --since 1000d - https://phabricator.wikimedia.org/T422453#11814521 (Raymond_Ndibe) Open→Resolved
[14:28:49] VPS-project-Phabricator, collaboration-services, Infrastructure-Foundations, Mail: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance - https://phabricator.wikimedia.org/T422559#11814640 (LSobanski) p:Triage→Low
[14:32:56] FIRING: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:44:22] (PS1) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[14:47:57] (CR) CI reject: [V:-1] roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[14:51:50] (PS2) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[14:53:47] (PS3) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[14:54:51] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11814863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
[14:57:14] (CR) CI reject: [V:-1] roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[14:58:24] (PS4) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[14:59:35] (PS5) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[15:01:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[15:01:19] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[15:03:42] (CR) CI reject: [V:-1] roll_reboot_osds: Add an argument to resume a
partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[15:07:42] (PS6) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[15:09:51] Tools: sal csp violation not showing on csp-report - https://phabricator.wikimedia.org/T422916#11814952 (bd808) I think [[https://gitlab.wikimedia.org/toolforge-repos/csp-report/-/blob/5bcc60ee0412023d37daa20e874efee773cf0fa9/csp/__init__.py#L125|this rule in the report processing logic]] may have discarded...
[15:11:04] (CR) CI reject: [V:-1] roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[15:12:42] (PS7) Andrew Bogott: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468
[15:15:16] Tools: CSP violations with known domains in the blocked-uri are not collected by csp-report - https://phabricator.wikimedia.org/T422916#11815009 (bd808)
[15:17:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[15:17:08] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[15:17:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[15:17:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[15:17:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[15:20:14] PROBLEM - Host cloudcephosd1037 is DOWN: PING CRITICAL - Packet loss = 100%
[15:20:59] Tools: CSP violations with known domains in the blocked-uri are not collected by csp-report - https://phabricator.wikimedia.org/T422916#11815044 (bd808) I guess the question to ask know is if the me that decided this rule to filter out some obviously false positive reports from I assume older User-Agents
is...
[15:23:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:23:44] RECOVERY - Host cloudcephosd1037 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:28:09] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815072 (Marostegui) Open→Resolved Thanks John for trying swapping many parts - unfortunately it didn't work so I am going to close this task and open a new one...
[15:28:55] cloud-services-team, Data-Services, DBA, decommission-hardware: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151 (Marostegui) NEW
[15:29:08] cloud-services-team, Data-Services, DBA, decommission-hardware: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11815089 (Marostegui)
[15:29:10] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815088 (Marostegui)
[15:29:23] cloud-services-team, Data-Services, DBA, decommission-hardware: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11815091 (Marostegui)
[15:29:31] cloud-services-team, Data-Services, DBA, decommission-hardware: decommission clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T423151#11815093 (Marostegui) a:Marostegui
[15:38:51] Tools: CSP violations with known domains in the blocked-uri are not collected by csp-report - https://phabricator.wikimedia.org/T422916#11815121 (bd808) {T422829} seems to have been the problem behind the scenes here that led to the discarded report.
[15:44:47] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, tools-platform-team, Data-Persistence: Install a clouddb hosts with Debian Trixie - https://phabricator.wikimedia.org/T415165#11815165 (fnegri) > should we go back to reimage clouddb1015 clouddb1015 is the only clouddb with s4 and s6, until...
[15:49:36] (update) fnegri: refactor image parsing and handling [repos/cloud/toolforge/jobs-api] (improve_image_parsing_tests) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/273 (https://phabricator.wikimedia.org/T415322) (owner: raymond-ndibe)
[15:49:38] (update) fnegri: improve image parsing tests [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/282 (https://phabricator.wikimedia.org/T415322) (owner: raymond-ndibe)
[15:55:53] Toolforge, tools-platform-team: Continuous job failed to start due to missing envvar specified in secrets specification - https://phabricator.wikimedia.org/T422929#11815265 (bd808)
[16:09:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[16:14:23] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, tools-platform-team, Data-Persistence: Install a clouddb hosts with Debian Trixie - https://phabricator.wikimedia.org/T415165#11815351 (Marostegui) sounds good from my side yes!
[16:14:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[16:15:01] cloud-services-team, Data-Services, DBA, DC-Ops, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11815363 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors: - cl...
[16:15:22] cloud-services-team (FY2025/2026-Q3-Q4), Data-Services, tools-platform-team, Data-Persistence: Install a clouddb hosts with Debian Trixie - https://phabricator.wikimedia.org/T415165#11815365 (fnegri) a:Marostegui→fnegri Re-claiming this task, I'll start with clouddb1022 then.
[16:20:58] (CR) Jean-Frédéric: [C:+1] Switch to Python 3.11 as default interpreter (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251441 (https://phabricator.wikimedia.org/T409003) (owner: Jean-Frédéric)
[16:23:17] (CR) Jean-Frédéric: [C:+2] Added progress parameter to update_database.py [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1268098 (owner: Akoopal)
[16:24:04] VPS-project-Codesearch: Codesearch stuck at Feb 12th? - https://phabricator.wikimedia.org/T421147#11815410 (Daimona) Resolved→Open >>! In T421147#11810517, @A_smart_kitten wrote: > Is this reoccuring (or e.g. did it never stop occurring, at least for `operations/mediawiki-config` in the 'Everything'...
[16:27:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:28:21] (Merged) jenkins-bot: Added progress parameter to update_database.py [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1268098 (owner: Akoopal)
[17:03:23] Tool-etherpad-backup: Setup proof of concept storage and retrieval from WMCS object storage - https://phabricator.wikimedia.org/T422958#11815796 (bd808) I meant to stay in this rabbit hole a bit longer with testing things, but my wiggles got the best of me over the weekend and I jumped past testing and into...
[17:04:30] Tool-humaniki-2: Setup push to deploy - https://phabricator.wikimedia.org/T422347#11815802 (Danya) >>! In T422347#11814287, @Raymond_Ndibe wrote: > Yes Danya, push-to-deploy doesn't support webservice yet. We are working on that, though it won't be out like say, tomorrow No problems ^^ I just didn’t see the...
[17:33:19] !log tools.cluebotng-editsets Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/24357384007 (https://github.com/cluebotng/component-configs/commits/80829bf9249aa902be5209b117674eccf8283939)
[17:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-editsets/SAL
[17:33:58] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/24357383991 (https://github.com/cluebotng/component-configs/commits/80829bf9249aa902be5209b117674eccf8283939)
[17:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[17:38:32] cloud-services-team (FY2025/2026-Q1-Q2), Cloud-VPS, Toolforge (Toolforge iteration 22), Incident Severity 3, Wikimedia-Incident: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281#11816045 (MLechvien-WMF)
[17:39:14] cloud-services-team, Toolforge: lighttpd: starting a webserver logs "WARNING: unknown config-key: server.dir-listing (ignored)" to error.log - https://phabricator.wikimedia.org/T423019#11816055 (bd808) The config in question comes from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-imag...
[17:44:35] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/68 (owner: l10n-bot)
[17:44:43] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net.
[toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/68 (owner: l10n-bot)
[17:44:49] (CR) Andrew Bogott: [C:+2] roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[17:45:33] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/39 (owner: l10n-bot)
[17:45:42] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/39 (owner: l10n-bot)
[17:48:16] (Merged) jenkins-bot: roll_reboot_osds: Add an argument to resume a partial roll [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1270468 (owner: Andrew Bogott)
[17:52:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[17:52:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[17:52:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[17:52:57] cloud-services-team, Toolforge: lighttpd: starting a webserver logs "WARNING: unknown config-key: server.dir-listing (ignored)" to error.log - https://phabricator.wikimedia.org/T423019#11816191 (bd808) Reading https://redmine.lighttpd.net/projects/lighttpd/wiki/Mod_dirlisting it looks like the modern set...
[18:06:06] (open) apaskulin: Draft: Copyedits [toolforge-repos/wikimedia-attribution] - https://gitlab.wikimedia.org/toolforge-repos/wikimedia-attribution/-/merge_requests/2
[18:06:54] (CR) Alien4444: [C:+2] Localisation updates from https://translatewiki.net.
[labs/xtools] - https://gerrit.wikimedia.org/r/1270425 (owner: L10n-bot)
[18:10:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[18:10:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[18:11:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[18:22:43] (update) apaskulin: Draft: Copyedits [toolforge-repos/wikimedia-attribution] - https://gitlab.wikimedia.org/toolforge-repos/wikimedia-attribution/-/merge_requests/2
[18:26:34] cloud-services-team, Toolforge, Privacy Engineering, ContentSecurityPolicy: Add Content-Security-Policy header enforcing 3rd party web interaction restrictions to proxy responses - https://phabricator.wikimedia.org/T130748#11816376 (A_smart_kitten)
[18:30:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[18:40:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[18:45:24] (update) apaskulin: Copyedits [toolforge-repos/wikimedia-attribution] - https://gitlab.wikimedia.org/toolforge-repos/wikimedia-attribution/-/merge_requests/2
[18:50:54] (update) raymond-ndibe: jobs-api: test for proper handling of the diff variations of the --image argument [repos/cloud/toolforge/toolforge-deploy] -
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1113 (https://phabricator.wikimedia.org/T414978 https://phabricator.wikimedia.org/T415322)
[18:54:01] (close) raymond-ndibe: images.py: match variants of the same image [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/284 (https://phabricator.wikimedia.org/T414978 https://phabricator.wikimedia.org/T415322)
[18:55:43] (update) raymond-ndibe: images.py: add tests for image variant matching [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/286 (https://phabricator.wikimedia.org/T415322)
[18:57:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=99)
[19:02:44] FIRING: MaintainDBUsersManyErrors: Maintain-dbusers is having sustained errors - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainDBUsersManyErrors - https://grafana.wikimedia.org/d/ae240a06-c13e-49f3-b12c-58432c551e85/wmcs-maintain-dbusers - https://alerts.wikimedia.org/?q=alertname%3DMaintainDBUsersManyErrors
[19:06:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.roll_reboot_osds
[19:09:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[19:30:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.roll_reboot_osds (exit_code=0)
[19:41:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning -
https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:53:12] (update) raymond-ndibe: images.py: add tests for image variant matching [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/286 (https://phabricator.wikimedia.org/T415322) [19:58:36] cloud-services-team, Cloud-VPS: Consider enabling static website support in the WMCS radosgw implementation of S3 buckets - https://phabricator.wikimedia.org/T423194 (bd808) NEW [19:58:45] cloud-services-team, Cloud-VPS: Consider enabling static website support in the WMCS radosgw implementation of S3 buckets - https://phabricator.wikimedia.org/T423194#11816702 (bd808) [19:59:39] cloud-services-team, Cloud-VPS: Consider enabling static website support in the WMCS radosgw implementation of S3 buckets - https://phabricator.wikimedia.org/T423194#11816703 (bd808) `lang=irc [16:22] < bd808> I was exploring our S3-compatible object storage stuff and found that `s3cmd ws-create s3:/...
[19:59:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [20:00:36] cloud-services-team, Cloud-VPS: Consider enabling static website support in the WMCS radosgw implementation of S3 buckets - https://phabricator.wikimedia.org/T423194#11816708 (bd808) [20:05:09] (CR) Lokal Profil: [C:+1] Switch to Python 3.11 as default interpreter (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251441 (https://phabricator.wikimedia.org/T409003) (owner: Jean-Frédéric) [20:07:51] (PS1) Lokal Profil: Fix SSH proxy entrypoint and README issues [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270551 [20:09:34] (update) raymond-ndibe: values.yaml: add image variant name to aliases [repos/cloud/toolforge/image-config] (replace_job_with_webservice_image_variants) - https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/21 (https://phabricator.wikimedia.org/T415322) [20:17:56] RESOLVED: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:17:56] RESOLVED: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:26:20] (CR) Akoopal: [C:+2] Fix SSH proxy entrypoint and README issues [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270551 (owner: Lokal Profil) [20:28:13] (Merged) jenkins-bot: Fix SSH proxy entrypoint and README issues [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270551 (owner: Lokal Profil) [20:28:15] (CR) Akoopal: [C:+2] "Tested, working on my mac with using the SSH dir, code changes make sense." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270551 (owner: Lokal Profil) [20:30:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [20:34:23] (CR) Jean-Frédéric: [C:+2] Switch to Python 3.11 as default interpreter [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251441 (https://phabricator.wikimedia.org/T409003) (owner: Jean-Frédéric) [20:35:17] (update) raymond-ndibe: images.py: add tests for image variant matching [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/286 (https://phabricator.wikimedia.org/T415322) [20:36:29] (Merged) jenkins-bot: Switch to Python 3.11 as default interpreter [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1251441 (https://phabricator.wikimedia.org/T409003) (owner: Jean-Frédéric) [20:58:32] (PS1) Lokal Profil: Fix removed (in 9+) pywikibot APIs in bot scripts
[labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270569 (https://phabricator.wikimedia.org/T409003) [21:00:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [21:02:32] (PS1) Lokal Profil: Bump pywikibot to >= 11.0.0 [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1270570 (https://phabricator.wikimedia.org/T409003) [21:30:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [21:35:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [21:36:25] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [22:02:48] (PS1) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] -
https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) [22:03:18] (PS2) Dzahn: add fake keys for new zuul to connect to gerrit [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) [22:04:03] (CR) Dzahn: [V:+2 C:+2] "not-labs-not-private in labs/private" [labs/private] - https://gerrit.wikimedia.org/r/1270577 (https://phabricator.wikimedia.org/T422895) (owner: Dzahn) [22:12:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:17:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:50:03] (PS1) Cwhite: logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516) [22:51:01] (CR) Cwhite: [V:+2 C:+2] logging: add ocsp secret [labs/private] - https://gerrit.wikimedia.org/r/1270586 (https://phabricator.wikimedia.org/T350516) (owner: Cwhite) [22:56:12] (PS1) Cwhite: Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589 [22:56:51] (CR) Cwhite: [V:+2 C:+2] Revert "logging: add dummy pki "secrets"" [labs/private] - https://gerrit.wikimedia.org/r/1270589 (owner: Cwhite) [23:02:59] FIRING: MaintainDBUsersManyErrors: Maintain-dbusers is having sustained errors - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainDBUsersManyErrors - https://grafana.wikimedia.org/d/ae240a06-c13e-49f3-b12c-58432c551e85/wmcs-maintain-dbusers - https://alerts.wikimedia.org/?q=alertname%3DMaintainDBUsersManyErrors [23:03:18] (update) raymond-ndibe: replace job images with web images [repos/cloud/toolforge/jobs-api] (add_tests_for_image_variant_matching) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/263 (https://phabricator.wikimedia.org/T415322) [23:04:21] (update) raymond-ndibe: replace job images with web images [repos/cloud/toolforge/jobs-api] (add_tests_for_image_variant_matching) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/263 (https://phabricator.wikimedia.org/T415322) [23:05:37] (update) raymond-ndibe: jobs-api: use webservice image variants in one-off job tests [repos/cloud/toolforge/toolforge-deploy] (test_for_image_argument_handling_in_jobs) -
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1115 (https://phabricator.wikimedia.org/T415322) [23:11:12] (update) raymond-ndibe: values.yaml: hoist web image variants to top of config [repos/cloud/toolforge/image-config] - https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/18 (https://phabricator.wikimedia.org/T415322) [23:11:24] (update) raymond-ndibe: values.yaml: hoist web image variants to top of config [repos/cloud/toolforge/image-config] - https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/18 (https://phabricator.wikimedia.org/T415322) [23:11:51] (update) raymond-ndibe: values.yaml: hoist web image variants to top of config [repos/cloud/toolforge/image-config] - https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/18 (https://phabricator.wikimedia.org/T415322) [23:12:03] Tool-schedule-deployment: ScheduleDeploymentBot should escape wikitext in commit message ({{deploy}} |title= parameter) - https://phabricator.wikimedia.org/T423124#11817338 (bd808) https://gitlab.wikimedia.org/toolforge-repos/schedule-deployment/-/blob/801a3f492fb9159f845d3b7161104be14cb93a67/src/deployments... [23:13:26] (update) raymond-ndibe: improve image parsing tests [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/282 (https://phabricator.wikimedia.org/T415322) [23:26:46] Tool-schedule-deployment: ScheduleDeploymentBot should escape wikitext in commit message ({{deploy}} |title= parameter) - https://phabricator.wikimedia.org/T423124#11817359 (bd808) @Lucas_Werkmeister_WMDE I think I made a narrow solution for {T372750} in part because I was reflecting on #stashbot and how it...
[23:35:14] (update) raymond-ndibe: refactor image parsing and handling [repos/cloud/toolforge/jobs-api] (improve_image_parsing_tests) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/273 (https://phabricator.wikimedia.org/T415322) [23:35:50] (update) raymond-ndibe: images.py: add tests for image variant matching [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/286 (https://phabricator.wikimedia.org/T415322) [23:35:57] (update) raymond-ndibe: images.py: add tests for image variant matching [repos/cloud/toolforge/jobs-api] (refactor_image_handling) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/286 (https://phabricator.wikimedia.org/T415322) [23:36:59] (update) raymond-ndibe: values.yaml: add image variant name to aliases [repos/cloud/toolforge/image-config] (replace_job_with_webservice_image_variants) - https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/21 (https://phabricator.wikimedia.org/T415322) [23:37:53] (update) raymond-ndibe: jobs-api: test for proper handling of the diff variations of the --image argument [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1113 (https://phabricator.wikimedia.org/T414978 https://phabricator.wikimedia.org/T415322) [23:38:17] (update) raymond-ndibe: replace job images with web images [repos/cloud/toolforge/jobs-api] (add_tests_for_image_variant_matching) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/263 (https://phabricator.wikimedia.org/T415322) [23:38:51] (update) raymond-ndibe: replace job images with web images [repos/cloud/toolforge/jobs-api] (add_tests_for_image_variant_matching) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/263
(https://phabricator.wikimedia.org/T415322) [23:40:21] (update) raymond-ndibe: jobs-api: use webservice image variants in one-off job tests [repos/cloud/toolforge/toolforge-deploy] (test_for_image_argument_handling_in_jobs) - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1115 (https://phabricator.wikimedia.org/T415322) [23:56:10] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [23:56:25] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of memory - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity