[01:16:17] PAWS: microtask for T388234 - https://phabricator.wikimedia.org/T389577 (13) NEW
[01:17:11] PAWS: microtask for T388234 - https://phabricator.wikimedia.org/T389577#10660162 (13)
[01:46:08] wikitech.wikimedia.org, Wikimedia-production-error: Wikitech-static search DBQueryError - https://phabricator.wikimedia.org/T389578 (Dylsss) NEW
[03:29:32] PAWS: create a dynamic banner - microtask for T388234 - https://phabricator.wikimedia.org/T389577#10660245 (Pppery)
[03:29:35] PAWS: create a dynamic banner - microtask for T388234 - https://phabricator.wikimedia.org/T389577#10660246 (Pppery)
[07:34:40] Toolforge-standards-committee: Adoption request for "request" tool - https://phabricator.wikimedia.org/T389540#10660630 (Tkarcher) > Merged on 2024-05-21. That's nearly a year ago. Maybe the view was there but dangling until recently? The errors started shortly after that: Last successful run was on [[ https...
[08:30:12] Toolforge-standards-committee: Adoption request for "request" tool - https://phabricator.wikimedia.org/T389540#10660760 (Ladsgroup) Yup, I removed it. The patch that removes it: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1025821 has a couple of examples of how to change it. e.g. http...
[09:57:07] Cloud-Services, cloud-services-team, Elasticsearch, Data-Platform-SRE (2025.03.22 - 2025.04.11), SRE Observability (FY2024/2025-Q3): Cloudelastic alerts should route to data platform alerts, not wmcs - https://phabricator.wikimedia.org/T388270#10661016 (Gehel)
[10:01:10] Cloud Services Proposals, cloud-services-team (FY2024/2025-Q3-Q4), Data-Services, Data-Persistence, Data-Platform-SRE (2025.03.22 - 2025.04.11): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#10661115 (Gehel)
[10:03:27] cloud-services-team, Data-Services, Elasticsearch, Data-Platform-SRE (2025.03.22 - 2025.04.11), SRE Observability (FY2024/2025-Q3): Cloudelastic alerts should route to data platform alerts, not wmcs - https://phabricator.wikimedia.org/T388270#10661153 (taavi)
[11:51:26] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612 (MatthewVernon) NEW
[12:05:58] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612#10661495 (Ladsgroup) Where can I find the list of them? For example, db sections can be found via https://noc.wikimedia.org/dbconfig/eqiad.json and front-end are being discovered via https://co...
[12:12:52] Tool-fault-tolerance: fault-tolerance missing thanos-fe2004 - https://phabricator.wikimedia.org/T389615 (MatthewVernon) NEW
[12:17:09] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612#10661570 (MatthewVernon) They're in puppet - for swift it's `profile::swift::storagehosts:` (different per DC), for thanos it's `profile::thanos::swift::backends:`, and for apus it's in `cephad...
[12:21:03] PAWS: create a dynamic banner - microtask for T388234: httpss://github.com/Jemeelah1/Dynamic-Banner - https://phabricator.wikimedia.org/T389577#10661576 (13)
[12:25:43] Tool-fault-tolerance: fault-tolerance missing thanos-fe2004 - https://phabricator.wikimedia.org/T389615#10661591 (Ladsgroup) Open→Resolved a:Ladsgroup Updated netbox file to the latest and it now shows up. Once this is productionized, it would be done automatically.
[12:26:46] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612#10661595 (Ladsgroup) Reading from puppet repo is not super nice, it's possible but it's brittle, do you have an http endpoint I can reuse?
[12:29:55] Tool-fault-tolerance: fault-tolerance missing thanos-fe2004 - https://phabricator.wikimedia.org/T389615#10661601 (MatthewVernon) Thanks :)
[12:35:59] Tools: [zoomviewer] Zoomviewer is to slow - https://phabricator.wikimedia.org/T389617 (Kristbaum) NEW
[12:36:14] Tools: [Zoomviewer] Zoomviewer is to slow - https://phabricator.wikimedia.org/T389617#10661639 (Kristbaum)
[12:36:56] Tools: [Zoomviewer] Zoomviewer is too slow - https://phabricator.wikimedia.org/T389617#10661645 (Kristbaum)
[13:27:12] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612#10661859 (MatthewVernon) I'm sorry, I don't think I do. The swift (ms and thanos) rings are on the puppetserver and thus available (but probably not by https), but you don't really want to be g...
[13:34:44] Tool-fault-tolerance: Fault-tolerance tool should have a backend option - https://phabricator.wikimedia.org/T389612#10661875 (MatthewVernon) My other dumb idea would be: netbox knows the answer, and has http endpoints, could that be the solution?
[14:03:15] (PS6) Slyngshede: Add Bitu container [labs/striker] - https://gerrit.wikimedia.org/r/1035718 (https://phabricator.wikimedia.org/T362318)
[14:30:50] (PS7) Slyngshede: Add Bitu container [labs/striker] - https://gerrit.wikimedia.org/r/1035718 (https://phabricator.wikimedia.org/T362318)
[14:36:13] (CR) Slyngshede: Add Bitu container (3 comments) [labs/striker] - https://gerrit.wikimedia.org/r/1035718 (https://phabricator.wikimedia.org/T362318) (owner: Slyngshede)
[15:23:51] cloud-services-team, Data-Services, Elasticsearch, Data-Platform-SRE (2025.03.22 - 2025.04.11), SRE Observability (FY2024/2025-Q3): Cloudelastic alerts should route to data platform alerts, not wmcs - https://phabricator.wikimedia.org/T388270#10662404 (bking) Open→Resolved a:bk...
[17:14:13] Toolforge-standards-committee: Adoption request for "request" tool - https://phabricator.wikimedia.org/T389540#10663104 (Tkarcher) >>! In T389540#10659244, @bd808 wrote: > Problem 0 here is that I can't find anything that looks remotely like a license inside of /data/project/request and the tool pre-dates th...
[18:33:47] FIRING: [3x] NodeDown: Node cloudcephosd1033 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[18:34:09] FIRING: [11x] CloudVirtDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[18:34:14] cloud-services-team: CloudVirtDown - https://phabricator.wikimedia.org/T389668 (phaultfinder) NEW
[18:34:50] FIRING: TooManyCloudvirtsDown: #page Reduced availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudvirtsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudvirtsDown
[18:34:57] cloud-services-team: TooManyCloudvirtsDown # page Reduced availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T389669 (phaultfinder) NEW
[18:35:09] FIRING: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown
[18:35:18] cloud-services-team: CephClusterInUnknown # page Ceph cluster in eqiad is in unknown status - https://phabricator.wikimedia.org/T389670 (phaultfinder) NEW
[18:36:56] FIRING: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:38:47] FIRING: [25x] NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[18:39:09] FIRING: [22x] CloudVirtDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[18:39:14] cloud-services-team: CloudVirtDown - https://phabricator.wikimedia.org/T389668#10663617 (phaultfinder)
[18:41:56] FIRING: [2x] SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:45:08] PROBLEM - Host cloudcephosd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:14] PROBLEM - Host cloudvirt1058 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:14] PROBLEM - Host cloudvirt1066 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:16] RECOVERY - Host cloudcephosd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[18:45:20] RECOVERY - Host cloudvirt1066 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[18:45:36] RECOVERY - Host cloudvirt1058 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[18:46:52] FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:46:56] FIRING: [3x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:48:47] RESOLVED: [25x] NodeDown: Node cloudcephmon1005 is down. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[18:49:09] RESOLVED: [22x] CloudVirtDown: Cloudvirt node cloudvirt1048 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CloudVirtDown - https://alerts.wikimedia.org/?q=alertname%3DCloudVirtDown
[18:49:50] RESOLVED: TooManyCloudvirtsDown: #page Reduced availability for CloudVPS eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TooManyCloudvirtsDown - https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m - https://alerts.wikimedia.org/?q=alertname%3DTooManyCloudvirtsDown
[18:50:09] RESOLVED: CephClusterInUnknown: #page Ceph cluster in eqiad is in unknown status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInUnknown - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInUnknown
[18:51:27] RESOLVED: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:51:56] FIRING: [4x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:52:14] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672 (Andrew) NEW
[18:55:13] cloud-services-team: TooManyCloudvirtsDown # page Reduced availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T389669#10663676 (Andrew) →Duplicate dup:T389672
[18:55:14] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10663678 (Andrew)
[18:55:41] cloud-services-team: CloudVirtDown - https://phabricator.wikimedia.org/T389668#10663687 (Andrew) →Duplicate dup:T389672
[18:55:45] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10663689 (Andrew)
[18:56:22] cloud-services-team: CephClusterInUnknown # page Ceph cluster in eqiad is in unknown status - https://phabricator.wikimedia.org/T389670#10663699 (Andrew) →Duplicate dup:T389672
[18:56:26] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10663701 (Andrew)
[18:57:28] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10663703 (Andrew) a:cmooney This is now resolved and indeed seems not to have been visible to users. I'm leaving open for Cathal's notes on what exactly happened.
[18:58:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-44 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:01:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:02:47] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10663716 (cmooney) Background here was I enabled OSPF on pretty much all the IP-enabled interfaces on the cloud switches in eqiad - see P74297 That should not really have affected anything, as I didn't...
[19:03:37] cloud-services-team, Cloud-VPS, IPv6: IPv6 support in cloud-private - https://phabricator.wikimedia.org/T379283#10663722 (cmooney) @aborrero I made some progress on this today, but we're not quite there. Unfortunately when I enabled OSPF on the cloud switches it caused some problems - see https://ph...
[19:08:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-68 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:52:08] PAWS: create a dynamic banner - microtask for T388234: httpss://github.com/Jemeelah1/Dynamic-Banner - https://phabricator.wikimedia.org/T389577#10664113 (Aklapper)
[22:18:56] cloud-services-team, Toolforge: toolforge build --envvar does not accept values containing equals character - https://phabricator.wikimedia.org/T389694 (Don-vip) NEW
[22:54:12] Tools, Wikidata, Security: Blocked Wikidata user sockpuppets are doing automated misconduct with QuickStatements - https://phabricator.wikimedia.org/T386978#10664435 (Bluerasberry) The misconduct persists.
[23:42:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[23:42:11] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T389701 (phaultfinder) NEW