[01:21:28] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[05:21:28] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:43:51] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:48:50] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:54:09] VPS-project-Wikistats: Merge Gamepedia table with Wikia table (and perhaps rename Wikia table to Fandom as well?) - https://phabricator.wikimedia.org/T377549 (GroupNebula563) NEW
[08:27:47] (CR) Brouberol: analytics_test_cluster: add secret (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[08:59:30] cloud-services-team, Toolforge (Toolforge iteration 16): Introduce health checks for Toolforge Jobs Framework cronjobs - https://phabricator.wikimedia.org/T377420#10241037 (aborrero) >>! In T377420#10239144, @bd808 wrote: > I wonder if adding support for declaring `concurrencyPolicy: Replace` for a sched...
[09:21:28] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:38:08] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1081132 (owner: L10n-bot)
[10:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:08:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[10:13:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[10:31:32] Quarry: Quarry shows error: This web service cannot be reached - https://phabricator.wikimedia.org/T375988#10241364 (rook) >>! In T375988#10236341, @GTrang wrote: >>>! In T375988#10186586, @rook wrote: >> Quarry is working again. Though I didn't have time to investigate what is happening so this may happen a...
[10:36:47] vivian-rook opened https://github.com/toolforge/paws/pull/454
[10:39:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[10:44:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[10:51:10] cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10241410 (aborrero) Open→In progress p:Medium→High Today the oomkiller victim was mariadb, which is maybe even more concerning that rabbitmq getting killed. I'll raise prio...
[11:08:04] cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10241501 (aborrero)
[11:13:12] cloud-services-team (Hardware): wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568 (aborrero) NEW
[11:30:26] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570 (aborrero) NEW
[11:31:08] cloud-services-team (Hardware), Goal: eqiad1: procure 1 additional cloudlb server - https://phabricator.wikimedia.org/T341062#10241560 (aborrero)
[11:31:09] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570#10241561 (aborrero)
[11:31:39] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570#10241558 (aborrero) p:Triage→Medium
[11:32:45] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570#10241566 (aborrero)
[11:36:49] cloud-services-team: WMCS hardware services: 3-node HA redundancy model - https://phabricator.wikimedia.org/T377570#10241570 (aborrero) cloudlb in codfw contains 3 nodes at the moment: * cloudlb2002-dev * cloudlb2003-dev * cloudlb2004-dev (replacing cloudlb2001-dev T377126)
[11:40:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[11:55:53] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[11:56:59] cloud-services-team: cloud: introduce a kubernetes undercloud to run openstack (via openstack-helm) - https://phabricator.wikimedia.org/T342750#10241636 (aborrero) Additional context. As of this writing, we have nothing in the roadmap with: * expanding into more datacenters beyond eqiad for serving Clou...
[12:05:16] cloud-services-team (Hardware): wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568#10241663 (aborrero)
[12:06:36] cloud-services-team (Hardware): wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568#10241669 (aborrero) p:Triage→Medium please @joanna_borun and @RobH review the proposal in the ticket description.
[12:10:05] cloud-services-team (Hardware): wmcs codfw hardware changes proposal - https://phabricator.wikimedia.org/T377568#10241689 (aborrero)
[12:38:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[12:38:28] cloud-services-team, Toolforge, Documentation, good first task: Find and fix inaccuracies in Toolforge Django tutorial - https://phabricator.wikimedia.org/T245683#10241747 (Aklapper) a:Chickenleaf→None @Chickenleaf: I am resetting the assignee of this task because there has not been progr...
[12:38:50] Cloud-VPS (Debian Buster Deprecation): Cloud VPS "wikicommunityhealth" project Buster deprecation - https://phabricator.wikimedia.org/T367560#10241753 (Aklapper) @CristianCantoro: Please reply, otherwise data might get deleted. Thanks.
[12:43:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:10:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:15:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:21:29] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:32:54] (PS3) Bking: analytics_test_cluster: add secret [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948)
[13:33:21] (CR) Bking: analytics_test_cluster: add secret (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[13:34:29] (CR) Btullis: [C:+1] analytics_test_cluster: add secret [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[13:38:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:43:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[13:55:29] (CR) Bking: [V:+2 C:+2] analytics_test_cluster: add secret [labs/private] - https://gerrit.wikimedia.org/r/1081261 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[14:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:08:54] Quarry: Quarry shows error: This web service cannot be reached - https://phabricator.wikimedia.org/T375988#10242045 (GTrang) Open→Resolved And now Quarry is working again.
[14:10:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[14:15:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[14:38:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[14:43:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[14:46:49] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T376556#10242264 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/454
[14:47:08] vivian-rook closed https://github.com/toolforge/paws/pull/454
[14:48:20] PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T376556#10242268 (rook) Open→Resolved a:rook
[14:54:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-27 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:04:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-27 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:08:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[15:13:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[15:38:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[15:43:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[15:49:52] VPS-project-Wikistats: Merge Gamepedia table with Wikia table (and perhaps rename Wikia table to Fandom as well?) - https://phabricator.wikimedia.org/T377549#10242588 (Dzahn) I see, thanks for reporting. So.. we have an issue with the wikia table because it's just too many wikis to update them in a reasonab...
[15:57:42] cloud-services-team, VPS-project-Codesearch, Security-Team, SecTeam-Processed, and 2 others: XSS - codesearch.wmcloud.org - https://phabricator.wikimedia.org/T377168#10242616 (sbassett) Open→Resolved p:Triage→Medium a:Bawolff
[17:21:29] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[17:23:52] Cloud Services Proposals, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS: Decision Request - How to do the Cloud VPS VXLAN/IPv6 migration - https://phabricator.wikimedia.org/T377467#10242944 (taavi)
[17:23:59] cloud-services-team, Cloud-VPS, Epic, IPv6: Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947#10242943 (taavi)
[17:53:19] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[17:53:49] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.55 ms
[18:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:53:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[19:58:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[20:11:48] VPS-project-Wikistats: Merge Gamepedia table with Wikia table (and perhaps rename Wikia table to Fandom as well?) - https://phabricator.wikimedia.org/T377549#10243422 (GroupNebula563) I was discussing with RhinosF1 a potential sort of API wherein wikis could POST to a specific URL to request addition and/or...
[21:21:29] FIRING: CloudVPSDesignateLeaks: Detected 5 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:07:20] FIRING: PuppetZeroResources: Puppet has failed generate resources on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:09:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:14:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:39:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:44:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[22:47:35] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwv-builder-03.mediawiki-vagrant.eqiad.wmflabs is about to expire in 6d 23h 58m 34s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:53:14] cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS, Tech-Docs-Team, Documentation: WMCS: Document different types of root and admin privileges - https://phabricator.wikimedia.org/T375113#10243742 (TBurmeister) The more symmetrical structure and consistent naming with prefixes looks good to me; I...
[23:40:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM
[23:45:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM