[02:07:56] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[02:12:56] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[05:58:21] Data-Services: [toolsdb] Can't authenticate with Toolsdb - https://phabricator.wikimedia.org/T351410 (Slst2020) Thank you @taavi
[08:14:13] Grid-Engine-to-K8s-Migration: Migrate mbh from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319883 (Ghuron) Thanks for the detailed responses, but I feel that one piece is still missing in the puzzle. As you can see, for instance [[ https://github.com/Saisengen/wikibots/b...
[08:22:53] Grid-Engine-to-K8s-Migration: Migrate panoviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319953 (tstarling) I would like to work on this, but I need to be [[https://toolsadmin.wikimedia.org/tools/id/panoviewer|added as a maintainer]]. Also, I would like to be a...
[08:39:22] Cloud-VPS (Quota-requests): Quota increase for reading-web-staging - https://phabricator.wikimedia.org/T355453 (dcaro) +1
[08:41:06] !log dcaro@urcuchillay reading-web-staging START - Cookbook wmcs.openstack.quota_increase (T355453)
[08:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Reading-web-staging/SAL
[08:41:11] T355453: Quota increase for reading-web-staging - https://phabricator.wikimedia.org/T355453
[08:41:12] !log dcaro@urcuchillay reading-web-staging END (FAIL) - Cookbook wmcs.openstack.quota_increase (exit_code=99) (T355453)
[08:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Reading-web-staging/SAL
[08:44:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:45:05] !log dcaro@urcuchillay reading-web-staging START - Cookbook wmcs.openstack.quota_increase (T355453)
[08:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Reading-web-staging/SAL
[08:45:15] !log dcaro@urcuchillay reading-web-staging END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T355453)
[08:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Reading-web-staging/SAL
[08:49:53] Cloud-VPS (Quota-requests): Quota increase for reading-web-staging - https://phabricator.wikimedia.org/T355453 (dcaro) Open→Resolved a:dcaro Done :)
[08:49:56] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:52:16] (PS2) David Caro: ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975
[08:52:18] (PS2) David Caro: ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976
[08:52:20] (PS3) David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977
[08:52:22] (PS3) David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978
[08:52:24] (PS3) David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979
[08:52:26] (PS1) David Caro: common.run_one_as_dict: fix typo on error message [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992084
[08:52:28] (PS1) David Caro: quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085
[08:52:37] (CR) David Caro: [C: +2] common.run_one_as_dict: fix typo on error message [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992084 (owner: David Caro)
[08:56:14] (CR) CI reject: [V: -1] ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975 (owner: David Caro)
[08:56:18] (CR) CI reject: [V: -1] ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976 (owner: David Caro)
[08:56:21] (Merged) jenkins-bot: common.run_one_as_dict: fix typo on error message [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992084 (owner: David Caro)
[08:56:23] (CR) CI reject: [V: -1] quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085 (owner: David Caro)
[08:56:25] (CR) CI reject: [V: -1] ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977 (owner: David Caro)
[08:56:27] (CR) CI reject: [V: -1] ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978 (owner: David Caro)
[08:56:29] (CR) CI reject: [V: -1] ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979 (owner: David Caro)
[09:03:36] Data-Services, cloud-services-team, DBA, Data-Engineering, and 2 others: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (taavi)
[09:04:26] Data-Services, cloud-services-team, Data-Engineering, Data-Persistence, Patch-For-Review: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact - https://phabricator.wikimedia.org/T337721 (taavi) Open→Invalid Pybal is no longer used here.
[09:05:38] Data-Services, cloud-services-team: maintain-views table filter not working for custom views on multiple tables - https://phabricator.wikimedia.org/T311588 (taavi) Open→Resolved a:taavi
[09:06:10] Data-Services: Are there indexes in Wiki replicas? - https://phabricator.wikimedia.org/T260457 (taavi) Open→Resolved That is a view as you don't have access to the underlying non-`_p` tables directly.
[09:30:56] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:35:56] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:37:28] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[09:41:51] (PS2) David Caro: quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085
[09:41:53] (PS3) David Caro: ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975
[09:41:55] (PS3) David Caro: ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976
[09:41:57] (PS4) David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977
[09:41:59] (PS4) David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978
[09:42:01] (PS4) David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979
[09:42:13] (CR) CI reject: [V: -1] quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085 (owner: David Caro)
[09:42:27] (CR) CI reject: [V: -1] ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975 (owner: David Caro)
[09:42:29] (CR) CI reject: [V: -1] ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976 (owner: David Caro)
[09:42:31] (CR) CI reject: [V: -1] ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977 (owner: David Caro)
[09:42:37] (CR) CI reject: [V: -1] ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978 (owner: David Caro)
[09:42:41] (CR) CI reject: [V: -1] ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979 (owner: David Caro)
[09:43:04] (PS3) David Caro: quota_show: fix change in openstack cli return value [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992085
[09:43:06] (PS4) David Caro: ceph: use timedelta instead of integers [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990975
[09:43:08] (PS4) David Caro: ceph.drain_osd_node: improve logs [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990976
[09:43:10] (PS5) David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990977
[09:43:12] (PS5) David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990978
[09:43:14] (PS5) David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/990979
[09:53:57] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-01-19 - https://phabricator.wikimedia.org/T355411 (fnegri) @dcaro let me know if the runbook is not clear enough on how to find...
[09:57:28] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[10:39:45] Striker: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (Soda)
[10:40:31] Striker, Phabricator: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (Soda)
[10:40:51] Striker, Phabricator: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (taavi) If I had to guess this is caused by the change to require a description on all newly created workboards.
[10:41:45] Striker, Phabricator: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (taavi)
[10:41:51] Striker, Patch-For-Review: Set description to tool URL when creating project tags - https://phabricator.wikimedia.org/T320916 (taavi)
[10:54:44] Striker, Phabricator: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (Peachey88) I was able to do it on #Tool-link-dispenser, I don't have anything other than #trusted-contributors similar to @Soda. @Soda what project did you attempt this one?
[11:01:24] Striker, Phabricator: Phabricator workboard creation is broken - https://phabricator.wikimedia.org/T355519 (Peachey88) Wait, just confirming if you were having trouble creating a Phabricator Project or the workboard on the project?
[11:04:35] Striker: Striker dev env fails to start with `manage.py runserver: error: unrecognized arguments: --nostatic` - https://phabricator.wikimedia.org/T355522 (taavi)
[11:15:52] Striker: Striker dev env fails to link sulwiki account to phabricator - https://phabricator.wikimedia.org/T355523 (taavi)
[11:25:00] Striker: Striker dev env gitlab root credentials do not work - https://phabricator.wikimedia.org/T355525 (taavi)
[11:26:34] Striker: Striker dev env gitlab root credentials do not work - https://phabricator.wikimedia.org/T355525 (taavi) Trying to follow https://docs.gitlab.com/ee/security/reset_user_password.html#reset-your-root-password... `lang=shell-session root@83a010dc3a39:/# gitlab-rake "gitlab:password:reset" Enter usernam...
[11:30:55] Striker: Striker dev env gitlab root credentials do not work - https://phabricator.wikimedia.org/T355525 (taavi) Workaround for now: ` root@83a010dc3a39:/# gitlab-rails console irb(main):001:0> user = User.find_by_username("strikerbot") => # irb(main):002:0> user.admin = true => true i...
[12:00:41] Striker, cloud-services-team, Phabricator: Striker can't create Phabricator projects - https://phabricator.wikimedia.org/T355519 (taavi)
[12:00:45] Striker, cloud-services-team, Phabricator: Striker can't create Phabricator projects - https://phabricator.wikimedia.org/T355519 (taavi) a:taavi
[12:02:32] (PS1) Majavah: phabricator: Require a project description [labs/striker] - https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519)
[12:05:32] (CR) CI reject: [V: -1] phabricator: Require a project description [labs/striker] - https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519) (owner: Majavah)
[12:09:59] (PS2) Majavah: phabricator: Require a project description [labs/striker] - https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519)
[12:14:43] Striker, cloud-services-team, Phabricator, Patch-For-Review: Striker can't create Phabricator projects - https://phabricator.wikimedia.org/T355519 (Soda) >>! In T355519#9476166, @Peachey88 wrote: > I was able to do it on #Tool-link-dispenser, I don't have anything other than #trusted-contributors...
[12:28:41] (PS1) Majavah: phabricator: Autofill project description from toolinfo if available [labs/striker] - https://gerrit.wikimedia.org/r/992145
[12:28:43] (PS1) Majavah: phabricator: Offer to set issue tracker URL in toolinfo [labs/striker] - https://gerrit.wikimedia.org/r/992146
[12:29:08] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - https://gerrit.wikimedia.org/r/992147 (owner: L10n-bot)
[12:32:03] (CR) CI reject: [V: -1] phabricator: Offer to set issue tracker URL in toolinfo [labs/striker] - https://gerrit.wikimedia.org/r/992146 (owner: Majavah)
[12:33:55] (PS2) Majavah: phabricator: Offer to set issue tracker URL in toolinfo [labs/striker] - https://gerrit.wikimedia.org/r/992146
[12:39:53] (PS1) Majavah: contrib: Improve setup docs a bit [labs/striker] - https://gerrit.wikimedia.org/r/992151
[12:48:01] (PS1) Majavah: repo: Offer to set SCM URL in toolinfo [labs/striker] - https://gerrit.wikimedia.org/r/992155
[12:56:31] VPS-project-Extdist: extdist should use object storage - https://phabricator.wikimedia.org/T355315 (Bugreporter) In long term I suppose that Extdist and ExtensionDistributor should be replaced with GitLab releases.
[12:57:28] (PS1) Majavah: docker(phabricator): add repository url custom field [labs/striker] - https://gerrit.wikimedia.org/r/992156
[12:57:30] (PS1) Majavah: phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157
[13:00:58] (CR) CI reject: [V: -1] phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157 (owner: Majavah)
[13:02:43] (PS2) Majavah: phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157
[13:04:20] (CR) CI reject: [V: -1] phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157 (owner: Majavah)
[13:05:55] (PS3) Majavah: phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157
[13:08:48] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (taavi)
[13:31:49] Data-Services, Wikidata: wb_terms still listed in wikireplicas maintain-views - https://phabricator.wikimedia.org/T265137 (taavi) Open→Resolved a:taavi
[13:41:29] VPS-project-Codesearch, Data-Platform-SRE ( 2023/24 Q3 Milestone 2), Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Gehel)
[14:22:17] PAWS: move paws-dev to pawsdev - https://phabricator.wikimedia.org/T355543 (rook)
[14:27:44] VPS-project-Codesearch, Data-Platform-SRE (2024.01.22 - 2024.02.11), Patch-For-Review: Add all Data Engineering gitlab repositories to codesearch - https://phabricator.wikimedia.org/T355069 (Gehel) p:Triage→Low
[14:44:42] (CR) David Caro: "LGTM, but I don't have a setup to try it out (yet), so untested" [labs/striker] - https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519) (owner: Majavah)
[14:45:41] cloud-services-team: PuppetFailure Puppet failure on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T355458 (Andrew) Open→Resolved a:Andrew Whatever this was is now resolved.
[14:46:15] 10cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T355460 (10Andrew) 05Open→03Resolved a:03Andrew [14:57:39] (03CR) 10Andrew Bogott: [C: 03+1] phabricator: Require a project description [labs/striker] - 10https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519) (owner: 10Majavah) [14:59:33] (03CR) 10Andrew Bogott: phabricator: Autofill project description from toolinfo if available (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/992145 (owner: 10Majavah) [15:03:40] (03CR) 10Majavah: [C: 03+2] phabricator: Require a project description [labs/striker] - 10https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519) (owner: 10Majavah) [15:05:12] (03Merged) 10jenkins-bot: phabricator: Require a project description [labs/striker] - 10https://gerrit.wikimedia.org/r/992131 (https://phabricator.wikimedia.org/T355519) (owner: 10Majavah) [15:05:37] (03CR) 10Majavah: phabricator: Autofill project description from toolinfo if available (031 comment) [labs/striker] - 10https://gerrit.wikimedia.org/r/992145 (owner: 10Majavah) [15:08:52] (03PS1) 10Gmodena: Update Hiera for deployment-prep deployment-eventstreams-2.deployment-prep.eqiad1.wikimedia.cloud [cloud/instance-puppet] - 10https://gerrit.wikimedia.org/r/992169 [15:10:10] (GaleraClusterSizeMismatch) firing: Galera in has 1 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:10:22] (HAProxyBackendUnavailable) firing: (2) HAProxy service mysql backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:10:53] 10Toolforge Build Service: Support monorepos with the Multi Procfile 
buildpack - https://phabricator.wikimedia.org/T355329 (10Count_Count) Back then it wasn't possible due to the peculiarities of the Rust buildpack, but now that I have restructured it to be a Rust "workspace" I will give it another shot. [15:11:22] (HAProxyServiceUnavailable) firing: (2) HAProxy service mysql has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:11:29] 10cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T352544 (10phaultfinder) [15:11:37] (03CR) 10Majavah: [C: 04-2] "This repository must be changed via Horizon or the ENC API, manual edits will just get everything out of sync." [cloud/instance-puppet] - 10https://gerrit.wikimedia.org/r/992169 (owner: 10Gmodena) [15:15:10] (GaleraClusterSizeMismatch) resolved: Galera in has 1 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:15:22] (HAProxyBackendUnavailable) firing: (3) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:15:24] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-wmcs-dnsleaks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:56] (SystemdUnitDown) firing: The service unit mariadb.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:16:58] PROBLEM - Check systemd state on cloudservices1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-wmcs-dnsleaks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:07] 10Grid-Engine-to-K8s-Migration: Migrate panoviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319953 (10dschwen) Tim, I just added you as a maintainer! [15:20:22] (HAProxyBackendUnavailable) firing: (3) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:21:22] (HAProxyServiceUnavailable) resolved: (2) HAProxy service mysql has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:21:56] (SystemdUnitDown) firing: (5) The service unit mariadb.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:24:21] (NeutronAgentDown) firing: (13) Neutron neutron-linuxbridge-agent on cloudvirt-wdqs1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:25:22] (HAProxyBackendUnavailable) firing: (3) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:26:57] (SystemdUnitDown) firing: (5) The service unit mariadb.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:27:53] (HAProxyServiceUnavailable) firing: (2) HAProxy service mysql has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:29:12] (03Abandoned) 10Gmodena: Update Hiera for deployment-prep deployment-eventstreams-2.deployment-prep.eqiad1.wikimedia.cloud [cloud/instance-puppet] - 10https://gerrit.wikimedia.org/r/992169 (owner: 10Gmodena) [15:30:22] (HAProxyBackendUnavailable) firing: (3) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:31:34] 10Striker, 10cloud-services-team, 10Phabricator: Striker can't create Phabricator projects - https://phabricator.wikimedia.org/T355519 (10taavi) 05Open→03Resolved [15:31:56] (SystemdUnitDown) firing: (4) The service unit mariadb.service is in failed status 
on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:32:00] 10Striker, 10Patch-For-Review: Set description to tool URL when creating project tags - https://phabricator.wikimedia.org/T320916 (10taavi) [15:32:12] 10Striker, 10cloud-services-team, 10Phabricator: Striker can't create Phabricator projects - https://phabricator.wikimedia.org/T355519 (10taavi) [15:32:18] 10Striker, 10Patch-For-Review: Set description to tool URL when creating project tags - https://phabricator.wikimedia.org/T320916 (10taavi) [15:34:21] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:36:56] (SystemdUnitDown) firing: (9) The service unit mariadb.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:39:21] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:41:43] 10Tool-ducttape, 10Abstract Wikipedia team: Provide mechanism for getting test artefacts out of pipeline - https://phabricator.wikimedia.org/T334228 (10Jdforrester-WMF) a:05SDunlap→03None [15:41:56] (SystemdUnitDown) firing: (10) The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:44:10] (GaleraClusterSizeMismatch) firing: Galera in has 1 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:44:21] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:46:57] (SystemdUnitDown) firing: (9) The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:49:21] (NeutronAgentDown) resolved: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:50:22] (HAProxyBackendUnavailable) firing: (3) HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:54:10] (GaleraClusterSizeMismatch) firing: (3) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - 
https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:54:10] (GaleraNotEnabled) firing: Galera not enabled on cloudcontrol1007:9104 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraNotEnabled - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraNotEnabled [15:54:21] (NeutronAgentDown) firing: (50) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:55:10] (GaleraNodeOutOfSync) firing: Galera node cloudcontrol1007:9104 is out of sync - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraNodeOutOfSync - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraNodeOutOfSync [15:56:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [15:56:56] (SystemdUnitDown) firing: (9) The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:57:52] (HAProxyServiceUnavailable) resolved: (2) HAProxy service mysql has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:59:10] (GaleraClusterSizeMismatch) resolved: (3) Galera in has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [15:59:10] (GaleraNotEnabled) resolved: Galera not enabled on cloudcontrol1007:9104 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraNotEnabled - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraNotEnabled [16:00:10] (GaleraNodeOutOfSync) resolved: Galera node cloudcontrol1007:9104 is out of sync - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraNodeOutOfSync - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraNodeOutOfSync [16:00:22] (HAProxyBackendUnavailable) resolved: (2) HAProxy service mysql backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:01:22] (HAProxyBackendUnavailable) firing: (6) HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:01:56] (SystemdUnitDown) firing: (9) 
The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:01:57] (CR) Andrew Bogott: [C: +1] phabricator: Autofill project description from toolinfo if available (1 comment) [labs/striker] - https://gerrit.wikimedia.org/r/992145 (owner: Majavah) [16:03:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [16:05:37] (HAProxyBackendUnavailable) resolved: (8) HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:09:21] (NeutronAgentDown) resolved: (6) Neutron neutron-linuxbridge-agent on cloudvirt1031 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [16:10:16] cloud-services-team, Infrastructure-Foundations, Security: Disable insecure rsa-ssh public key signature algorithm - https://phabricator.wikimedia.org/T318345 (joanna_borun) p:Triage→Medium [16:12:28] Cloud-VPS, cloud-services-team, Patch-For-Review: Move Galera clustering to cloud-private - https://phabricator.wikimedia.org/T355418 (Andrew) Open→Resolved a:Andrew [16:12:33] Cloud-VPS, cloud-services-team: Move Cloud VPS internal flows from cloud-hosts to cloud-private - https://phabricator.wikimedia.org/T355416 (Andrew) [16:12:43] (CR) Majavah: [C: +2] phabricator: Autofill project description from toolinfo if available [labs/striker] - https://gerrit.wikimedia.org/r/992145 (owner: Majavah) [16:13:09] cloud-services-team: PuppetDisabled Puppet disabled 
on cloudservices2005-dev:9100 - https://phabricator.wikimedia.org/T355276 (Andrew) Open→Resolved a:Andrew [16:14:16] (Merged) jenkins-bot: phabricator: Autofill project description from toolinfo if available [labs/striker] - https://gerrit.wikimedia.org/r/992145 (owner: Majavah) [16:14:46] (CR) Majavah: "Proposed alternative: https://gerrit.wikimedia.org/r/c/labs/striker/+/992157" [labs/striker] - https://gerrit.wikimedia.org/r/971912 (https://phabricator.wikimedia.org/T320915) (owner: Aklapper) [16:16:12] cloud-services-team, Infrastructure-Foundations, Security, User-MoritzMuehlenhoff: Disable insecure rsa-ssh public key signature algorithm - https://phabricator.wikimedia.org/T318345 (MoritzMuehlenhoff) [16:17:57] Cloud-VPS (Quota-requests): Quota increase for reading-web-staging - https://phabricator.wikimedia.org/T355453 (Jdlrobson) Thank you ! [16:19:45] Cloud-VPS, cloud-services-team, Infrastructure-Foundations, netbox: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (joanna_borun) a:cmooney [16:30:54] RECOVERY - Check systemd state on cloudservices1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:10] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:49] Data-Services, cloud-services-team, Data-Platform-SRE: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (bking) @taavi a few questions to clarify scope and amount of work required, since we've already been asked to [[ https://phabricator.wikimedia.org/T351354#9475546 | mov... 
[17:11:19] (PS1) David Caro: toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 [17:12:28] (CR) David Caro: "Sample run:" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro) [17:16:02] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-01-19 - https://phabricator.wikimedia.org/T355411 (dcaro) Got a cookbook running for it :) the output: ` dcaro@urcuchillay$ wm... [17:16:40] (CR) CI reject: [V: -1] toolsdb: add cookbook to retrieve stuck table+query [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/992215 (owner: David Caro) [17:19:48] (PS2) Majavah: dev(docker): Add repository url custom field [labs/striker] - https://gerrit.wikimedia.org/r/992156 [17:19:50] (PS4) Majavah: phabricator: Allow setting source repository project field [labs/striker] - https://gerrit.wikimedia.org/r/992157 [17:23:19] Data-Services, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Migrate largest ToolsDB users to Trove - https://phabricator.wikimedia.org/T291782 (fnegri) [17:26:08] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820 (fnegri) [17:30:08] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Create a community offering of OpenStack Magnum - https://phabricator.wikimedia.org/T328712 (fnegri) [17:30:57] (SystemdUnitDown) firing: The systemd unit wmf_auto_restart_prometheus-mysqld-exporter.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:31:01] cloud-services-team: SystemdUnitDown Unit wmf_auto_restart_prometheus-mysqld-exporter.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T355572 (phaultfinder) [17:35:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 (fnegri) [17:47:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Better support for Postgres on Trove - https://phabricator.wikimedia.org/T337396 (fnegri) [18:10:44] Toolforge, cloud-services-team, Patch-For-Review: Toolforge: Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664 (fnegri) [18:13:41] Toolforge, cloud-services-team, Patch-For-Review: Toolforge: Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664 (taavi) [18:14:12] Toolforge, cloud-services-team, Patch-For-Review: remove webservicemonitor (down due to DNS errors) - https://phabricator.wikimedia.org/T329467 (taavi) Open→Declined Let's just let this die when the grid dies. 
[18:14:48] Toolforge, cloud-services-team, Patch-For-Review: remove webservicemonitor (down due to DNS errors) - https://phabricator.wikimedia.org/T329467 (taavi) [18:14:50] Toolforge, cloud-services-team, Patch-For-Review: Toolforge: Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664 (taavi) [18:23:29] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-01-19 - https://phabricator.wikimedia.org/T355411 (fnegri) Awesome! 🎉 [18:33:10] Cloud-VPS, Toolforge (Toolforge iteration 03), cloud-services-team: Ensure Toolforge and Cloud VPS comply with Google's new email sender guidelines - https://phabricator.wikimedia.org/T354112 (taavi) [18:33:15] Toolforge: Require mail sent via the Toolforge mail servers uses a Toolforge domain - https://phabricator.wikimedia.org/T341004 (taavi) Open→Resolved a:taavi [18:36:06] Toolforge, cloud-services-team, Patch-For-Review: Maintain-dbusers troubleshooting for tools.sbot - https://phabricator.wikimedia.org/T355356 (taavi) Open→Resolved [18:42:33] Toolforge Build Service: Toolforge refused to install build-essential - https://phabricator.wikimedia.org/T355575 (Soda) [18:43:42] Toolforge Build Service: Toolforge refuses to install build-essential - https://phabricator.wikimedia.org/T355575 (Soda) [18:45:49] Toolforge Build Service: Toolforge refuses to install build-essential - https://phabricator.wikimedia.org/T355575 (LucasWerkmeister) CCing @dcaro who probably understands the Apt build pack best at the moment. Also, useful context from Telegram: > libc-dev is a virtual package, it makes sense that the packa... [19:31:01] (SystemdUnitDown) resolved: The systemd unit wmf_auto_restart_prometheus-mysqld-exporter.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:31:56] (SystemdUnitDown) resolved: The service unit wmf_auto_restart_prometheus-mysqld-exporter.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:43:49] Cloud-VPS, cloud-services-team: cloud-vps: monitor dns record leaks - https://phabricator.wikimedia.org/T354365 (Andrew) Open→Resolved a:Andrew [19:44:38] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T354491 (Andrew) [19:44:40] cloud-services-team: NodeDownForLong Node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T354496 (Andrew) [19:44:42] cloud-services-team: NeutronAgentDownForLong A Neutron agent has been down for more than 2h, VMs will have connectivity issues - https://phabricator.wikimedia.org/T354497 (Andrew) [19:45:37] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) [19:45:40] cloud-services-team: SystemdUnitDown Unit purge_vm_rbd_images.service on node cloudcontrol1005 has been down for long. 
- https://phabricator.wikimedia.org/T354924 (Andrew) Open→Resolved a:Andrew [20:04:44] Data-Services, cloud-services-team, Abstract Wikipedia team, DBA, Data-Platform-SRE: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (Jdforrester-WMF) Resolved→In progress [20:05:54] Data-Services, cloud-services-team, Abstract Wikipedia team, DBA, Data-Platform-SRE: Prepare and check storage layer for Wikifunctions.org (new public content wiki) - https://phabricator.wikimedia.org/T289316 (Jdforrester-WMF) In progress→Resolved [20:47:18] Grid-Engine-to-K8s-Migration: Migrate croptool from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319653 (Jmabel) As of yesterday, this tool is down. I'm guessing GridEngine has been turned off, and the issue with CropTool using it has still not been addressed. [21:03:26] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (taavi) These hosts are still in Netbox and are marked as occupying switch ports etc - can those be cleaned up? [21:03:28] Grid-Engine-to-K8s-Migration: Migrate request from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320003 (FNDE) Migration complete. Thanks all. [21:04:06] Grid-Engine-to-K8s-Migration: Migrate request from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320003 (FNDE) Open→Resolved [21:09:30] Cloud-VPS, cloud-services-team, DC-Ops, SRE, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF) Physically moved the server to F4, U18. Port 4 CableID 2M-20220019