[00:08:22] * andrewbogott out for the day
[09:32:30] hmpf... I think I broke toolsbeta images, I disabled the 'immutability' rule to be able to push a new builder, and the retention job ran, deleting a bunch of images that were meant to be immutable
[09:32:40] lima-kilo envs might fail pulling while I restore them
[09:39:37] oh, something is going on with ceph, looking
[09:44:28] * dhinus paged ToolsDBWritableState
[09:45:00] ceph has caught up
[09:45:21] I see the alerts in -feed, but nothing in alerts.w.o
[09:45:25] toolsdb crashed, probably because ceph was slow
[09:45:46] this ticket was created: T370752
[09:45:47] T370752: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T370752
[09:45:53] arturo: I think it auto-resolved
[09:46:23] I don't see the resolved message on the -feed channel
[09:47:30] just now I see
[09:47:31] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status
[09:47:42] none of them is getting into alertmanager
[09:48:08] that was in alertmanager
[09:48:09] correction: toolsdb did NOT crash, it just dumped some warnings to the log
[09:48:11] I acked it there
[09:48:13] (the ceph one)
[09:48:41] okok, so it's not triggering anymore
[09:51:01] https://usercontent.irccloud-cdn.com/file/7Y2NyUkw/image.png
[09:51:31] that looks ok (that's the toolsdb one)
[09:52:05] it did use the `vector(-1)` fallback when it got no data
[09:54:36] in logstash the alerts for ceph look ok too
[09:54:38] https://usercontent.irccloud-cdn.com/file/8I9Rp3Bv/image.png
[09:54:55] the cluster-in-warning one has also been triggering sometimes while I'm doing the tests, though it should have been silenced automatically
[10:03:24] we have a new alert about wikitech-static being out of sync
[10:03:27] never seen that one before
[10:04:14] me neither, the runbook could be a bit more specific too
[10:04:18] I don't fully understand why that alert shows with `team: wmcs`. I guess because andrew wants to stay in the loop
[10:04:37] I believe we can otherwise ignore the wikitech-static alert
[10:04:56] is wikitech-static one of those things that we don't officially own but practically do? :P
[10:05:05] kinda
[10:05:50] it might be https://wikitech.wikimedia.org/wiki/Wikitech-static#Automatic_content_syncronization
[10:06:09] summary: Project tools instance tools-db-3 is down
[10:06:20] dcaro: I'll look at tools-db-3
[10:06:22] slow ops again on ceph
[10:06:27] 31 slow ops, oldest one blocked for 112 sec, osd.246 has slow ops
[10:06:56] tools-db-3 seems to be up and running, including replication
[10:07:43] the mysql exporter failed on tools-db-3, restarting it
[10:09:37] does it say anything about why it failed?
[10:11:23] prometheus-mysqld-exporter: error: unknown long flag '--collect.heartbeat.utc'
[10:11:52] ohh, that one I found before, it requires upgrading the package
[10:11:55] that's why it's failing to restart, not why it failed
[10:12:17] I thought I had done it :/, maybe I missed db-3?
[10:12:20] I know arnaud was updating the package on prod hosts, maybe we need to do the same on toolsdb
[10:12:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053341
[10:13:02] ^ when I enabled the bpo repos
[10:13:05] I did a manual sudo apt install prometheus-mysqld-exporter
[10:13:42] that seems to have fixed the issue
[10:13:44] is it ignored by auto-updates?
[10:13:56] (apt install upgraded from 0.12 to 0.13)
[10:14:05] that should be it yes
[10:14:15] are auto-updates only for security updates maybe?
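For the auto-update question above: on a stock Debian setup, unattended-upgrades only pulls from the origins it is configured with (often just the security suite), so a backports build of prometheus-mysqld-exporter would typically be left alone. A rough way to check, assuming the default Debian config path and the package name from the conversation:

```
# Which origins does unattended-upgrades consider? (default Debian config path)
grep -A8 'Origins-Pattern' /etc/apt/apt.conf.d/50unattended-upgrades

# Which repo would the newer exporter come from?
apt-cache policy prometheus-mysqld-exporter

# Dry run to see whether unattended-upgrades would touch the package at all
sudo unattended-upgrade --dry-run --debug 2>&1 | grep -i mysqld-exporter
```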
[10:14:57] not sure now
[10:15:47] tools-db-1 is up to date
[10:15:52] so we should be fine
[10:16:30] we have a gap in the stats for tools-db-3 from July 10th to today
[10:17:11] dcaro: where did you see "Project tools instance tools-db-3 is down"?
[10:17:24] email and karma
[10:17:27] (alertmanager)
[10:17:33] 2 [ND ] 07/23 12:11 root@wmflabs.org [Cloud-admin-feed] [RESOLVED] InstanceDown tools (tools-db-3 node warning wmcs)
[10:17:36] ^ email
[10:18:54] hmm why did it only trigger today?
[10:19:39] no idea. why the instance-down one?
[10:19:46] that's at the openstack level, no?
[10:19:54] (unrelated to the mariadb exporter)
[10:20:38] yes I think you're right
[10:20:50] a short glitch today, probably caused by ceph
[10:21:11] that makes me think, maybe the exporter is not working on the other db replicas, we have two right?
[10:22:09] nope, only one at the moment. tools-db-2 was removed
[10:22:55] that should be the normal state: two replicas only temporarily, when creating a new one to replace the old one
[10:27:26] okok
[10:27:51] then we should be ok there, I was expecting the auto-updates to upgrade the package :/
[10:30:23] not a big issue
[10:30:43] now that the exporter restarted, we got an alert about replication lag
[10:30:50] xd
[10:30:51] usual problem with a long tx, I created https://phabricator.wikimedia.org/T370760
[10:31:35] 👍 yep, big deletes have been common long transactions
[10:32:14] yes, I listed a couple of things to try in the parent task, but I'll do it after the upgrade to 10.6
[10:33:53] did you find anything about the wikitech-static alert?
[10:34:49] this is listed in the runbook but doesn't work for me: ssh root@wikitech-static.wikimedia.org
[10:35:49] maybe I'm only missing the ssh public key for that server
[10:39:32] I did not check further, let me have a look
[10:40:41] ok I managed to ssh to the host
[10:40:58] nice
[10:43:05] is there a reason not to deploy the ingress-admission to lima-kilo?
[10:44:25] there might not be images in toolsbeta (/me restoring)
[10:44:30] otherwise sounds ok to me
[10:44:40] (resources? though I don't think it needs a lot)
[10:46:03] it's been working ok while testing on it
[10:47:18] unless you are rebuilding, it should not try to re-pull anything I think
[10:47:35] (if it's pulling from tools, it should not fail anyhow)
[10:51:30] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/42
[10:51:56] (merging all the api deprecation patches in about an hour)
[10:57:43] I'm struggling to find where the failing check for wikitech-static is defined
[10:58:02] the icinga alert is "check_wikitech_static", but I'm not finding it on the host
[10:58:52] blancadesal: +1d
[10:59:03] thanks!
[10:59:36] dhinus: from puppet, it seems it should be under /usr/lib/nagios/plugins/check_wikitech_static
[10:59:53] if you are looking for the script
[11:00:02] yep, thanks
[11:00:17] '/usr/lib/nagios/plugins': No such file or directory
[11:00:19] :D
[11:00:22] hahahaha
[11:00:22] xd
[11:00:40] I don't think puppet is even installed
[11:01:05] are you sure you are on the right host?
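For the check hunt above: when an Icinga check does not live on the monitored host, grepping the puppet tree (and the Icinga host itself, which turns out to be the answer below) is usually quicker than guessing paths. A sketch only; the paths on the alert host are assumptions, not verified:

```
# In a local clone of operations/puppet: where is the check command declared?
git grep -n 'check_wikitech_static' modules/

# On the Icinga/alert host: look for the generated config and the plugin itself
sudo grep -rl 'check_wikitech_static' /etc/icinga/ 2>/dev/null
ls /usr/lib/nagios/plugins/ | grep -i wikitech
```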
[11:01:57] maybe :D icinga has "wikitech-static.wikimedia.org" and that's where I sshed
[11:02:27] I think that the current alert is https://gerrit.wikimedia.org/g/operations/puppet/+/f8720fa20af437e00dad12189c05d1f903beb751/modules/icinga/manifests/monitor/wikitech_static.pp#10
[11:02:40] not the icinga plugin
[11:03:13] wait, the command there is check_wikitech_static (that is most probably that plugin)
[11:03:39] yes the command is "check_wikitech_static" and I'm not finding it anywhere on the host :P
[11:04:13] ah-ha, it's on alert1001
[11:04:14] hmm, does it run on icinga itself?
[11:04:17] yep
[11:04:18] exactly!
[11:04:18] xd
[11:04:27] * dhinus hates icinga
[11:06:37] ok so that check curls api.php and checks RecentChanges
[11:06:53] so maybe something in the sync went wrong
[11:07:05] the logs on wikitech-static show the sync script ran last night
[11:07:33] with no visible errors
[11:07:41] I'll try re-running it manually
[11:08:26] 👍
[11:13:28] hmm, it starts, then it gets stuck with no errors :/
[11:14:23] I'll ask in -sre
[11:16:06] ack
[11:16:07] * dcaro lunch
[11:32:55] dhinus: the MR https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/22/diffs fails the plan :-(
[11:33:36] https://www.irccloud.com/pastebin/Qfpp7EOX/
[11:33:42] I can't tell what is happening
[11:34:13] but suddenly it's missing the hashicorp/openstack provider
[11:34:19] hmm looking
[11:34:36] does it only happen with your MR or also on main?
[11:35:15] on the MR, main is clean
[11:35:27] it's working locally
[11:35:45] with tofu init -backend=false
[11:35:55] Installed hashicorp/openstack v1.54.1. Signature validation was skipped due to the registry not containing GPG keys for this provider
[11:36:22] in main we have
[11:36:24] Initializing provider plugins...
[11:36:24] - Reusing previous version of terraform-provider-openstack/openstack from the dependency lock file
[11:36:24] - Using previously-installed terraform-provider-openstack/openstack v2.0.0
[11:36:44] so I guess the mystery is: why is it changing from terraform-provider-openstack/openstack to hashicorp/openstack
[11:38:40] I emptied my local cache and it's fetching it correctly
[11:38:43] not sure about the two names
[11:39:03] is one of the two our custom cloud-vps provider?
[11:39:28] no, it is a locally cached copy of the provider, in the git repo
[11:39:39] see https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/tree/main/vendor-providers/registry.opentofu.org/terraform-provider-openstack/openstack?ref_type=heads
[11:40:18] right
[11:44:42] I'm confused, it's also two different versions (2.0.0 vs 1.54.1)
[11:45:19] maybe that's why taavi had embedded the provider config in each module, to override the default selection from hashicorp
[11:50:19] but I can "tofu init" just fine locally
[11:54:14] if I create an empty directory with a simple providers.tf, it does not download the "hashicorp" one
[11:55:14] I'm reading this https://developer.hashicorp.com/terraform/cli/config/config-file#explicit-installation-method-configuration
[11:56:45] <>
[11:57:59] so this confirms what I suspected: by removing each local required_providers config it reverts to the default behavior of downloading the module providers, which in turn defaults to the hashicorp ones
[11:58:54] but why does it default to the hashicorp ones?
[11:59:11] corporate policy I guess? :-P
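The default-namespace behaviour being puzzled over here can be reproduced in isolation: without an explicit required_providers block, tofu resolves the short provider name "openstack" to the hashicorp/ namespace. A minimal sketch of the fix; the version constraint is an assumption, pin whatever the lock file actually says:

```
mkdir -p /tmp/tofu-provider-test && cd /tmp/tofu-provider-test

# Without this block, a resource such as the "openstack_compute_instance"
# test mentioned just below makes tofu look for hashicorp/openstack.
cat > providers.tf <<'EOF'
terraform {
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 2.0"
    }
  }
}
EOF

tofu init -backend=false   # now installs terraform-provider-openstack/openstack
```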
[11:59:58] right so "openstack" is linked by default to "hashicorp/openstack"
[12:22:02] confirmed, I created a test tofu config without any provider block, and just a resource "openstack_compute_instance", and it tries to use hashicorp/openstack
[12:22:24] it recognizes the "openstack_" prefix in the resource, and it links it by default to hashicorp/openstack
[12:23:43] then I also found why on cloudcontrol it fails to download the hashicorp one: there's a /root/.tofurc telling tofu to only install providers from a local directory (/srv/tofu-infra/vendor-providers)
[12:23:55] right
[12:24:43] I'm not sure that is strictly needed, it might be fine to download from the tofu registry... maybe taavi wanted to make sure we did not use any non-free provider
[12:25:13] but in any case, you need to specify a required_providers{} block in all the modules, apparently
[12:25:20] I think it is a good idea not to depend on external registries to be able to deploy our stuff, if it is just a matter of caching
[12:25:50] I think it is yeah, it's just a little obscure because it's hidden in a /root/.tofurc file
[12:25:55] but that's probably ok
[12:31:21] my theory
[12:31:30] if we include something like this in the tofurc config file
[12:31:33] https://www.irccloud.com/pastebin/dEwewc8Q/
[12:31:45] we can prevent any usage of hashicorp providers
[12:32:11] hmm but I'm not 100% sure that other registries include only FOSS providers
[12:32:13] but then, that tofurc file is tracked via puppet. I would rather track it in the tofu-infra repo
[12:34:04] it looks like .tofurc needs to be in the home directory and cannot be in the working dir
[12:34:36] it can be specified via the TF_CLI_CONFIG_FILE env var
[12:34:54] so we could add it to the wrapper...
[12:35:01] yeah
[12:35:28] my 2c is that we should just remove the config file, we can still check in the lock file that we don't introduce any non-free provider
[12:35:28] quick version-bump reviews: https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/7
[12:35:39] https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/10
[12:35:57] https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/14
[12:39:21] blancadesal: +1d
[12:40:58] thanks!
[12:48:11] dhinus: latest version of https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/22 includes a somewhat in-the-middle solution: soft-link the providers.tf files from the child modules to the root one
[12:48:32] it works just fine, let me know if you think this is acceptable
[12:54:47] I think that's ok!
[12:56:28] ok
[12:57:30] I also noticed that 2.1.0 was released yesterday with some improvements :)
[12:58:13] https://github.com/terraform-provider-openstack/terraform-provider-openstack/blob/main/CHANGELOG.md#210--22-july-2024-
[13:01:44] time to update then :-)
[13:03:22] * arturo food
[13:06:50] dcaro: another quick one 🙏 https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/448
[13:48:09] blancadesal: sorry, meeting
[13:48:11] +1d
[13:49:00] thanks!
[13:53:40] finally got my lima-kilo rebuilt... now I'm sure there are no missing artifacts in toolsbeta xd
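For reference, the /root/.tofurc mechanism discussed above (around 12:23) is the CLI provider_installation setting: with only a filesystem_mirror block, tofu will not install providers from anywhere else. This is a hedged sketch rather than the actual file on cloudcontrol, reusing the mirror path mentioned in the conversation, and the last line just illustrates the TF_CLI_CONFIG_FILE "wrapper" idea:

```
# A CLI config restricting provider installation to the in-repo mirror
cat > /tmp/tofurc <<'EOF'
provider_installation {
  filesystem_mirror {
    path = "/srv/tofu-infra/vendor-providers"
  }
}
EOF

# "add it to the wrapper": point tofu at the config explicitly
TF_CLI_CONFIG_FILE=/tmp/tofurc tofu init
```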
[13:55:14] :))
[14:26:36] the volume-admission controller is having TLS handshake errors on both toolsbeta (where I'm testing the newest version) and tools (running the previous version):
[14:26:41] https://www.irccloud.com/pastebin/NTOu86Fk/
[14:28:53] similar logs on tools
[14:31:55] both versions work fine on lima-kilo
[14:37:58] I have seen them before. I think they are harmless, but to double check, we would need to figure out what is using that IP address, 192.168.113.192 for example
[14:38:06] is that the api-server?
[14:40:11] arturo: it's the volume-admission
[14:42:00] there are no 'normal' logs in there, it's all errors
[14:43:21] no, none of the pods of the volume-admission seem to own those IP addresses
[14:43:23] https://usercontent.irccloud-cdn.com/file/J3TebgpL/image.png
[14:44:47] blancadesal: did you change the cert/secret names in the templates?
[14:45:19] dcaro: no, should I have?
[14:45:31] there are no pods with the IP address 192.168.113.192, which supports my theory: those errors are just pods not cleanly closing the HTTPS connection
[14:45:33] nono, just wondering, configmap changes don't usually reboot pods
[14:46:42] for all the admission-controllers, all I've been doing is updating the two k8s libs in go.mod
[14:47:33] interesting
[14:47:47] and I don't think arturo's patch from last week did anything either https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/414
[14:47:49] in prod, you say it's using the 'previous' version
[14:47:59] and it's still failing, right?
[14:48:14] tools is using the version arturo deployed a week ago
[14:48:23] you have not restarted it?
[14:48:34] no
[14:49:09] ack
[14:49:13] interesting
[14:49:36] are things still working?
[14:50:23] nothing is down. the logs from tools go back to the last deploy on the 16th
[14:51:09] whether the controller is actually validating anything, I don't know
[14:51:52] the volume admission doesn't validate, it mutates things
[14:52:14] as in, when a pod is created with --mount=all, you have access to the user's home
[14:52:39] it is working
[14:52:48] pods created 12s ago have all the mounts
[14:53:10] the error is most likely harmless, TLS clients not closing the connection cleanly when dying
[14:53:17] yep, it's working
[14:53:28] all the IP addresses reported in the logs are of clients that no longer exist in the cluster (pods that went away)
[14:53:30] so yep, it seems it's not a new thing
[14:53:59] it's weird though
[14:54:01] interesting. I didn't see this when I last deployed 2 weeks ago
[14:54:22] it's the same IP continuously
[14:54:24] (for tools)
[14:54:31] 192.168.57.64
[14:54:34] and why aren't there regular logs interspersed with the errors?
[14:55:06] does the controller log anything?
[14:55:11] :-)
[14:55:50] it does in lima-kilo
[14:55:56] it might not be set to debug
[14:55:58] each time it gets a request
[14:56:01] (if there's a debug)
[14:56:10] ah, that might be it
[14:57:25] ok, so nothing to worry about?
[14:57:46] this is the kind of debugging exercise we do in SRE job interviews :-)
[14:58:15] slavina joining the SRE side of the force
[14:58:22] the IP belongs to the calico tunnel of the tools-k8s-control-9 node
[14:58:45] i'm being pushed toward the dark side xd
[14:59:01] dcaro: how did you find out?
[14:59:38] the SRE side is not the dark side 😢
[15:00:03] who said that dark = bad? :p
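A quick way to double-check that the admission webhook is still mutating new pods, independent of its (quiet) logs; the webhook name filter, namespace and pod name below are placeholders, not the real objects:

```
# Is the webhook still registered with the API server?
kubectl get mutatingwebhookconfigurations | grep -i volume

# For a freshly created tool pod, confirm the expected volumes were injected
kubectl -n tool-example get pod example-pod -o jsonpath='{.spec.volumes[*].name}{"\n"}'
```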
[15:00:13] dcaro@tools-bastion-13:~$ kubectl-sudo get nodes -o yaml | grep 192.168.57.64
[15:00:16] that let me know it's there
[15:00:21] dcaro@tools-bastion-13:~$ kubectl-sudo get nodes -o yaml | vim -
[15:00:30] that plus searching inside showed me which node
[15:01:00] I think it might be just the api-server calling the controller (makes sense)
[15:01:30] yes, that was my initial guess as well
[15:01:44] blancadesal: touché
[15:02:18] xd
[15:02:50] well then, I'll go ahead with deploying volume-admission in tools
[15:02:51] I don't see errors on the api-server side
[15:02:57] just the calls
[15:03:24] https://www.irccloud.com/pastebin/O53IxUVn/
[15:03:54] dhinus: are you taking care of the power supply task?
[15:04:16] andrewbogott: I was looking for a previous similar task, but I think I can just create a new one
[15:04:26] ok, thanks!
[15:04:56] I found this one from May on the same host: T368212
[15:04:56] T368212: PowerSupplyFailure - https://phabricator.wikimedia.org/T368212
[15:04:57] maybe the volume-admission code is missing some kind of connection closure, and what we see is the system TCP timeout
[15:05:09] but I can't find the related dc-ops task (if there is one)
[15:05:15] dhinus: https://phabricator.wikimedia.org/T368211
[15:05:25] andrewbogott: thanks!
[15:06:01] now I'm thinking there's a short in the cable
[15:09:44] I added #ops-codfw to the new task and linked to the previous one
[15:10:33] and the alert is magically gone :O
[15:19:49] dang
[16:05:47] I just got a funny spam email from microsoft
[16:06:07] "Secure your nonprofit against cyber threats" <-- I guess they will recommend that I run crowdstrike
[16:07:09] * arturo offline
[16:10:35] a.rturo: sounds like you got put on the "Tech for Social Impact" advertising list from Microsoft too. I just unsubscribed, and I really wish I knew which conference sold them my address.
[16:11:38] I don't think it's a conference, because I received it as well, and I haven't attended a con as a wikimedia-affiliated person
[16:12:26] * andrewbogott got it too
[16:14:15] * dhinus got it too
[16:14:55] did you miss out on the blue screens this time? make sure you get them next time!
[17:25:34] * dcaro off
[17:25:36] cya tomorrow