[01:10:56] * bd808 off
[08:38:02] I'm rebooting all codfw1dev cloudvirts (again) to test my new cookbook to restart them all
[08:57:47] morning
[08:57:52] ack
[09:00:06] ok!
[09:00:09] good morning
[09:11:04] I'm seeing some weird errors on tools-k8s-worker-nfs-25 (unreachable through ssh, using the console)
[09:11:07] Feb 14 09:10:16 tools-k8s-worker-nfs-25 containerd[3178]: time="2024-02-14T09:10:16.834661515Z" level=error msg="StopPodSandbox for \"c3750d94c13e9be5a6ffab687fc0ef380261df8f97a22031360c38ce51bccf43\" failed" error="failed to destroy network for sandbox \"c3750d94c13e9be5a6ffab687fc0ef380261df8f97a22031360c38ce51bccf43\": plugin type=\"loopback\" failed (delete): failed to find plugin \"loopback\" in path [/usr/lib/cni]"
[09:12:20] it's also failing puppet runs (complaining about a self-signed cert)
[09:12:24] odd, I'll have a look
[09:12:40] (I provisioned that node this morning)
[09:12:41] is it one you just created?
[09:12:45] ah, okok
[09:13:03] I'll let you play with it then :)
[09:15:25] I ran the refresh puppet certs cookbook on it and now it's fine
[09:15:27] how did you notice that?
[09:24:03] there was an alert about puppet failing to run, then I tried to ssh and that did not work, so I used the console and checked journalctl (and browsed a bit, also finding the puppet cert error)
[09:25:08] taavi: I suspect that removing toolforge workers leaks the DNS records, is that a known issue?
[09:28:18] https://www.irccloud.com/pastebin/osL7dVVE/
[09:29:14] dcaro: yes. the tl;dr is that these instances are too old and so the automated cleanup doesn't work properly. so Andrew and I are cleaning things up manually
[09:29:38] okok, should I leave it to you or can I just run the cleanup script?
[09:30:10] go ahead if you already have it handy, I don't think I'll be deleting more today
[09:30:29] awesome, will do
[09:31:05] thanks!
[09:44:02] aand the cloudvirt reboot in codfw1dev finished. can I get a review on https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1003047 before I use it in eqiad1?
[09:55:03] taavi: if I understand correctly, the parallelization there is done by cumin, when using the `str(hosts)` as `D{{{fqdn}}` for the cumin query, right?
[09:58:44] hmm, how does it work with the drain cookbook? (I'll point it out in the patch)
[10:01:27] in run_batch_operation() it splits the host set to run on one host at a time
[10:02:23] so the moment we enable multiple hosts per batch, that cookbook will fail, no?
[10:04:25] right now the implementation of SafeRebootRunner would mean that changing the batching logic in run_batch_operation() would be an issue, yes
[10:06:10] hmm, maybe we can verify that there's only one at the beginning of the `run_on_hosts` implementation to make it explicit
[10:07:11] sure, one moment
[10:07:59] otherwise it looks ok, I still don't easily see how to parallelize the batch (if that's the goal), we could try to do so in `run_batch_operation` on the metaclass, otherwise the cookbook will have to handle it itself (that might be the right thing though, as it knows how to parallelize stuff)
[10:08:29] as in, it knows the details of how to parallelize the actions of the specific cookbook (e.g. first drain all in parallel, then reboot all in parallel)
[10:09:39] taavi: LGTM
[10:11:04] thanks! I think something like what you said (drain x nodes, reboot them, undrain, then repeat for the next x nodes) is worth at least exploring
[10:11:18] yep, would save a lot of time
[10:50:20] anyone doing anything on cloudvirt1031?
[10:51:41] I am running the 'reboot all cloudvirts' cookbook, which seems to have done 1031 and then crashed when draining 1032
[10:51:44] ah yes, taavi, let me know if you want any help, it seems it failed to reboot or similar and the alerts (for neutron) are showing up
[10:52:00] * dcaro should look first on cloud-feed
[10:53:52] I think it's just that silencing things from cloudcumin is broken and an alert will fire faster than a reboot takes
[10:54:09] okok, I'll keep that in mind then
[10:59:58] sorry about that, we really need to get this fixed :/
[11:18:59] I rebased https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/991016 to silence the alerts in alertmanager
[11:24:27] hmm, this time the drain script crashed on an instance on 1032 that's in error state
[11:27:45] icinga is not the source of pages for anything anymore, right? and things are moving out of it already, I think that silencing alertmanager only is better than not silencing anything, and it's even future-proof xd
[11:27:51] I'll comment there too
[11:28:21] I don't think we have anything paging for cloudvirts in icinga anymore
[11:29:16] topranks: any specific agenda for the network meeting later today? anything to prepare on our side?
[11:29:18] I think so too
[11:35:08] and other patches in the same stack could use some reviews too :-)
[11:43:28] I just reviewed them
[11:44:44] thanks
[12:06:49] arturo: nothing specific I want to raise in the meeting later
[12:07:21] Probably good for us to get T316544 (cloudsw upgrades) over the line - I think we said last time we could prep for it; I didn't follow up myself
[12:07:22] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[12:57:58] ack
[13:36:57] I have disabled the first batch of 40 tools. 12 of them have already been archived
[13:40:31] I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/998401 to make the floating IP reverse DNS updater use an RFC2317-style zone name
[13:51:19] that seems to have made the designate api unwell. looking
[13:56:38] seems like the script was just mass-creating records a bit too fast and designate was not keeping up. all fine now
[14:13:38] According to https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Renewing_puppetmaster_CA_certificate certs should expire after five years, but the paws system is not yet 2 years old; is expiration in this situation expected?
[14:17:38] Rook: likely the puppetmaster was replaced at some point for OS upgrades but the CA certificate was simply transferred from the previous one instead of being entirely replaced
[14:17:50] yep, that's the case
[14:18:18] Fair enough
[14:18:48] is there anything in paws that requires a local puppetmaster anymore? you can likely just get rid of that instead of renewing the cert now that the prometheus VMs are gone
[14:19:06] I think the last VMs were the prometheus ones maybe?
[14:19:18] The only thing that I'm aware of is the magic that manages the nfs exports file
[14:19:38] oh, true, the nfs server
[14:19:49] I don't think the NFS server needs a local puppetmaster?
[14:20:02] probably not
[14:20:47] there's a bastion also
[14:20:56] (just checking the VMs in horizon)
[14:22:20] it's possible that it's not needed at all and we can use the common puppetmaster, yes, that'd be nice
[14:23:01] The bastion doesn't use puppet in any paws-specific way
[14:23:21] So the nfs exports magic still happens if I drop the local puppet master?
[14:23:59] you'll have to move the nfs VM to use the central puppetmaster, but I think it should
[14:24:22] do you have any secrets on the private puppetmaster repo?
[14:24:31] None that I know of
[14:24:31] (I'm guessing not)
[14:24:58] then it should be possible :)
[14:30:06] That should only be taking `puppetmaster: paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud` out of the project puppet, yes?
[14:30:48] not really, you have to rebuild the host certificates and such
[14:31:13] do that and run the refresh certs cookbook I think?
[14:31:21] we have a cookbook to move ... yep
[14:31:48] How is that run?
[14:32:10] not sure if the cookbook has any expectation of what the puppetmaster should be though (as in, I have only used it to move hosts to a project puppetmaster from the common one, not the other way around)
[14:32:27] Also, any ideas why paws-nfs-1.paws.eqiad1.wikimedia.cloud is not letting me in as me?
[14:32:56] a quick look looks ok
[14:33:02] (the cookbook I mean)
[14:33:53] I can ssh as root, but not as dcaro
[14:34:40] oh, there might be project-specific secrets for the replica_cnf_api
[14:37:30] yep, there are some secrets, so still needed
[14:38:01] What's the replica_cnf_api?
[15:00:20] sorry, got distracted, it's the API that allows us to generate database credentials for the users
[15:00:39] and puts them in the nfs directories (that's why it's running on the NFS server)
[15:16:34] Oh that. Very good. Well I believe the puppet master is updated, so all can remain as is for now
[15:17:27] Rook: did you follow that process? did it work ok? (I have to do it on a couple of other puppetmasters)
[15:18:14] * dhinus paged: cloudvirt1041/ensure kvm processes are running
[15:18:20] Yes, though I need to make one edit to it.
[15:18:38] dhinus: me too, looking since I just rebooted it
[15:18:48] awesome :), please do, I appreciate it
[15:19:17] taavi: dhinus: I got paged by that, but it went away right after, I thought taavi was rebooting it
[15:19:37] well I did, but it's not supposed to page
[15:19:56] dcaro: done, it now reflects what I did
[15:19:59] I thought it was just a slow reboot maybe (like the other alerts)
[15:20:05] Rook: thanks a lot!
[15:20:28] do we not have a canary on cloudvirt1041?
[15:21:58] now we do
[15:22:35] did you have to create it?
[15:23:00] I ran the cookbook to create that
[15:23:05] btw I removed Nicholas from VictorOps, and I'm moving the shift change from Wed to Thu
[15:24:09] and I'm adding back arturo :P
[15:33:44] 👍
[15:46:14] Uh oh — The VictorOps user you have linked to your external SSO ID is not part of wikimedia. Please contact your administrator.
[15:46:29] ^^^ I get that when trying to log into victorops
[15:46:42] hmm I sent you a new invite, but maybe there's some trace of your old user in the system
[15:46:50] did you get an email from victorops?
[15:46:59] yes I did
[15:47:10] it feels like my user is disabled or something
[15:47:27] *it's "pending", but I guess you're unable to link it for some reason
[15:48:11] you might have to email techsupport@wikimedia.org
[15:50:51] also, the invitation takes you to a form to create a new account, which I think is wrong. I need to use the SSO
[15:51:11] I think o11y manages victorops accounts instead of ITS?
[15:52:35] all selected tools have been stopped and moved to the Backlog (Disabled) column on the phab board.
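A quick way to check whether a puppetmaster CA certificate (or a host certificate) is close to expiring, as came up with the paws puppetmaster above: a minimal sketch, assuming the Debian puppet packages' default ssldir of /var/lib/puppet/ssl (actual paths may differ per host and puppet version):

  # print the expiry date of the CA certificate as the agent sees it
  openssl x509 -noout -enddate -in /var/lib/puppet/ssl/certs/ca.pem
  # and of this host's own certificate
  openssl x509 -noout -enddate -in "/var/lib/puppet/ssl/certs/$(hostname -f).pem"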
[15:53:01] \o/
[15:53:06] Now we just have to wait for the complaints :)
[15:54:08] hm, https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1 is showing many more tools still running than I'd expect
[15:56:44] ah, zooming in helps a bit. is "backlog" supposed to be the column for tools that have not been disabled?
[15:57:08] komla: ^ ?
[15:58:38] I stopped a little over 100 tools
[15:58:46] what about the other columns like "help wanted" and "pywikibot"? are those supposed to be disabled or not?
[15:59:22] they are part of the tools that are still running.
[15:59:49] does the grafana dashboard update instantly?
[16:00:43] taavi: yes, backlog has tools that have not been disabled
[16:01:53] ok, I see
[16:02:02] I also see some tools (like wikishizhao) running that have no task at all
[16:03:45] maybe those were born after the board was compiled?
[16:04:16] yeah, probably
[16:04:51] let me check. there were incidents where a tool was created after the fact. let me check and then create the ticket
[16:04:55] yeah
[16:07:26] taavi: yeah, I think you're right about victorops accounts -- arturo: try pinging in #-observability maybe?
[16:07:33] there's also some info here: https://wikitech.wikimedia.org/wiki/Splunk_On-Call
[16:07:45] that seems to suggest you have to create a non-SSO user first, then link it to SSO
[16:08:02] I'll do that tomorrow
[16:08:05] thanks!
[16:08:08] no rush :)
[16:12:25] I see wikishizhao running on the grid-deprecation portal but there's no entry on toolsadmin when I search. It also has no members/maintainers
[16:13:11] hmmm, I wonder if that's someone running something on the grid as their user
[16:15:49] himowd is a tool that is running things on the grid without a ticket
[16:19:48] and gergesbot
[16:21:18] okay. checking
[16:22:06] it is completely possible to run workloads on grid engine as a user rather than a tool
[16:22:48] I will pull from the grid-deprecation portal again and go through the list
[16:23:31] the ones I listed were running something at that exact moment, https://grid-deprecation.toolforge.org/ seems to have a bunch more that only have cron jobs that were not running at that moment
[16:24:50] yeah. there are currently 116 on the grid portal. many of them showing as disabled
[17:02:23] andrewbogott: I was thinking of rebooting the cloudvirtlocals, are you around in case anything goes wrong?
[17:03:24] (context is T356975)
[17:10:32] dhinus: I'm here, go ahead.
[17:10:54] thanks
[17:12:01] rebooting cloudvirtlocal1001
[17:16:46] hmm, the cookbook tried and failed to migrate instances
[17:17:07] "No valid host was found. There are not enough hosts available."
[17:17:22] what is the right way to do this?
[17:18:15] I'm trying to create an MX record set in horizon for wikimedia.beta.wmflabs.org. pointing to 185.15.56.115 but it refuses to create it without any error
[17:18:21] Error: Unable to create the record set.
[17:18:40] I can't find any existing MX record for this
[17:18:49] I could make an A record for the domain no problem
[17:21:02] Amir1: MX records point to FQDNs, not IPs
[17:21:23] I tried FQDNs as well
[17:21:29] but maybe not the correct one
[17:21:34] gonna try that too
[17:21:39] did you try them with a trailing dot?
[17:21:58] both
[17:22:03] I'm desperate
[17:22:49] and you have the priority included in the record too?
[17:23:38] andrewbogott: from my IRC logs, the last time we touched the cloudvirtlocals we used wmcs-cold-migrate, does that sound right to you?
[17:23:51] that might be why, let me take a look
[17:24:13] dhinus: if you're just rebooting them then not migrating anything might be easier
[17:24:51] dhinus: yes, just reboot. The cookbook probably doesn't know what to do with a local
[17:24:58] and we don't want to migrate, just let the VMs reboot
[17:25:56] sigh, I thought that was a separate field
[17:25:58] jeez why
[17:26:22] amir@amir:~$ dig wikimedia.beta.wmflabs.org mx +short
[17:26:22] 10 instance-deployment-mx03.deployment-prep.wmflabs.org.
[17:26:24] taavi, what am I doing wrong with 'cluster health'?
[17:26:26] thanks
[17:26:29] https://www.irccloud.com/pastebin/oL5ZLmjk/
[17:27:11] andrewbogott: etcd will not talk to you without a client cert
[17:27:15] (I can also keep googling if you don't immediately see my error)
[17:27:22] ok, I'll add that
[17:27:34] * andrewbogott doesn't understand why it doesn't just take the default cert from /etc/etcd
[17:28:26] andrewbogott, taavi: are you suggesting using sre.hosts.reboot-single instead of wmcs.openstack.cloudvirt.safe_reboot?
[17:29:24] dhinus: yes
[17:29:35] I have identified a couple more tools without tickets. I'm creating the tickets for them
[17:29:47] andrewbogott: ok, trying that!
[17:29:58] the MX record didn't fix the mail reject but meh, tomorrow-me problem.
[17:38:18] cloudvirtlocal1001 rebooted, and I see all VMs running
[17:39:32] is there a quick command to check if etcd is happy?
[17:41:51] yes, I'll do it since I only just figured it out
[17:42:05] * dhinus popping out for 5 mins, brb
[17:42:24] looks good
[17:42:25] dhinus:
[17:42:27] https://www.irccloud.com/pastebin/dH3sIeyt/
[17:53:20] cool! I will proceed with cloudvirtlocal1002
[18:08:05] cloudvirtlocal1002 rebooted and etcd is happy, proceeding with cloudvirtlocal1003
[18:16:18] hmm, alertmanager is still unhappy about neutron on cloudvirtlocal1002, a few mins after the reboot
[18:18:12] but I think it's just that the alert is based on min_over_time over 20m
[18:22:43] yep, they have a bit of delay
[18:22:45] * dcaro off
[18:31:19] all 3 cloudvirtlocals have been rebooted
[18:31:32] I added some notes here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#cloudvirtlocal_reboot
[18:38:35] * dhinus off
[18:50:24] thanks dhinus
[19:13:41] * bd808 lunch
[19:23:13] taavi, do you still have cloudvirt reboots running in the background or should I take over for a round?
[19:25:32] andrewbogott: feel free to. I'm tracking progress in https://phabricator.wikimedia.org/T356975, and the command I've been using is `test-cookbook -c 991016 wmcs.openstack.cloudvirt.safe_reboot --fqdn cloudvirtXXXX.eqiad.wmnet`
[19:25:46] great, I'll do a few
[19:26:20] turns out draining a node is a bit too unstable to just start the script in the background and come back hours later. I had like 3 failures in a row draining 1032, which made me switch to per-node cookbook executions, and after that it's worked flawlessly
[19:27:04] is your test-cookbook -c 991016 different from the git HEAD in the wmcs-cookbooks repo?
[19:27:40] it's HEAD with https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/991016 applied
[19:27:54] cool
[21:33:06] hmm. seems like I screwed up when removing some toolsbeta nodes: I thought the cookbook would take care of the hiera updates, which it clearly does not
[21:33:51] fixed. and apologies to anyone who got paged for that, toolsbeta being able to page in the first place seems like a bug, I'll look into that tomorrow
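On the 'cluster health' exchange above: etcd needs to be given explicit client certificates, so a minimal sketch of the kind of invocation that works looks like the following; the certificate paths are placeholders for whatever puppet actually deploys under /etc/etcd on the node in question:

  # etcd v3 API health check with explicit client certs (cert paths are placeholders)
  ETCDCTL_API=3 etcdctl \
    --endpoints="https://$(hostname -f):2379" \
    --cacert=/etc/etcd/ssl/ca.pem \
    --cert=/etc/etcd/ssl/client.pem \
    --key=/etc/etcd/ssl/client.key \
    endpoint health
  # older etcdctl builds that still default to the v2 API spell it differently:
  #   etcdctl --ca-file ... --cert-file ... --key-file ... cluster-health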