[05:59:17] andrewbogott: I've had success stopping some tools (e.g. calling-card). Others continue to run after successful execution of the two commands (e.g. blockcalc). I can still see the web interface. qstat and webservice status also show it running. the TOOL_DISABLED file is in the directory. I will try others...
[10:42:51] we have a request on the cloud@ mailing list about installing elasticsearch in cloudvps. I think they should use opensearch instead, maybe via a docker image?
[10:43:06] do you know of other people who are successfully running opensearch in cloudvps?
[10:53:43] dhinus: yeah, my advice for them would be to avoid puppet and go with docker, the packages directly, or whatever else upstream recommends
[10:54:07] thanks, I will reply to the email
[10:54:25] I'll add a warning to the puppet-on-cloud-vps docs to hopefully discourage non-SREs (other than those with a project already relying on it) from using it
[10:55:09] sounds good!
[10:55:51] the screenshots are also a bit old, and we should update them at some point, but I don't think it's a big issue
[10:55:55] yep
[10:58:57] https://wikitech.wikimedia.org/w/index.php?title=Help:Puppet&diff=prev&oldid=2136160
[11:00:45] btw, turns out the metric I was using to track how close we are to running out of k8s capacity was wrong: it was also counting completed pods, which it should not have been. I updated https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1&from=now-1h&to=now&viewPanel=36 and will update the alerts too
[11:03:04] oh nice find, thanks
[11:03:50] re: opensearch/docker, is there an easy puppet way to install docker engine on cloud vps, or should they just "apt install docker.io"?
[11:04:32] or maybe just use the dpkg as you said https://opensearch.org/docs/latest/install-and-configure/install-opensearch/debian/
[11:05:52] there is profile::docker::engine, but again I don't think recommending the use of puppet is a good idea
[11:06:34] thanks, I'm posting a reply to cloud@, feel free to follow up with more details
[11:09:23] komla: seems like disabling web services is a bit broken but https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/7 fixes it. I'll merge and deploy it, and you can retry in a few moments
[12:52:09] T353104 is a big ask, but I think we can do that, and those are tools using lots of grid engine resources. can I get a +1?
[12:52:09] T353104: Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104
[13:02:51] andrewbogott: https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/8 makes it possible to re-run the disable script if something fails the first time
[13:17:44] I updated the 404 handler to show a custom page for tools affected by the grid engine shutdown, for example https://calling-card.toolforge.org/
[14:12:43] there are rabbitmq security updates, is there anything in particular to keep an eye on for when updating the cloudrabbit* servers?
[14:15:12] moritzm: it's pretty fragile, best if you create a task and leave the update to me so I can rebuild the cluster if things go poorly
[14:15:35] taavi: that 404 page is super fancy!
[14:16:14] komla: we've made several changes and updates to the grid-stop scripts, can you try again?
[14:18:32] moritzm: understood, I'll go create a task in a few
[14:20:08] thanks!
[14:27:51] wow, I love the 404 page :)
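For context on the 404 handler change discussed above: the handler needs some way to tell "tool was deliberately disabled for the grid shutdown" apart from "tool never existed". A minimal sketch of the idea, assuming a Flask-style handler and the TOOL_DISABLED marker file mentioned earlier in the log; the real proxy handler, paths, and template names will differ:

```python
# Hypothetical sketch of a 404 handler that shows a custom page for
# tools disabled during the grid engine shutdown. Marker-file path and
# template names are assumptions, not the real implementation.
import os

from flask import Flask, render_template, request

app = Flask(__name__)

TOOL_HOME = "/data/project/{tool}"   # assumed tool home directory layout
DISABLED_MARKER = "TOOL_DISABLED"    # marker file mentioned in the chat


def tool_name_from_request() -> str:
    # e.g. calling-card.toolforge.org -> "calling-card"
    return request.host.split(".")[0]


@app.errorhandler(404)
def tool_not_found(error):
    tool = tool_name_from_request()
    marker = os.path.join(TOOL_HOME.format(tool=tool), DISABLED_MARKER)
    if os.path.exists(marker):
        # Tool exists but was deliberately disabled: explain the shutdown.
        return render_template("grid_disabled.html", tool=tool), 404
    # Otherwise fall back to the generic "no such tool" page.
    return render_template("not_found.html", tool=tool), 404
```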
[14:29:25] taavi: I +1d the resource request, andrewbogott do you have any concerns? T353104
[14:29:26] T353104: Request increased quota for cewbot, toc, signature-checker, mgp-cewbot Toolforge tool - https://phabricator.wikimedia.org/T353104
[14:30:57] dhinus: seems OK to me although it's a lot of ram -- we might need to add another worker node (or two?) just to support that.
[14:31:14] My guess is that taavi is already frantically adding new nodes to support the grid migration though
[14:31:45] I'm indeed keeping an eye on that - https://gerrit.wikimedia.org/r/c/operations/puppet/+/983692 will make it a lot easier though
[14:33:15] I was checking that patch but I'm not familiar with prometheus "recording rules"... let me have a quick look at the docs
[14:34:46] you're basically exposing a "pre-computed" metric that you can then use in a dashboard?
[14:36:32] yeah, in dashboards or alert rules. for more complex metrics like that one it's much cheaper than computing it on-demand
[14:41:07] cheaper as in: it's no longer taking a long time to show values?
[14:42:45] cheaper as in "quicker and less resource-intensive to calculate"
[14:44:38] dhinus: ok to merge your jemalloc puppet change? it's pending on the puppetmaster
[14:45:29] ah sorry got distracted
[14:45:31] please go ahead
[14:46:42] dhinus: It sure was nice to spend a weekend not restarting tools-db! nice work
[14:47:02] thanks :) I'm still seeing a slight downward trend, I'll keep an eye on it
[14:47:38] I'm curious to see what data-persistence folks think about it, and if they have ever tried using jemalloc in prod
[14:48:33] it's also possible that this is no longer a problem after upgrading to mariadb 10.6 + bookworm
[14:52:32] taavi, that 404 page looks amazing. Well done!
[15:07:14] Do we want to include any messaging specific to users of the tool who might see the page? I'm again thinking about ensuring that disabling tools in an attempt to reach maintainers doesn't result in an unintended support burden
[15:08:35] taavi andrewbogott: sorry, small followup patch for jemalloc https://gerrit.wikimedia.org/r/c/operations/puppet/+/983722
[15:09:58] dhinus: +1'd
[15:10:00] btw, is there a way to create a relation chain in gerrit even if the other patch has already been merged?
[15:10:16] taavi: thanks
[15:11:45] dhinus: I don't think so? Although if you use the same topic branch (and use topic branches correctly, which I mostly don't) they'll show as a group in the UI
[15:12:20] yeah, in cases like this one it would be nice to have a link to my previous patch, I think the Bug: id provides some sort of link
[15:13:44] Yeah, and sometimes I add 'follow up to:' in the code comments
[15:18:34] komla, how do you plan to keep track of what on the unreached tools list has been disabled?
[15:27:50] balloons: I have a column on phab
[15:28:23] andrewbogott: noted
[15:29:31] I will rename the column on phab from 'shutdown' to 'disabled'
[15:33:02] 👍
[15:37:11] I see both "shutdown" and "disabled" on https://phabricator.wikimedia.org/project/view/6135/?
[16:04:39] taavi: I renamed 'shutdown' to 'disabled'. let me refresh and see
[16:19:09] komla, I went and checked the http status codes of all the tools. Looks like 136 of the 255 I checked return success
[16:19:21] s/tools/unreached tools
[16:20:12] https://docs.google.com/spreadsheets/d/1p_pJA3Cthyr3zObesC5oh_JR85iUCdECax8R1Cco9OQ/edit is a quick google sheet of the results if it helps
[16:21:10] thanks! let me take a look
[16:24:36] I want to be able to filter it. can I have edit access?
[16:26:23] komla, yes.. Also, https://paste.toolforge.org/view/4a07a8f6 if you want to repeat it
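The paste linked above has the actual script that produced those numbers; here is a minimal sketch of that kind of check, not taken from the paste (the input file name and output format are assumptions):

```python
# Hedged sketch: probe each tool's webservice URL and record the HTTP
# status, roughly the kind of check described above. Reads one tool
# name per line from an assumed input file.
import requests

with open("unreached_tools.txt") as f:  # hypothetical tool list
    tools = [line.strip() for line in f if line.strip()]

for tool in tools:
    url = f"https://{tool}.toolforge.org/"
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        status = resp.status_code
    except requests.RequestException as exc:
        status = f"error: {exc.__class__.__name__}"
    # Tab-separated output is easy to paste into a spreadsheet.
    print(f"{tool}\t{url}\t{status}")
```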
[16:28:39] It may be worth conveying some of that information on the wiki or in communication about the disabling
[16:29:42] I will update the wiki and point to the phab column/list
[16:30:50] Ok. The current tool status should help you as you start to disable things. You'll know from the list which tools have an active web presence and therefore might be more visible to users than others
[16:34:54] yes, yes
[16:35:26] andrewbogott: did you do anything regarding designate recently? I'm seeing some REFUSED queries that seem very worrying: https://phabricator.wikimedia.org/P54489
[16:35:47] nothing recent but I can investigate
[16:37:42] taavi: 'tooforge'?
[16:37:51] oops that might explain.. sorry
[16:38:03] :D
[16:38:05] yeah that works much better :D
[16:38:50] balloons: for some reason your spreadsheet uses tooforge.org instead of toolforge.org for zygserv
[16:41:02] whoops! Good catch.
[17:11:21] I'm going to upgrade rabbitmq in eqiad1 which (based on my experience with codfw1dev) will require me to rebuild the cluster. So things will be a bit broken for a few minutes.
[17:28:14] "sudo become canary" fails with 'no such tool'. the disable tool also fails with filenotfounderror. on the surface, it appears it doesn't exist, but the 503 web interface error says it's a tool from yuvipanda. does it mean it's deleted?
[17:30:38] I didn't spend a bunch of time looking, but I was surprised to see tools like T319993 on the unreached list, as the listed maintainer at least is quite active on phabricator
[17:30:38] T319993: Migrate raymond from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319993
[17:34:22] komla: judging from the name I would guess that that was some test of the proxy service that never had an associated tool. I think you should ignore it for now.
[17:35:02] At least, it doesn't seem to have a home dir. Do you have reason to think it is something that uses the grid?
[17:37:33] okay
[17:42:20] root@tools-nfs-2:/srv/tools/archivedtools# ls -l canary.tgz
[17:42:20] -rw-r--r-- 1 root root 14670909 Nov 29 14:02 canary.tgz
[17:42:25] yeah, a now-deleted tool
[18:11:43] noted
[18:14:26] Do we always leave the ou=servicegroup entry in LDAP when we archive a tool? That one is still there at cn=tools.canary,ou=servicegroups,dc=wikimedia,dc=org which is what the 404 handler is finding. That seems like a bug.
[18:15:21] indeed sounds like a bug. the user account entry is meanwhile gone
[18:43:07] bd808: are you seeing that record left behind for other tools? The code /looks/ like it deletes it...
[18:44:37] and in the logs I see things like "Removing ldap entry for cn=tools.media-insights-hub,ou=servicegroups,dc=wikimedia,dc=org"
[18:53:02] andrewbogott: I didn't try any other spot checks. It does seem like something we would have noticed before if it was endemic.
[18:54:35] * andrewbogott lifts rug, readies broom
[18:55:42] I spot checked 5 archived tools. They do NOT have dangling ou=servicegroup records.
[18:58:21] andrewbogott: balloons: komla: can any of you send a cloud-admin@ email with details on how to re-enable a grid tool, and the policy on who can request that and how?
[18:58:54] I will
[19:00:01] bd808: I was going to delete the entry for tools.canary and I can't find it, did you delete it already?
[19:00:44] I did not. It is at dn: cn=tools.canary,ou=servicegroups,dc=wikimedia,dc=org
[19:03:19] * bd808 lunch
[19:03:48] ...what am I missing?
[19:03:50] https://www.irccloud.com/pastebin/S2h7czc8/
[19:05:52] `ldapsearch -x -b cn=tools.canary,ou=servicegroups,dc=wikimedia,dc=org` -- searching by dn is hard. you basically have to set it as the base of the search.
[19:06:32] or search more loosely like `ldapsearch -x cn=tools.canary`
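The same point in Python, for reference: when you already have the full DN, you search with scope BASE under that DN rather than filtering on cn. A sketch assuming the ldap3 library and anonymous read access; the server address is an assumption, not taken from the log:

```python
# Hedged sketch: look up an LDAP entry by its full DN, mirroring
# `ldapsearch -x -b <dn>`. The server hostname is an assumption.
from ldap3 import BASE, Connection, Server

server = Server("ldap://ldap-ro.eqiad.wikimedia.org")  # assumed RO host
conn = Connection(server, auto_bind=True)              # anonymous bind

dn = "cn=tools.canary,ou=servicegroups,dc=wikimedia,dc=org"
# With scope BASE, the DN itself is the only candidate entry, so this
# returns it if (and only if) it exists.
found = conn.search(dn, "(objectClass=*)", search_scope=BASE,
                    attributes=["cn", "member"])
print(conn.entries if found else f"no entry at {dn}")
```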
[19:08:25] * bd808 really lunch
[19:19:53] what the heck
[19:23:05] taavi: here's my proposal:
[19:23:07] https://www.irccloud.com/pastebin/GK6bpXWw/
[19:41:00] email sent
[19:47:37] andrewbogott, thank you!
[20:14:41] andrewbogott: the grid queue is still filling up quite rapidly with now-disabled tools somehow https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1&from=now-2d&to=now&viewPanel=8 :/
[20:16:37] Happen to know if that predates your merge of those patches today?
[20:17:08] I think it's some issue with how the cron files are being deleted
[20:17:47] Do you have an example I can look at?
[20:18:36] for example tools.phabbot, auth logs show komla running stop_grid_for_tool.py at 18:51Z but cron jobs keep running and the crontab file is magically restored at 20:05Z
[20:18:51] it can't be someone manually re-creating the cron file, as that would leave a trace in the same auth.log file
[20:19:12] great, I'll look at that one
[20:19:58] I still have a suspicion cron might not like you manually touching the files in the /var/spool/cron/crontabs/ directory instead of using `crontab`
[20:20:00] I'll hold off on the process for now.
[20:21:50] taavi: I think this is another case of the disable-tool loop re-enabling things. I'll write a patch
[20:30:32] untested, but https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/9
[20:31:44] cronttab
[20:32:57] should be fine
[20:36:10] komla: is it easy for you to re-run the script on tools-sgecron-2 for tools that you already shut down?
[20:37:05] andrewbogott: yes
[20:37:16] please do, and we'll see if that graph settles down
[20:37:28] Ok, I'm on it
[21:09:45] https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1&from=now-2d&to=now&viewPanel=8 is looking good so far but we'll see if it finds another reason to bounce back
[21:11:42] T353671 if someone has a bit of time to run a quota update playbook for the RelEng folks
[21:11:53] stashbot: hello?
[21:11:54] T353671: Request additional resources for devtools project - https://phabricator.wikimedia.org/T353671
[21:11:54] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[22:17:36] ^^ done
[22:39:18] * andrewbogott out for now
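On the cron point raised at 20:19:58 above: writing into /var/spool/cron/crontabs/ directly bypasses crontab(1)'s own bookkeeping, so cron may not notice the change or may restore its own state over it; going through the crontab command avoids that. A sketch of what a disable/restore step could look like under that assumption (the backup path and function names are illustrative, not what disable-tool actually does):

```python
# Hedged sketch: save and remove a tool user's crontab via crontab(1)
# instead of touching /var/spool/cron/crontabs/ directly, so cron's
# bookkeeping stays consistent. Backup location is an assumption.
import subprocess

def disable_crontab(user: str, backup_path: str) -> None:
    # `crontab -u USER -l` prints the user's crontab; save it for later.
    dump = subprocess.run(["crontab", "-u", user, "-l"],
                          capture_output=True, text=True)
    if dump.returncode == 0:  # non-zero usually means "no crontab for user"
        with open(backup_path, "w") as f:
            f.write(dump.stdout)
        # `crontab -u USER -r` removes the crontab the supported way.
        subprocess.run(["crontab", "-u", user, "-r"], check=True)

def restore_crontab(user: str, backup_path: str) -> None:
    # Installing from a file also goes through crontab(1).
    subprocess.run(["crontab", "-u", user, backup_path], check=True)

# Example with a hypothetical path:
# disable_crontab("tools.phabbot", "/data/project/phabbot/crontab.disabled")
```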