[10:12:35] !log admin restart backup_vms service in cloudvirt1024 (T300956)
[10:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:12:40] T300956: wmcs-backups: failure on cloudvirt1024 - https://phabricator.wikimedia.org/T300956
[16:32:01] The gb_by_central_id column in the globalblocks table is not visible in the wiki replicas. I guess the views need to be recreated?
[16:33:40] https://phabricator.wikimedia.org/T299827 seems to say the data should not be made available on the replicas
[16:33:58] oh wait, marostegui commented on that already
[16:35:09] hm, according to https://gerrit.wikimedia.org/g/operations/puppet/+/9e84c8ac4105aad822875916b4565a6e4dd5353d/modules/profile/templates/wmcs/db/wikireplicas/maintain-views.yaml#36 globalblocks is mapped 1:1 without a special view
[16:36:09] I don’t know enough about how those views work to comment further… probably worth a Phabricator comment, possibly reopening the task, if nobody else responds here :)
[16:38:09] globalblocks do not really contain any private data
[16:38:49] zabe: gb_by_central_id is the replacement for the plain text stewards' name, right?
[16:39:14] if so, user id numbers ain't really private either
[16:41:08] yeah, I know that (they are not private at all). The question is more about whether I still need to create a task to reset the views, or if this will eventually happen on its own?
[16:41:13] and yes, that's the replacement
[16:42:59] Maybe maintain-views in Cloud was not yet run?
I'd add a comment on the task
[16:43:21] bbl
[16:45:58] https://phabricator.wikimedia.org/T297026 seems to be the related task
[16:49:11] zabe: I suspect you'll need to create a new task for that specific view, although I'm not quite sure how to tag it (cc razzi andrewbogott who probably do)
[16:50:18] Yeah, usually there's a new task made for every view change
[16:50:30] ok, I can create one
[16:52:53] https://phabricator.wikimedia.org/T300988
[18:53:02] !log toolsbeta upgrading to kubernetes 1.21 T282942
[18:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[18:53:05] T282942: Upgrade Toolforge Kubernetes to latest 1.21 - https://phabricator.wikimedia.org/T282942
[19:03:30] Android app
[19:04:18] I don't know English very much everything to do
[20:28:10] !log tools.bash Updated to php7.4 runtime
[20:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bash/SAL
[21:22:21] Trying to start lighttpd for 'ru_monuments' tool:
[21:22:21] webservice --backend=gridengine start
[21:22:22] No 'error.log' file is created in the home directory.
[21:22:24] In 'service.log' I see 'No running webservice job found, attempting to start it' 3 times, then 'Throttled for 3 restarts in last 3600 seconds'.
[21:22:25] Need help.
[21:33:16] @avsolov: `qstat` shows a running webservice job too, but the front proxy doesn't know about it. Sometimes a `webservice stop` and then `webservice --backend=gridengine start` fixes this. I don't have a really good explanation of why things get into the state where that is needed.
[21:34:17] I did it several times. Unfortunately it didn't help.
[21:34:28] bd808: I don't have time to look much further, but today's SGE root@ spam seems to all mention this one tool
[21:34:53] @avsolov: do you mind if I try?
[21:35:03] Do it, please
[21:35:25] taavi: ack.
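The 'Throttled for 3 restarts in last 3600 seconds' message quoted above comes from a restart rate limit in the webservice tooling. As a rough illustration (this is not the real tools-webservice code; the class name and structure are invented, only the 3-per-3600-seconds limit comes from the log), such a sliding-window throttle might look like:

```python
import time


class RestartThrottle:
    """Sliding-window restart limiter: allow at most `limit` restarts
    within any `window`-second period (3 per 3600 s matches the log)."""

    def __init__(self, limit=3, window=3600.0):
        self.limit = limit
        self.window = window
        self.restarts = []  # timestamps of recent restarts

    def try_restart(self, now=None):
        now = time.time() if now is None else now
        # Drop restart timestamps that have fallen out of the window.
        self.restarts = [t for t in self.restarts if now - t < self.window]
        if len(self.restarts) >= self.limit:
            return False  # would log "Throttled for 3 restarts in last 3600 seconds"
        self.restarts.append(now)
        return True
```

With this sketch, three restarts at t=0, 10, and 20 succeed, a fourth at t=30 is throttled, and a restart at t=3700 succeeds again once the old timestamps have aged out of the window.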
I mostly ignore those, but I will look them up if this doesn't magically work :)
[21:36:38] !log tools clear error state from some webgrid nodes
[21:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:38:30] I see 'error.log' now, but it refers to kubernetes, why is that?
[21:39:35] @avsolov: I typed `webservice status` and that was the result, because I forgot to set --backend=gridengine and this tool seems to have broken Kubernetes credentials.
[21:39:53] Still trying to figure out what is going wrong with the job.
[21:40:45] The job is stuck in "qw" mode, which means that the grid scheduler can't find a place to run it yet.
[21:43:47] any advice?
[21:45:04] @avsolov: still investigating. at this point I think this is more a general grid problem than an error in your tool's code or config.
[21:46:17] taavi: it looks to me like the scheduler thinks all of the tools-sgewebgrid-lighttpd-* instances are unavailable. Any idea what might be causing that?
[21:46:40] * bd808 tries to remember how to grid engine
[21:50:48] 02/04/2022 21:39:03|worker|tools-sgegrid-master|E|queue webgrid-lighttpd marked QERROR as result of job 9099456's failure at host tools-sgewebgrid-lighttpd-0915.tools.eqiad.wmflabs
[21:51:19] (the same for the remaining lighttpd nodes; the job id is the same too for almost all the nodes)
[21:52:26] hrmmmm... that's the job from tools.ru_monuments too...
[21:53:04] the detail on that job is full of "exit_status of epilog = 1" errors
[21:54:25] also confused about why this tool has bad k8s credentials, but not sure how that would be related
[21:55:05] hold on... why does that tool have an underscore in its name?
[21:55:20] that's surely an illegal character in a dns name, and so in kubernetes namespaces
[21:56:08] oh, yup.
one of the fun legacy tools that are not namespaced for k8s
[21:57:07] and starting a grid webservice has a dependency on kubernetes these days, due to my (and to a lesser degree b.storm's) plan to get rid of the toolforge front proxy by routing sge webservices via the k8s ingress mechanism
[21:57:16] ... oh
[21:57:21] well that would do it
[21:58:42] taavi: and that landed relatively recently? Like this could reasonably be the first time that @avsolov or another maintainer tried a restart since k8s was somehow involved on the gridengine backend?
[21:59:29] the last restart for the tool's webservice was 2021-03-25
[21:59:43] that would make the "exit_status of epilog = 1" error make some sense if the epilog now tries to hit k8s and gets the connection refused error from k8s-master
[21:59:44] it landed in webservice 0.78, released last October
[21:59:46] I mean, "previous"
[22:01:54] @avsolov: *nod* I think this is starting to make sense then. Not sure about a fix yet...
[22:04:26] taavi: that `register_kubernetes` stuff in toolsws/proxy.py isn't really used yet, correct? Meaning that that is not how a grid webservice actually gets traffic.
[22:05:57] bd808: currently it's registering sge webservices to both dynamicproxy and kubernetes, but since dynamicproxy is being evaluated first, that's what's actually affecting things
[22:07:15] @avsolov: this may not have been clear yet, but it looks like the problem is that the name of this tool includes "_" and this is not compatible with our Kubernetes system. Until last October that would have only caused you problems when trying to directly use kubernetes, but now it also causes problems on the grid engine backend.
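The underscore problem discussed above comes from Kubernetes requiring namespace names to be valid RFC 1123 DNS labels. A minimal check of that rule (the regex below is the standard DNS-1123 label pattern, not code taken from maintain-kubeusers) shows why "ru_monuments" cannot get a namespace while "ru-monuments" can, and also covers the 63-character limit mentioned in T176027:

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and "-", must start and end
# with an alphanumeric, at most 63 characters. Kubernetes applies this rule
# to namespace names, which is why tool names containing "_" fail.
DNS1123_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")


def is_valid_k8s_namespace(name):
    return bool(DNS1123_LABEL.match(name))


print(is_valid_k8s_namespace("ru_monuments"))  # False: "_" is not allowed
print(is_valid_k8s_namespace("ru-monuments"))  # True
```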
[22:07:41] if you get rid of (parts of) it, the main thing to be aware of is that you don't want to leave stale entries registered, or otherwise you'll get very confusing problems at some point
[22:08:28] taavi: *nod* I'm wondering if I can add code to catch and log the k8s failure so that tools like ru_monuments will work again. There are not numerically many tools with this problem, but there are a few.
[22:09:52] that indeed sounds like a working short-term solution
[22:10:35] @avsolov: a different fix here would be for you to register a new tool with a name that works with kubernetes (like "ru-monuments"), move things to that tool, and then ask for help getting a redirect set up to send traffic for the ru_monuments legacy tool to the new name.
[22:12:02] T176027 is the old bug about the naming problem.
[22:12:03] T176027: Tools with "_" in their name or names longer than 63 characters do not get Kubernetes namespaces created - https://phabricator.wikimedia.org/T176027
[22:12:40] I see. But migrating to a new name will take some time.
[22:12:40] Can I expect any short-term solution soon?
[22:13:41] @avsolov: I will keep working on one, but I can't guarantee when I will have something
[22:23:45] I opened T301015 to track this
[22:23:46] T301015: Tools with invalid/missing Kubernetes credentials cannot start gridengine webservices - https://phabricator.wikimedia.org/T301015
[22:23:48] bd808: maybe https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/759826/?
[22:27:20] taavi: +1 given. it looks like a good way to keep it from blowing up
[22:28:22] taavi: oh.. that won't actually stop this case though...
[22:28:44] $HOME/.kube/config exists, it just has bad creds
[22:28:51] oh...
[22:29:10] and if we remove it, a new one with bad creds will be created
[22:29:38] did you test that?
if I'm reading maintain-kubeusers correctly, it should just completely ignore tools with invalid names
[22:30:17] maybe play with file permissions to prevent creating this file?
[22:31:18] taavi: I did not test that, no. The creds file there was created 2020-05-05, which made me assume it would be recreated, but maybe more protections were added later...
[22:33:04] !log tools `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
[22:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:33:09] T301015: Tools with invalid/missing Kubernetes credentials cannot start gridengine webservices - https://phabricator.wikimedia.org/T301015
[22:36:07] a full maintain-kubeusers run just finished, and it did not recreate that
[22:36:18] excellent
[22:36:29] want me to +2 your patch then?
[22:37:03] heh. too late :)
[22:37:04] just did it myself (assumed it was a "deploy when you want" style +1)
[22:37:14] yeah, it totally was
[22:37:35] so I'll tag (https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/759829/) and build 0.80
[22:37:48] thank you taavi
[22:50:53] !log tools.ru_monuments Cleaned up old log files
[22:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ru_monuments/SAL
[22:53:29] update rolled out.. I think your web service should now work again if you try to start it
[22:53:40] sorry about that
[22:54:16] may I try to start it?
[22:54:32] sure
[22:57:18] \o/ https://ru_monuments.toolforge.org/
[22:57:29] thanks for the work on that taavi
[22:57:38] Thanks a lot! It works now!
[22:57:57] i thought underscores weren't supposed to be allowed in (sub)domain names
[22:58:22] @jhobsy: there is a large mismatch between the DNS RFCs and actual DNS :)
[22:59:02] another little question: how can I reset my developer password?
@avsolov: https://wikitech.wikimedia.org/wiki/Special:PasswordReset
[23:00:33] And I see that https://wikitech.wikimedia.org/wiki/User:Avsolov is "unregistered", so first I need to attach your account there... hang on a minute
[23:02:16] @avsolov: Ok, now https://wikitech.wikimedia.org/wiki/Special:PasswordReset should work for you.
[23:02:29] thank you!
[23:03:32] someday™ I should make another fix for T174469 than the manual attach solution
[23:03:33] T174469: LDAP account that is not attached on wikitech has no means for password reset - https://phabricator.wikimedia.org/T174469
[23:04:32] if only that would be T179463 :-)
[23:04:32] T179463: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463
[23:04:48] that would be ideal :)
[23:16:51] I just checked and found that there were 4 tools in total with underscores in the tool name. I have queued 2 for deletion as they are empty stubs that are many years old. The other 2 are ru_monuments and wdq_checker, which both now have had their $HOME/.kube/config files removed.
[23:18:35] @avsolov: migrating ru_monuments to another name would probably still be a good idea in the longer term. Grid engine is not gone yet, but folks would really like it to be gone, so someday it will be.
[23:20:29] thank you for the advice. I started such a discussion in our community. We will try to migrate as soon as possible.
[23:59:11] !log cvn accidentally restarted all VMs due to misreading the project purge page. sorry!
[23:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cvn/SAL