[12:53:33] !log admin Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
[12:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:53:47] !log admin Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[12:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:01:03] !log admin Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
[13:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:01:07] !log admin Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:21:53] !log admin Rebooting the nodes cloudcephmon2002-dev,cloudcephmon2003-dev,cloudcephmon2004-dev (T281248) - cookbook ran by dcaro@vulcanus
[13:21:57] !log admin Rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:24:26] !log admin Finished rebooting node cloudcephmon2002-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:24:30] !log admin Rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:27:25] !log admin Finished rebooting node cloudcephmon2003-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:27:30] !log admin Rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:30:21] !log admin Finished rebooting node cloudcephmon2004-dev.codfw.wmnet (T281248) - cookbook ran by dcaro@vulcanus
[13:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:30:38] !log admin Finished rebooting the nodes ['cloudcephmon2002-dev', 'cloudcephmon2003-dev', 'cloudcephmon2004-dev'] (T281248) - cookbook ran by dcaro@vulcanus
[13:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:57:17] !log tools clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940
[13:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:59:06] Whom do I need to bribe to become admin on a wiki in the beta cluster? A colleague got blocked by AbuseFilter there due to heavy testing. 😬
[14:00:08] WMDE-Fisch: ask someone who can technically grant that access. I can do that, what's your username and which wiki?
[14:00:35] we like cookies!
[14:00:53] https://simple.wikipedia.beta.wmflabs.org/wiki/User:Fisch-WMDE
[14:01:23] done
[14:01:25] I'm already admin on the en beta cluster if you need any credibility proof. ;-D
[14:01:31] thx
[14:02:10] * WMDE-Fisch sends cookies
[14:04:55] WMDE-Fisch: I could do with some decent beer
[14:06:19] Reedy: But you didn't do anything!
[14:06:35] Do I need to have done so, to want some decent beer?
[14:09:02] Good point.
[14:09:43] * ma hands Reedy a bottle of Franziskaner
[14:20:12] WMDE-Fisch: https://simple.wikipedia.beta.wmflabs.org/wiki/Special:AbuseFilter/29
[14:20:28] Yeah just saw that
[14:20:42] I'm forwarding the info to my colleague
[14:20:44] thanks
[14:38:27] majavah: (or other toolforge ops) can node.js be updated? seems to be running on version 8 which is long unsupported i think?
[14:38:40] current LTS seems to be v14
[14:39:57] proc: could you please open a phab task?
[14:40:03] sure, what project?
[14:40:05] T243159 already exists
[14:40:05] T243159: Request to enable node version 12.14.1 in toolforge to deploy VideoCutTool - https://phabricator.wikimedia.org/T243159
[14:41:37] proc: we generally pay attention to `cloud-services-team (kanban)`
[14:43:53] the node10 webservice type is available fwiw (the plain “node” one is deprecated)
[14:44:01] the grid operating system upgrade in the works will come with node v10, and kubernetes has v10 available and v12 hopefully fairly soon after Debian Bullseye is enabled
[14:44:13] (sorry, the deprecated one is called “nodejs”, not “node”)
[15:09:53] !log tools.precise-tools Added majavah to edit and push ACLs for https://phabricator.wikimedia.org/source/tool-precise-tools/
[15:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.precise-tools/SAL
[15:13:20] majavah: ^ {{done}}. I'm not sure whether following the existing pattern and making a "stretch-tools" tool for the new report is best, or whether it's time to make something like "grid-deprecation" instead that can be repurposed in future. I did the latter for the last Cloud VPS switch ()
[15:14:39] I guess that depends on whether there will be a next time or not :P
[15:14:58] but grid-deprecation sounds like a good idea
[15:15:05] Oh, don't tell me that grid is being deprecated
[15:15:17] Kubernetes is such a pain to understand for me :-(
[15:15:33] ma: we're working on that
[15:15:47] * ma sighs
[15:16:00] you'll have a nice way to submit jobs to k8s without knowing anything about k8s
[15:18:18] ma: that's the long term goal, yes, but for now we're working on an operating system upgrade there too, since we need to get off stretch at some point and k8s isn't yet ready for all the workloads
[15:31:46] majavah: I guess I should subscribe to -announce then so my bot doesn't break
[15:37:16] /me feels dejavu
[15:38:30] @yuvipanda, if you want the full dejavu experience, we have open positions :-)
[15:39:03] hahaha 😄
[15:55:53] * bd808 would write yuvipanda a glowing recommendation
[16:30:24] !log tools.grid-jobs Added majavah to edit and push ACLs for https://phabricator.wikimedia.org/source/tool-grid-jobs/
[16:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.grid-jobs/SAL
[16:32:45] !log tools.gridengine-status Added majavah to edit and push ACLs for https://phabricator.wikimedia.org/source/tool-gridengine-status/
[16:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.gridengine-status/SAL
[16:33:08] majavah: more repos for you to play with! :)
[16:35:52] majavah: also feel free to direct push rather than opening a review patch in those repos. Differential is a giant pain in the butt if you are not used to it (and I'm not anymore).
[16:49:00] I've noticed that KrinkleBot crons have stopped running (getting user reports). This happens about once every other month. I haven't reported it here before, but I figured maybe it'd be good to investigate this time and see what causes it (or at least what caused it this time).
[16:49:10] $ tail fileprotectionsync.err
[16:49:10] [Wed Jun 9 14:50:08 2021] there is a job named 'fileprotectionsync' already active
[16:49:10] [Wed Jun 9 15:00:15 2021] there is a job named 'fileprotectionsync' already active
[16:49:10] [Wed Jun 9 15:10:06 2021] there is a job named 'fileprotectionsync' already active
[16:49:30] This is what I've got, and that's indeed how it usually is. There is something running but it isn't a working pywiki bot process
[16:50:12] job-ID prior name user state submit/start at queue slots ja-task-ID
[16:50:12] 3681841 0.25002 fileprotec tools.krinkl Eqw 06/08/2021 05:20:10 1
[16:50:40] crontab -l: 0,10,20,40,50 * * * * /usr/bin/jsub -once -quiet -mem 500m -N fileprotectionsync $HOME/pywikienv/bin/python3 $HOME/src/pywiki-fileprotectionsync/fileprotectionsync.py
[16:51:46] stat info: scheduling info: Job is in error state
[16:52:32] I don't see anything useful in the .err file though:
[16:52:42] 2021-06-08 05:11:43 Page [[Commons:Auto-protected files/misc/logos]] saved
[16:52:42] [Tue Jun 8 05:40:09 2021] there is a job named 'fileprotectionsync' already active
[16:52:59] it goes from last working at 05:11 and then nothing for a while and then these errors 20min later for the next 24h
[16:53:04] have you considered moving the bot to a kubernetes CronJob?
[16:53:19] I have not.
[16:55:03] > error reason 1: can't get password entry for user "tools.krinklebot". Either user does not exist or error with NIS/LDAP etc.
[16:55:20] that's the LDAP hiccup error :/
[16:55:31] ok, but why does it keep occupying the grid slot in that case?
[16:55:40] skipping a run seems fine indeed
[16:56:06] I can look at using k8s jobs if that's recommended these days for non-web stuff as well.
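Editorial sketch, not part of the original conversation: the `.err` excerpts above suggest one way to notice a stuck job early, namely looking for a trailing run of "already active" lines with no normal output in between. The snippet below is a minimal sketch of that idea. The function name `first_blocked_run` is hypothetical, the line format is assumed from the excerpts quoted in the chat, and weekday and month names are assumed to be in English.

```python
import re
from datetime import datetime

# Matches the lines quoted in the chat that report a duplicate job name.
ALREADY_ACTIVE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] there is a job named '(?P<name>[^']+)' already active$"
)


def first_blocked_run(lines):
    """Return (job name, datetime) for the start of the trailing run of
    'already active' lines, or None if the log does not end in such a run."""
    first = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        match = ALREADY_ACTIVE.match(line)
        if match is None:
            first = None  # normal output means the job ran, so reset
        elif first is None:
            stamp = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
            first = (match.group("name"), stamp)
    return first


# Sample lines copied from the chat, without the IRC timestamps.
sample = [
    "2021-06-08 05:11:43 Page [[Commons:Auto-protected files/misc/logos]] saved",
    "[Tue Jun 8 05:40:09 2021] there is a job named 'fileprotectionsync' already active",
    "[Wed Jun 9 14:50:08 2021] there is a job named 'fileprotectionsync' already active",
]
print(first_blocked_run(sample))
```

A check like this only reports the symptom. The job still has to be cleared from its error state on the grid side, as discussed in the messages that follow.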
[16:56:07] because the grid is ancient, mysterious, and designed for you to watch your jobs and manage error states
[16:56:28] there is literally nothing user friendly about grid engine
[16:56:34] ok :)
[16:57:18] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Kubernetes_cronjobs is the cronjob documentation, I've been using it for all my bots for a while and haven't had an issue
[16:57:40] I believe there are a few misbehaving tools (T282474 etc) that are sometimes running nodes out of resources which creates issues like that
[16:57:41] T282474: tools.topicmatcher update_items_from_sparql.php frequently running Toolforge nodes out of resources - https://phabricator.wikimedia.org/T282474
[16:58:38] Krinkle: b.storm may be able to give more nuanced input, but I think that when a grid job dies because of a system error like the LDAP failure it stays in the accounting in a hard error state as a weak way to keep jobs from failing in a tight loop. They also "break" the exec node that they happened on by marking that queue as being in error state.
[17:02:18] !log tools.sge-status deploying https://phabricator.wikimedia.org/R1921:c256f778bbdfbf63aec831f78d9458f1c05bc6ff
[17:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sge-status/SAL
[17:02:44] I've previously avoided the cronjob YAML file (it's been a few years) as it seemed too heavy on boilerplate, where my brain starts to think this will keep changing over time and be more logic to maintain. Even though it is the same for lots of people, it becomes substituted for me once I copy it, which is bad for me, but also means you don't have the ability to tune stuff proactively for "simple" use cases.
[17:02:54] that's just a gut feeling though
[17:05:23] we're working on a set of tooling around kubernetes that makes the most common grid use cases about as simple as `cron` and `webservice` are today
[17:08:26] Krinkle: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_jobs
[17:08:43] in particular: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_jobs#Feature_mapping
[17:11:21] arturo: oh nice. yeah, that seems like the right abstraction. cmd + container + schedule.
[17:12:14] I'm working really hard to have some kind of initial/functional code working by the end of this Q
[17:21:19] arturo: the wikiloveslove/cronjobs.yaml example there, would that match what this tool would implicitly create/manage, and/or is there a better/newer recommendation you have in mind?
[17:21:39] I could switch to k8s cronjob.yaml in that case
[17:22:01] Krinkle: yeah, it's basically the same, plus some additional labels and such. But the base thing is the same: container, schedule, command
[17:23:22] the code is here: https://github.com/wikimedia/cloud-toolforge-jobs-framework-api/blob/main/tjf/job.py#L112
[17:25:49] noticing a different apiVersion there
[17:27:30] good point. The kubernetes version I'm using already graduated the CronJob object from beta
[17:28:32] our idea is basically the same as we have with webservices: we will provide an abstraction layer, but nothing will prevent you from using the k8s API directly (if you love it :-P)
[17:29:04] there's no notable difference between v1beta1 and v1 per https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25 but if we're on k8s v1.21 we should probably update the docs
[17:29:15] we are not (yet)
[17:29:27] alright, then v1beta1 it is
[17:29:35] * majavah points AntiComposite to T280299
[17:29:43] I wonder why `Kubernetes cronjobs` docs are so prominent in our doc page
[17:33:06] !log admin removed icinga downtime for cloudmetrics1002 -- to see if hardware is healthy (T281881)
[17:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:33:10] T281881: hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881
[17:33:45] page started with webservices, cronjobs, and continuous jobs, then grew downward from there
[17:35:02] arturo: ok if I link those wiki pages to task T283238?
[17:35:03] T283238: Toolforge: develop jobs-framework-api - https://phabricator.wikimedia.org/T283238
[17:35:12] Krinkle: cool
[20:22:57] bd808: did you find some time to take a look at T284144?
[20:22:58] T284144: Mirroring tool-spacemedia Diffusion repository to GitHub seems to be broken - https://phabricator.wikimedia.org/T284144
[20:23:49] don-vip: not yet, no. I probably won't until my Friday
[20:24:34] okay, thanks :)
[20:24:39] things have been ... hectic :)
[20:24:52] yes, I guess :)
[20:26:49] * bd808 tries the 'simple' fix of scheduling the repo for a manual run in the update queue
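Editorial sketch, not part of the original conversation: the "cmd + container + schedule" abstraction from the CronJob discussion above can be illustrated by building a manifest from those three inputs. This is a minimal sketch under stated assumptions, not the Toolforge jobs framework's actual output. The helper `cronjob_manifest` and the image name are hypothetical, the command assumes `$HOME` is set inside the container, and any Toolforge-specific labels mentioned in the chat are omitted. The `apiVersion` defaults to `batch/v1beta1` as settled on in the log. `concurrencyPolicy: Forbid` is added here as a rough counterpart of `jsub -once`, which is an editorial choice rather than something the chat prescribes.

```python
import json


def cronjob_manifest(name, schedule, image, command, api_version="batch/v1beta1"):
    """Build a Kubernetes CronJob manifest as a plain dict."""
    container = {"name": name, "image": image, "command": command}
    return {
        "apiVersion": api_version,
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Skip a run while the previous one is still active.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [container],
                            "restartPolicy": "Never",
                        }
                    }
                }
            },
        },
    }


manifest = cronjob_manifest(
    name="fileprotectionsync",
    schedule="0,10,20,40,50 * * * *",  # schedule copied from the crontab line in the chat
    image="example.invalid/python3:latest",  # hypothetical image name
    command=[
        "/bin/sh",
        "-c",
        "$HOME/pywikienv/bin/python3 "
        "$HOME/src/pywiki-fileprotectionsync/fileprotectionsync.py",
    ],
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output could be saved to a file and applied after checking it against the Toolforge cronjob documentation linked in the chat.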