[09:37:08] !log admin Taking one osd daemon down on codfw cluster (T288203)
[09:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:37:12] T288203: ceph: Test behavior when an osd host goes down on codfw - https://phabricator.wikimedia.org/T288203
[10:10:08] dcaro: Heads up, I am about to make that change on cloudsw1-c8-eqiad; apologies, I was a little late starting due to circuits flapping.
[10:10:29] 👍
[10:11:23] Ingress done, no ping loss, doing egress now...
[10:11:53] Completed.
[10:12:11] No signs of problems, doing more checks...
[10:12:40] looking ok from my side too
[10:18:30] Ok, I think everything is working without problems. LibreNMS polled the device again a few minutes back and all the health stats look ok, traffic levels are healthy, and all my manual checks passed too.
[10:18:40] I'll do cloudsw1-d5-eqiad at the top of the hour.
[10:18:43] \o/
[10:18:48] thanks again :)
[10:19:21] np... it will be interesting to see if it helps much with the discard rate when things get busy. fingers crossed :)
[10:23:11] d5 does not have as many errors though (it's connected to the row B asw through c8), but better to have consistent configs, yep
[11:00:19] the graphs look good so far :)
[11:01:36] though it coincides with a regularly quiet time xd
[11:02:02] at 14:00 there's usually a spike of errors, we'll see then
[11:08:17] yep, 14:00 is the big one.
[11:08:29] Just about to make the change on d5 now btw.
[11:08:48] 👍
[11:09:48] Ingress applied... seems ok.
[11:10:11] and complete.
[11:10:20] no pings dropped
[11:10:21] everything looking ok
[11:10:46] I'll do the same checks but expect we're ok, will let you know.
[11:11:00] okok
[11:22:51] All looks ok: health checks, LibreNMS graphs etc. all look fine. I'll close off those tasks.
[11:23:00] \o/
[11:23:10] time for lunch then :), thanks a lot!!
[11:26:07] np!
[13:41:18] o/ random question, but during our research team office hours this week we got a question about whether Airflow (T284225), which is being implemented on some internal machines for workflow management, is under consideration for being added to community resources like Toolforge as a more robust alternative to cron etc. I assume Grid Engine + cron is sticking around, but I figured I'd ask
[13:41:18] T284225: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225
[13:48:03] isaacj_: we're hoping to move everything to Kubernetes long term, actually
[13:50:40] majavah: ahh yes, that makes sense. I'm not super familiar with the particulars of Kubernetes - would you be using a system like Airflow or Argo, or is there a native workflow management piece with Kubernetes?
[13:53:18] isaacj_: the current plan is native Kubernetes cronjobs/deployments for job scheduling, with some tooling to make it easier to use (T285944), plus Tekton pipelines and buildpacks for builds/deployments
[13:53:18] T285944: Toolforge: beta phase for the new jobs framework - https://phabricator.wikimedia.org/T285944
[13:53:42] I'm not familiar enough with Airflow or the like to compare it to anything, unfortunately
[13:57:18] majavah: Ahh excellent - thanks! I'll pass that information back then. Not super familiar with the different systems either, other than I know Airflow works nicely with Python, which makes it sound good to me as a mainly Python coder :)
[13:59:22] isaacj_: sure, as long as you include the massive caveat of "things are work in progress and subject to take ages or change" :)
[13:59:32] I think that the current (and future) jobs implementation is way simpler than what Airflow provides though; it is focused on running scheduled single jobs (like cronjobs), while Airflow is a fully fledged workflow service (graphs, retries, etc.)
[13:59:50] that also makes them way simpler than Airflow though xd
[14:00:10] majavah: of course :)
[14:01:12] dcaro: yeah, that makes sense. It was asked in the context of larger-scale Wikidata / Commons workflows, which is why Airflow I think was of interest, but I assume you're right that most folks want simple cron-level management
[14:06:11] I think it could be discussed; we plan on using Tekton pipelines internally, which is something like a 'Kubernetes-native' version of pipelines, but so far it's not planned to be open to users (that would require multi-tenancy and such). If you can build a strong case for it we can prioritize accordingly xd
[14:12:48] :thumbs up:
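A note on the "native Kubernetes cronjobs" plan discussed above (T285944): the sketch below shows roughly what such a scheduled job looks like when created through the official Kubernetes Python client. It is not the Toolforge jobs framework itself; it assumes a recent kubernetes Python client (with batch/v1 CronJob support) and kubeconfig access to a namespace, and the namespace name, image, schedule, and command are made-up placeholders.

# Minimal sketch, not the Toolforge tooling: create one scheduled job
# object via the official "kubernetes" Python client.
from kubernetes import client, config

config.load_kube_config()  # inside a cluster, use load_incluster_config() instead

NAMESPACE = "tool-example"  # hypothetical namespace, placeholder only

cronjob = client.V1CronJob(
    api_version="batch/v1",
    kind="CronJob",
    metadata=client.V1ObjectMeta(name="daily-task"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",  # standard cron syntax: every day at 03:00
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="task",
                                image="python:3.9",  # placeholder image
                                command=["python3", "-c", "print('hello')"],
                            )
                        ],
                    )
                )
            )
        ),
    ),
)

# Submit the CronJob; Kubernetes then spawns a Job/Pod on each schedule tick.
client.BatchV1Api().create_namespaced_cron_job(namespace=NAMESPACE, body=cronjob)

Unlike Airflow, there is no graph of dependent tasks here: each CronJob is a single scheduled unit, which matches the "scheduled single jobs vs. fully fledged workflow service" distinction dcaro draws above.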
[21:50:56] !log tools.iabot truncated 995G Workers/Worker2.out T288276
[21:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[21:50:59] T288276: 2021-08-05: Tools NFS share cleanup - https://phabricator.wikimedia.org/T288276
[21:55:57] !log tools.khanamalumat truncating 30GB khanamalumat/shaher.err file
[21:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.khanamalumat/SAL
[22:03:57] Cyberpower678: are you around?
[22:26:37] @harej: are you around? bstorm is trying to find Cyberpower678. IABot is writing gigabytes of error log per minute, which is in turn filling NFS for all of Toolforge.
[22:26:56] It is past midnight where he currently is
[22:27:10] heh.
[22:27:14] we need to stop the bot if nobody can fix it :/
[22:27:17] Please do
[22:27:31] Ok. I'll chmod the file for now
[22:28:09] thanks bstorm and @harej
[22:30:18] !log tools.iabot chmodded the file Worker2.out to read-only to stop the bleeding T288300
[22:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[22:30:21] T288300: IAbot is writing loads of text to Toolforge NFS at a high rate - https://phabricator.wikimedia.org/T288300
[22:31:33] Not good enough. Gotta chown it too
[22:33:53] !log tools.iabot chowned the file Worker2.out to cyberpower678's account so `take` might work to revert things T288300
[22:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[22:34:13] That seems to have slowed it down :)
[23:04:05] !log tools extended docker registry volume to 120GB T288229
[23:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:04:10] T288229: tools-docker-registry almost out of disk space - https://phabricator.wikimedia.org/T288229
[23:50:56] !log tools rebooting the docker registry T288229
[23:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:51:01] T288229: tools-docker-registry almost out of disk space - https://phabricator.wikimedia.org/T288229
[23:51:37] * bstorm apologizes to anyone running `webservice` at this very moment
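As an illustration of the file-level cleanup in the !log entries above, here is a rough Python sketch of the steps applied to Worker2.out (truncate the runaway log, then make it read-only). It is an assumption-laden sketch, not the exact procedure bstorm used; the ownership hand-back is only indicated in a comment because it needs root and the maintainer's real uid/gid.

# Rough sketch of the cleanup steps described above (run with sufficient privileges).
import os

LOG_PATH = "Workers/Worker2.out"  # the runaway log file named in the !log entries

os.truncate(LOG_PATH, 0)   # drop the ~995G of contents to free NFS space
os.chmod(LOG_PATH, 0o444)  # remove write permission so new opens of the file fail
# Note: chmod does not affect file descriptors the bot already has open,
# which is why ownership was changed as well in the log above. Handing the
# file back so the maintainer's `take` can restore things would look like:
# os.chown(LOG_PATH, maintainer_uid, maintainer_gid)  # needs root; values not shown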