[09:37:08] <dcaro>	 !log admin Taking one osd daemon down ot codfw cluster (T288203)
[09:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:37:12] <stashbot>	 T288203: ceph: Test behavior when an osd host goes down on codfw - https://phabricator.wikimedia.org/T288203
[10:10:08] <topranks>	 dcaro:  Heads up I am about to make that change on cloudsw1-c8-eqiad, apologies was a little late starting due to circuits flapping.
[10:10:29] <dcaro>	 👍
[10:11:23] <topranks>	 Ingress done, no ping loss, doing egress now...
[10:11:53] <topranks>	 Completed.
[10:12:11] <topranks>	 No signs of problems, doing more checks...
[10:12:40] <dcaro>	 looking ok from my side too
[10:18:30] <topranks>	 Ok think everything is working without problem.  LibreNMS polled the device there again a few mins back and all health stuff looking ok there, traffic levels healthy, all my manual checks also ok.
[10:18:40] <topranks>	 I'll do cloudsw1-d5-eqiad at the top of the hour.
[10:18:43] <dcaro>	 \o/
[10:18:48] <dcaro>	 thanks again :)
[10:19:21] <topranks>	 np... be interesting to see if it helps much with discard rate when things get busy.  fingers crossed :)
[10:23:11] <dcaro>	 d5 does not have so many errors though (it's connected to row b asw through c8), though better have consistent configs yep
[11:00:19] <dcaro>	 the graphs look good so far :)
[11:01:36] <dcaro>	 though it coincides with a regularly quiet time xd
[11:02:02] <dcaro>	 at 14:00 there's usually a spike of errors, we'll see then
[11:08:17] <topranks>	 yep 14:00 is the big one.
[11:08:29] <topranks>	 Just about to make change on d5 now btw.
[11:08:48] <dcaro>	 👍
[11:09:48] <topranks>	 Ingress applied.. seems ok.
[11:10:11] <topranks>	 and complete.
[11:10:20] <topranks>	 no pings dropped
[11:10:21] <dcaro>	 everything looking ok
[11:10:46] <topranks>	 I'll do the same checks but expect we're ok, will let you know.
[11:11:00] <dcaro>	 okok
[11:22:51] <topranks>	 All looks ok, health checks, LibreNMS graphs etc. all looking ok.  I'll close off those tasks.
[11:23:00] <dcaro>	 \o/
[11:23:10] <dcaro>	 time for lunch then :), thanks a lot!!
[11:26:07] <topranks>	 np!
[13:41:18] <isaacj_>	 o/ random question but during our research team office hours this week, we got a question about whether Airflow ( T284225 ) which is being implemented on some internal machines for workflow management was under consideration for being added to community resources like Toolforge as a more robust alternative to cron etc. i assume Grid Engine + cron is sticking around but I figured I'd ask
[13:41:18] <stashbot>	 T284225: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225
[13:48:03] <majavah>	 isaacj_: we're hoping to move everything to Kubernetes long term, actually
[13:50:40] <isaacj_>	 majavah: ahh yes that makes sense. I'm not super familiar with the particulars of kubernetes - would you be using a system then like airflow or argo or is there a native workflow management piece with kubernetes?
[13:53:18] <majavah>	 isaacj_: the current plan is native kubernetes cronjobs/deployments for job scheduling with some tooling to make it easier to use (T285944) plus Tekton pipelines and buildpacks for builds/deployments
[13:53:18] <stashbot>	 T285944: Toolforge: beta phase for the new jobs framework - https://phabricator.wikimedia.org/T285944
[13:53:42] <majavah>	 I'm not familiar with airflow or the like to compare it to anything unfortunately
[13:57:18] <isaacj_>	 majavah Ahh excellent - thanks! I'll pass that information back then. Not super familiar with the different systems either other than I know airflow works nicely with Python, which makes it sound good to me as a mainly Python coder :)
[13:59:22] <majavah>	 isaacj_: sure, as long as you include the massive caveat of "things are work in progress and subject to take ages or change" :)
[13:59:32] <dcaro>	 I think that the current (and future) jobs implementation is way simpler than what airflow provides though, they are focused on running scheduled single jobs (like cronjobs), while airflow is a fully fledged workoflow service (graphs, retries, etc.)
[13:59:50] <dcaro>	 that also makes them way simpled than airflow though xd
[14:00:10] <isaacj_>	 majavah: of course :)
[14:01:12] <isaacj_>	 dcaro: yeah, that makes sense. It was asked in the context of larger scale Wikidata / Commons workflows, which is why airflow I think was of interest, but I assume you're right that most folks want simple cron-level management
[14:06:11] <dcaro>	 I think it could be discussed, we plan on using tekton pipelines internally, that is somehow a 'kubernetes native' version of pipelines, but so far it's not planned to be open to users (that would require tenancy and such). If you can build a strong case for it we can prioritize accordingly xd
[14:12:48] <isaacj_>	 :thumbs up:
[21:50:56] <bstorm>	 !log tools.iabot truncated 995G Workers/Worker2.out T288276
[21:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[21:50:59] <stashbot>	 T288276: 2021-08-05: Tools NFS share cleanup - https://phabricator.wikimedia.org/T288276
[21:55:57] <bstorm>	 !log tools.khanamalumat truncating 30GB khanamalumat/shaher.err file
[21:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.khanamalumat/SAL
[22:03:57] <bstorm>	 Cyberpower678: are you around?
[22:26:37] <bd808>	 @harej: are you around? bstorm is trying to find Cyberpower678. IABot is writing gigabytes of error log per minute which is in turn filling NFS for all of Tooforge.
[22:26:56] <wm-bb>	 <harej> It is past midnight where he currently is
[22:27:10] <bstorm>	 heh.
[22:27:14] <bd808>	 we need to stop the bot if nobody can fix it :/
[22:27:17] <wm-bb>	 <harej> Please do
[22:27:31] <bstorm>	 Ok. I'll chmod the file for now
[22:28:09] <bd808>	 thanks bstorm and @harej
[22:30:18] <bstorm>	 !log tools.iabot chmodded the file Worker2.out to read only to stop the bleeding T288300
[22:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[22:30:21] <stashbot>	 T288300: IAbot is writing loads of text to Toolforge NFS at a high rate - https://phabricator.wikimedia.org/T288300
[22:31:33] <bstorm>	 Not good enough. Gotta chown it too
[22:33:53] <bstorm>	 !log tools.iabot chowned the file Worker2.out to cyberpower678's account so take might work to revert things T288300
[22:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.iabot/SAL
[22:34:13] <bstorm>	 That seems to have slowed it down :)
[23:04:05] <bstorm>	 !log tools extended docker registry volume to 120GB T288229
[23:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:04:10] <stashbot>	 T288229: tools-docker-registry almost out of disk space - https://phabricator.wikimedia.org/T288229
[23:50:56] <bstorm>	 !log tools rebooting the docker registry T288229
[23:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:51:01] <stashbot>	 T288229: tools-docker-registry almost out of disk space - https://phabricator.wikimedia.org/T288229
[23:51:37] * bstorm apologizes to anyone running `webservice` at this very moment