[04:49:16] I'm occasionally seeing what seems like DNS failures
[04:49:39] $ curl 'https://release-monitoring.org/api/v2/versions/?project_id=7635'
[04:49:40] curl: (6) Could not resolve host: release-monitoring.org
[04:50:11] I ran the same request 10 times, failed twice
[06:34:21] in other news, `webservice restart` doesn't seem to be working, I filed https://phabricator.wikimedia.org/T294888
[06:51:43] !log tools.newusers Rebuilt with Rust 1.56.1 and restarted
[06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.newusers/SAL
[06:51:47] !log tools.logo-test Rebuilt with Rust 1.56.1 and restarted
[06:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.logo-test/SAL
[06:51:52] !log tools.shorturls Rebuilt with Rust 1.56.1 and restarted
[06:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.shorturls/SAL
[06:51:56] !log tools.ircservserv Rebuilt with Rust 1.56.1 and restarted
[06:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ircservserv/SAL
[09:43:28] legoktm: webservice restart issue sounds like https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/6Z3EXK3OOGZPHJZ6ZBSBYIYCVDGVMYDP/ ?
[11:01:25] I need to do that
[11:45:35] !log admin [codfw1dev] downgrade kernel on cloudgw2001-dev/2002-dev (T294853, T291813)
[11:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:45:39] T294853: 2021-11-02 Cloud VPS network outage - https://phabricator.wikimedia.org/T294853
[12:41:27] !log paws deployed https://github.com/toolforge/paws/pull/92 (4961101972be6f27c1c96e327a82457346965f32) T150098
[12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:41:30] T150098: Render markdown / rst in PAWS public - https://phabricator.wikimedia.org/T150098
[16:06:13] !log codesearch manually triggered codesearch-write-config job to pick up fix for T294915 immediately
[16:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[16:06:16] T294915: CodeSearch's "deployed" profile hasn't yet worked out that WikiLambda is now branched for production - https://phabricator.wikimedia.org/T294915
[16:48:27] majavah: I think I asked you to set the codesearch monitoring to wait 15 min before alerting, could we extend that to 20 min? seems like the restart process takes 17 minutes.
[16:51:18] legoktm: done
[16:51:28] ty :)
[17:00:20] * dcaro back
[17:22:24] !log admin [codfw1dev] installing keepalived 2.1.5 from buster-backports on cloudgw2001-dev/2002-dev (T294956)
[17:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:22:30] T294956: keepalived: flap when rebooting servers - https://phabricator.wikimedia.org/T294956
[17:34:20] majavah: also, is it possible to have alerting if a systemd unit fails, like we do in prod?
[17:35:01] specifically the codesearch-write-config unit (https://phabricator.wikimedia.org/T294915) but even if it was all units that would be fine too
[17:35:37] if not I'll have the codesearch web app export the systemd status as a bool in our current metrics endpoint
[17:36:27] T294958
[17:36:28] T294958: Add monitoring+alerting for codesearch-write-config - https://phabricator.wikimedia.org/T294958
[17:50:12] legoktm: prometheus-node-exporter collects systemd stats, so that should be set now
[17:50:36] majavah: for all units or just that specific one?
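(Editor's aside: a minimal sketch of how one might verify the systemd metric mentioned above is actually exposed on the codesearch host. The metric and label names are those of node_exporter's optional systemd collector, and localhost:9100 is the exporter's usual default; neither is confirmed for this particular instance.)

```bash
# Hedged sketch: check whether prometheus-node-exporter exposes systemd state
# for the unit in question. The "failed" series is 1 when the unit has failed,
# 0 otherwise; an alert rule would then fire on that series being 1.
curl -s http://localhost:9100/metrics \
  | grep 'node_systemd_unit_state{name="codesearch-write-config.service"' \
  | grep 'state="failed"'
```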
[17:50:50] just that specific one
[17:51:19] would it be excessive/problematic/difficult to do it for all?
[17:51:46] T287309
[17:51:46] T287309: Some systemd services appear to be broken on all VMs - https://phabricator.wikimedia.org/T287309
[17:52:10] ack, fair enough
[17:52:13] thanks :)
[20:23:47] !log tools.wd-image-positions deployed 620e07e107 (refactoring, no functional change hopefully)
[20:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
[22:44:22] um, in /data/project/shared, i had a directory named factor, which got moved to factor-unmaintained and now contains something totally different and it's not by me. what happened?!
[22:44:32] (toolforge)
[22:46:59] gifti: /data/project/shared is a free-for-all space and not really a good location for anything important. There was some cleanup of files there late last week when the NFS share was alerting for being >85% full. I do not know if the files you are talking about were purged during that cleanup or not.
[22:47:43] sigh
[22:48:16] As far as I understand things shared was originally meant to host the shared clones of MediaWiki and pywikibot.
[22:50:43] gifti: There may be a backup that could be recovered if the content was recently removed and is difficult to recreate. If that's something you are interested in pursuing, a phabricator task tagged to #wmcs-kanban and #toolforge would be the way to ask folks to look.
[22:51:35] thx :')
[22:56:56] bd808: oh, I just set up https://wikitech.wikimedia.org/wiki/Tool:Rustup the other day - should I *not* have put that in shared?
[22:58:08] is it cloud-services-team (kanban)?
[22:58:56] Yes
[23:00:23] that sounds like a valid use of shared
[23:01:43] probably
[23:02:33] legoktm: is there a README in the directory? Or just some files randomly spewed there? That's one of the big problems with shared, not knowing when/if/how to get it cleaned up.
[23:02:52] I can create a README pointing to the wiki page
[23:03:29] you could speak with the owners
[23:03:29] but it's just copies of the rust toolchain
[23:05:21] a subdirectory in /data/project/rustup with the appropriate permissions set would work just as well
[23:05:33] ^ that
[23:05:35] ah, the directory wasn't moved, there was a separate directory factor-unmaintained
[23:06:26] and that would be clearly scoped and also something that the nearly done "clean up when a tool is deleted" scripts would handle
[23:07:19] the biggest pain for Toolforge admins is NFS filling up and nobody being responsive to requests to purge junk that they have collected
[23:07:56] it is sadly common for us to make a Phabricator task and also send direct emails and get no response at all from the tool maintainers
[23:08:50] If we had infinite disk space nobody would care, but we don't. Disk storage is our most constrained resource.
[23:09:54] ok, I can move it under the tool's hierarchy then
[23:16:55] gifti: while you are around, if factor-unmaintained is junk could you delete it? You are the owner and it looks like you made that directory ~2 months ago.
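(Editor's aside: a rough sketch of the "subdirectory in /data/project/rustup" suggestion above, assuming the standard Toolforge tool layout. The subdirectory name "toolchains" and the exact permission bits are hypothetical, not taken from the conversation.)

```bash
# Run as the rustup tool account (e.g. after `become rustup` on a bastion):
mkdir -p /data/project/rustup/toolchains     # hypothetical subdirectory name under the tool's home
chmod 2775 /data/project/rustup/toolchains   # setgid + group-writable for maintainers, world-readable for other tools
```

Keeping the content under the tool's own directory also means the "clean up when a tool is deleted" scripts mentioned above would handle it automatically.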