[04:49:16] I'm occasionally seeing what seems like DNS failures
[04:49:39] $ curl 'https://release-monitoring.org/api/v2/versions/?project_id=7635'
[04:49:40] curl: (6) Could not resolve host: release-monitoring.org
[04:50:11] I ran the same request 10 times, failed twice
[06:34:21] in other news, `webservice restart` doesn't seem to be working, I filed https://phabricator.wikimedia.org/T294888
[06:51:43] !log tools.newusers Rebuilt with Rust 1.56.1 and restarted
[06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.newusers/SAL
[06:51:47] !log tools.logo-test Rebuilt with Rust 1.56.1 and restarted
[06:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.logo-test/SAL
[06:51:52] !log tools.shorturls Rebuilt with Rust 1.56.1 and restarted
[06:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.shorturls/SAL
[06:51:56] !log tools.ircservserv Rebuilt with Rust 1.56.1 and restarted
[06:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ircservserv/SAL
[09:43:28] legoktm: webservice restart issue sounds like https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/6Z3EXK3OOGZPHJZ6ZBSBYIYCVDGVMYDP/ ?
[11:01:25] I need to do that
[11:45:35] !log admin [codfw1dev] downgrade kernel on cloudgw2001-dev/2002-dev (T294853, T291813)
[11:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:45:39] T294853: 2021-11-02 Cloud VPS network outage - https://phabricator.wikimedia.org/T294853
[12:41:27] !log paws deployed https://github.com/toolforge/paws/pull/92 (4961101972be6f27c1c96e327a82457346965f32) T150098
[12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:41:30] T150098: Render markdown / rst in PAWS public - https://phabricator.wikimedia.org/T150098
[16:06:13] !log codesearch manually triggered codesearch-write-config job to pick up fix for T294915 immediately
[16:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codesearch/SAL
[16:06:16] T294915: CodeSearch's "deployed" profile hasn't yet worked out that WikiLambda is now branched for production - https://phabricator.wikimedia.org/T294915
[16:48:27] majavah: I think I asked you to set the codesearch monitoring to wait 15 min before alerting, could we extend that to 20 min? seems like the restart process takes 17 minutes.
[16:51:18] legoktm: done
[16:51:28] ty :)
[17:00:20] * dcaro back
[17:22:24] !log admin [codfw1dev] installing keepalived 2.1.5 from buster-backports on cloudgw2001-dev/2002-dev (T294956)
[17:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:22:30] T294956: keepalived: flap when rebooting servers - https://phabricator.wikimedia.org/T294956
[17:34:20] majavah: also, is it possible to have alerting if a systemd unit fails, like we do in prod?
[17:35:01] specifically the codesearch-write-config unit (https://phabricator.wikimedia.org/T294915) but even if it was all units that would be fine too
[17:35:37] if not I'll have the codesearch web app export the systemd status as a bool in our current metrics endpoint
[17:36:27] T294958
[17:36:28] T294958: Add monitoring+alerting for codesearch-write-config - https://phabricator.wikimedia.org/T294958
[17:50:12] legoktm: prometheus-node-exporter collects systemd stats, so that should be set now
[17:50:36] majavah: for all units or just that specific one?
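(Editor's aside: a minimal sketch of how one might verify the systemd metric mentioned above is actually exposed on the codesearch host. The metric and label names are those of node_exporter's optional systemd collector, and localhost:9100 is the exporter's usual default; neither is confirmed for this particular instance.)

```bash
# Hedged sketch: check whether prometheus-node-exporter exposes systemd state
# for the unit in question. The "failed" series is 1 when the unit has failed,
# 0 otherwise; an alert rule would then fire on that series being 1.
curl -s http://localhost:9100/metrics \
  | grep 'node_systemd_unit_state{name="codesearch-write-config.service"' \
  | grep 'state="failed"'
```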
[17:50:50] just that specific one
[17:51:19] would it be excessive/problematic/difficult to do it for all?
[17:51:46] T287309
[17:51:46] T287309: Some systemd services appear to be broken on all VMs - https://phabricator.wikimedia.org/T287309
[17:52:10] ack, fair enough
[17:52:13] thanks :)
[20:23:47] !log tools.wd-image-positions deployed 620e07e107 (refactoring, no functional change hopefully)
[20:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wd-image-positions/SAL
[22:44:22] um, in /data/project/shared, i had a directory named factor, which got moved to factor-unmaintained and now contains something totally different and it's not by me. what happened?!
[22:44:32] (toolforge)
[22:46:59] gifti: /data/project/shared is a free-for-all space and not really a good location for anything important. There was some cleanup of files there late last week when the NFS share was alerting for being >85% full. I do not know if the files you are talking about were purged during that cleanup or not.
[22:47:43] sigh
[22:48:16] As far as I understand things shared was originally meant to host the shared clones of MediaWiki and pywikibot.
[22:50:43] gifti: There may be a backup that could be recovered if the content was recently removed and is difficult to recreate. If that's something you are interested in pursuing, a phabricator task tagged to #wmcs-kanban and #toolforge would be the way to ask folks to look.
[22:51:35] thx :')
[22:56:56] bd808: oh, I just set up https://wikitech.wikimedia.org/wiki/Tool:Rustup the other day - should I *not* have put that in shared?
[22:58:08] is it cloud-services-team (kanban)?
[22:58:56] Yes
[23:00:23] that sounds like a valid use of shared
[23:01:43] probably
[23:02:33] legoktm: is there a README in the directory? Or just some files randomly spewed there? That's one of the big problems with shared, not knowing when/if/how to get it cleaned up.
[23:02:52] I can create a README pointing to the wiki page
[23:03:29] you could speak with the owners
[23:03:29] but it's just copies of the rust toolchain
[23:05:21] a subdirectory in /data/project/rustup with the appropriate permissions set would work just as well
[23:05:33] ^ that
[23:05:35] ah, the directory wasn't moved, there was a separate directory factor-unmaintained
[23:06:26] and that would be clearly scoped and also something that the nearly done "clean up when a tool is deleted" scripts would handle
[23:07:19] the biggest pain for Toolforge admins is NFS filling up and nobody being responsive to requests to purge junk that they have collected
[23:07:56] it is sadly common for us to make a Phabricator task and also send direct emails and get no response at all from the tool maintainers
[23:08:50] If we had infinite disk space nobody would care, but we don't. Disk storage is our most constrained resource.
[23:09:54] ok, I can move it under the tool's hierarchy then
[23:16:55] gifti: while you are around, if factor-unmaintained is junk could you delete it? You are the owner and it looks like you made that directory ~2 months ago.
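(Editor's aside: a rough sketch of the "subdirectory in /data/project/rustup" suggestion above, assuming the standard Toolforge tool layout. The subdirectory name "toolchains" and the exact permission bits are hypothetical, not taken from the conversation.)

```bash
# Run as the rustup tool account (e.g. after `become rustup` on a bastion):
mkdir -p /data/project/rustup/toolchains     # hypothetical subdirectory name under the tool's home
chmod 2775 /data/project/rustup/toolchains   # setgid + group-writable for maintainers, world-readable for other tools
```

Keeping the content under the tool's own directory also means the "clean up when a tool is deleted" scripts mentioned above would handle it automatically.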