[09:01:32] jynus, claime: I'm switching the codfw URL downloaders to new VMs running bullseye, this has been tested and no impact is expected, but an FYI to you just in case
[09:01:44] moritzm: Thanks for the heads up
[09:01:45] thanks for the heads up
[09:15:59] I have the suspicion that mwlog1002 is not rotating its logs: https://phabricator.wikimedia.org/P48705
[09:16:32] could someone have a deeper look before it runs out of space? maybe the host was recently restarted or had maintenance done on it
[09:24:14] it was updated to bullseye four days ago
[09:27:58] I'll take a look
[09:30:48] ah, then some change on rotation could have affected, as the archival (older) logs seem to have good sizes
[09:31:04] *affected it
[09:32:27] thank you, godog!
[09:32:42] looks like deletions stopped around May 18th https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mwlog1002&var-datasource=thanos&var-cluster=misc&viewPanel=28&from=now-30d&to=now
[09:34:14] oh, so not related to the upgrade?
[09:34:23] deletions/rotations that is
[09:34:34] doubt it
[09:34:52] then some puppet or software upgrade maybe
[09:40:40] yeah 4da641b01234 is rotating api.log only
[09:41:09] so there's this
[09:41:10] -rw-r--r-- 1 udp2log udp2log 2.4T Jun 5 09:40 xff.log
[09:41:17] -rw-r--r-- 1 udp2log udp2log 3.1T Jun 5 09:40 JobExecutor.log
[09:46:04] obviously not enough space left to keep/compress the files now, I'll see if there's a simple way to keep only the recent bits
[09:49:03] <_joe_> oh sigh
[09:49:08] not a big deal, to be fair, as long as a new file is created regularly, and if needed I can contribute extra space
[09:49:17] <_joe_> jynus: it's actually a big deal
[09:49:25] to compress it and return it
[09:49:33] <_joe_> if we have to remove those files
[09:50:21] <_joe_> godog: let me know what you manage to do
[09:52:52] <_joe_> godog: if I had to choose, save jobexecutor.log
[09:53:01] <_joe_> but it's ok if we just have to take the hit
[09:53:38] ok, task is https://phabricator.wikimedia.org/T338127
[09:53:56] I was thinking of keeping, say, the last 100G of jobexecutor.log and xff.log
[09:54:29] that should be enough breathing room to deal with the problem
[09:55:49] it won't be line-wise truncation, though my understanding is that fallocate can actually keep only the end of a file (in place)
[09:59:03] unless there are objections I'll proceed
[09:59:34] <_joe_> godog: yeah it's ok
[09:59:40] I'm ok with that, better than a full disk, but maybe serviceops has comments
[09:59:59] <_joe_> jynus: many, but nothing against what godog is doing :P
[10:00:31] yeah, I was thinking along the lines of "X log is more valuable than Y"
[10:12:36] I like seeing the ongoing criticals being cleaned up as the Monday progresses :-D, thanks to everybody who helped
[14:01:47] not sure what's going on here, but my PCC runs keep failing fast, and others are clearly having success in the logs between my failures
[14:01:52] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41539/console
[14:02:08] ^ is a success that happened for someone else between two of my failures
[14:02:31] but mine this morning all look like
[14:02:34] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41542/console
[14:03:23] I wonder what's up with CHANGE_PRIVATE vs CHANGE
[14:03:24] how are you launching the pcc runs? via the script in operations/puppet?
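
(Editor's aside: per the log, utils/pcc in operations/puppet appears to be a thin wrapper around the operations-puppet-catalog-compiler Jenkins job seen in the console URLs above, using python-jenkinsapi. The sketch below only illustrates that kind of trigger under those assumptions; it is not the actual script, and the credentials and change number are placeholders.)

from jenkinsapi.jenkins import Jenkins

JENKINS_URL = "https://integration.wikimedia.org/ci"

def trigger_pcc(change_number: str, username: str, api_token: str) -> None:
    """Queue an operations-puppet-catalog-compiler run for a Gerrit change."""
    jenkins = Jenkins(JENKINS_URL, username=username, password=api_token)
    # CHANGE is the parameter meant for the Gerrit change number;
    # CHANGE_PRIVATE is a different field (pasting the number there is
    # what made the runs above fail fast, as diagnosed just below).
    jenkins.build_job("operations-puppet-catalog-compiler",
                      params={"CHANGE": change_number})

if __name__ == "__main__":
    trigger_pcc("123456", "example-user", "example-api-token")
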
[14:03:33] oh
[14:03:39] the difference seems to be that your runs have CHANGE_PRIVATE set and others have CHANGE
[14:04:10] yeah, I guess my brain's just not working, and I keep pasting the gerrit number into the wrong field, thanks!
[14:04:23] (although the error message for that case could use some work heh)
[14:05:15] definitely
[14:05:33] but yeah I always use utils/pcc nowadays
[14:18:32] it's not very bookworm-friendly :)
[14:19:02] (python3-jenkinsapi was replaced by python3-jenkins, etc)
[14:19:17] err, just "python-jenkins"
[14:19:35] bblack: just to mention it, the other way is to use the Hosts header in the commit message and then comment check-experimental
[14:19:47] *check experimental (no dash)
[14:20:13] ah, I didn't notice since I upgraded to bookworm here and the script works
[14:20:21] that's odd
[14:20:44] maybe it keeps old pip modules around?
[14:22:09] it works via pip anyways, just not via apt
[14:22:30] (pipenv technically, but whatever)
[14:23:02] yeah that's what happened here, python3-jenkinsapi was left alone/installed
[14:24:00] "pipenv install jenkinsapi" worked, but "pipenv run ./utils/pcc.py" still fails complaining about jenkinsapi
[14:24:12] anyways, I've reached my rabbit hole limit on this for today :P
[14:29:30] https://graydon2.dreamwidth.org/307291.html is kinda fascinating and unique. Not often you hear someone admit + accept that kind of thing.
[15:05:57] bblack: not that I grokked it all, as someone who is not well versed in language design, but still an interesting read
[15:07:57] yeah you can gloss over the Rust specifics really. The interesting part is how he's reacted/reacting to the situation.
[15:22:07] definitely
[15:55:55] effie, _joe_: can we enable parsoid cache warming jobs on enwiki tomorrow?
[16:04:31] --> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/927236
[16:04:56] <_joe_> duesen: what's the overall status with those jobs? what have we migrated up to now?
[16:05:18] <_joe_> did we disable pregeneration via restbase?
[16:05:26] _joe_: small and medium wikis (according to dblist) + frwiki.
[16:05:37] <_joe_> so you want to go enwiki now? :P
[16:05:52] We can't disable pre-generation via restbase as long as PCS is still relying on parsoid in restbase.
[16:06:10] Also, I don't think there is a way to do this per-wiki in restbase. It's all or nothing.
[16:07:12] Migrating PCS to calling the MW endpoints has some complications wrt redirect handling, iiuc.
[16:07:46] My thinking was that we can deal with that later. For now, we just need the jobs to run, so we can switch VE to direct mode, so it stops using restbase.
[16:07:56] yes, I want to go for enwiki now.
[16:08:09] I want to see how much that really is.
[16:28:00] _joe_: can we work to enable the cache warming jobs without worrying about disabling pre-generation in restbase for now? It would make my life a lot easier if these two issues were not tied to each other.
[16:33:57] <_joe_> I'm not sure what pre-generation in restbase has to do with PCS needing to use restbase. I would've assumed that restbase, when called from PCS, would call parsoid for an inexpensive api request when traffic hits
[16:34:29] <_joe_> but - we can probably proceed this way, I'm just worried by the spike of computing time we'll get when we completely switch
[16:34:53] <_joe_> anyways, yes, let's try enwiki tomorrow morning.
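
(Editor's aside, returning to the mwlog1002 disk-space thread from the morning: a minimal sketch of the in-place truncation godog describes at 09:55, keeping roughly the last 100G of a huge log by collapsing its leading byte range with fallocate(1). This is illustrative only, not the commands actually run; it assumes util-linux fallocate, a filesystem that supports FALLOC_FL_COLLAPSE_RANGE such as ext4 or XFS, a writer that appends with O_APPEND, and the log paths are guesses.)

import os
import subprocess

KEEP_BYTES = 100 * 1024**3  # keep roughly the last 100G, per the discussion

def collapse_head(path: str) -> None:
    """Remove the beginning of a file in place, keeping only its tail."""
    size = os.stat(path).st_size
    cut = size - KEEP_BYTES
    if cut <= 0:
        return  # already small enough, nothing to do
    # offset and length must be multiples of the filesystem block size
    blksize = os.statvfs(path).f_bsize
    cut -= cut % blksize
    # Drop the first `cut` bytes without rewriting the kept tail. Note this
    # is not line-wise truncation: the new start of the file is mid-line.
    subprocess.run(
        ["fallocate", "--collapse-range", "--offset", "0",
         "--length", str(cut), path],
        check=True,
    )

if __name__ == "__main__":
    for log in ("/srv/mw-log/xff.log", "/srv/mw-log/JobExecutor.log"):
        collapse_head(log)
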
[16:35:19] <_joe_> worst case scenario we revert, and we'll have some jobs piling up
[16:35:27] <_joe_> that's not a serious concern
[16:35:47] <_joe_> Amir1: I'd like you to be around as well to help keep an eye on the datastores too
[16:37:12] I can be around
[16:37:19] When?
[18:08:44] <_joe_> Amir1: the morning backport window?
[18:09:05] I can make it work
[18:09:40] <_joe_> We can do it later
[18:26:43] _joe_: I suppose we could have a "pure proxy" mode for parsoid on restbase. Will need to talk to Yiannis about that.
[18:27:07] I am not around for the morning backport, I was going to suggest the afternoon.
[18:30:04] I put it on the deployment calendar for the afternoon slot
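
(Editor's aside on the alternative PCC route mentioned at 14:19: rather than triggering the Jenkins job directly, the hosts to compile for can be listed in a Hosts footer of the puppet commit message, and the run is then started by commenting "check experimental" (no dash) on the change in Gerrit. The footer below is a hedged illustration; the host name is just an example and the exact matching syntax is documented with the compiler.)

    <subject line of the puppet change>

    <optional longer description>

    Hosts: mwlog1002.eqiad.wmnet
    Change-Id: I0123456789abcdef0123456789abcdef01234567

Once the change is uploaded, commenting "check experimental" on it triggers the compiler run for the listed hosts.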