[09:29:49] Krinkle: ack, thanks! yeah they can stay in place for now, can still be useful to speed up dashboards if needed (you can query those from thanos too)
[09:33:37] hi, a heads-up from -operations: in scap backport it seems like the sync-proxies/sync-apaches/sync-canaries steps take much longer than I remember. proxies/canaries are 3-4 minutes, while sync apaches took 13 minutes. is that expected?
[09:36:36] claime: ^
[09:37:02] also FYI I'm going ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/860573 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/860574
[09:41:54] kostajh: not really afaik. jnuche are we doing anything k8s related in these steps?
[09:44:30] kostajh: yes, see yesterday's email to Wikitech-l
[09:44:36] "[Wikitech-l] Building of mw-on-k8s images during Scap backports"
[09:45:17] AFAIK it's expected
[09:48:06] <_joe_> volans: no
[09:48:11] Yeah but no
[09:48:32] <_joe_> what is expected to take longer is a specific step separated from the ones kostajh was talking about
[09:48:43] It's not supposed to impact "regular" deployment once build and push is done
[09:48:57] <_joe_> kostajh: do you have a paste of the scap run?
[09:48:57] claime: older operations syncing to non-k8s hosts (such as sync-proxies or sync-apaches) haven't changed recently
[09:49:06] so no K8s parts there
[09:49:38] sure, one moment
[09:50:32] sorry then, but wasn't totally clear to me this was unrealted
[09:50:34] *unrelated
[09:50:36] <_joe_> kostajh: I just want to understand if it's the k8s stuff impeding you or not
[09:50:51] _joe_: https://phabricator.wikimedia.org/P42428
[09:50:55] no, it is not the k8s stuff
[09:51:09] the steps that seemed to take longer than usual are the sync proxies/apaches/canaries steps
[09:51:13] <_joe_> so yeah, something fishy
[09:51:22] <_joe_> but not on the k8s side of things lol
[09:51:22] I'm doing another backport now, so I can compare with what happens this time
[09:51:55] <_joe_> 09:12:42 Started sync-check-canaries
[09:51:57] <_joe_> 09:16:03 sync-canaries: 100% (in-flight: 0; ok: 9; fail: 0; left: 0)
[09:52:04] <_joe_> wow that is bad
[09:53:34] why does it say '09:09:26 MediaWiki wmf/1.40.0-wmf.13 successfully checked out.' when that should be done by the systemd timer?
[09:58:08] taavi: that's just where I started selecting text for the paste. There were messages about security patches above that, and I lazily didn't want to edit everything out.
[09:58:20] ahh
[10:00:28] <_joe_> ahhh no wait
[10:00:34] actually taavi might be onto something, I just checked and the systemd timer failed (partially) last night
[10:00:34] <_joe_> tonight the train presync failed
[10:00:43] <_joe_> yeah I just checked too :P
[10:00:48] <_joe_> so we did sync the train
[10:00:53] yeah, it never got to the point where it was syncing out
[10:00:59] <_joe_> the docker images were deployed correctly OTOH
[10:01:45] strangely enough it failed because apparently the helm diff plugin was missing?
[10:02:39] is the log available somewhere where I can see it?
[10:03:45] <_joe_> jnuche: uhm that's a recurring problem with helmfile but I think it's just a bogus error message
[10:04:16] <_joe_> jnuche: can you open a task so someone from serviceops can investigate?
[10:05:17] iirc our helm plugin setup needs some special env variable to be set up to work, and the timer might not have it available
[10:06:37] taavi: not sure if there's a place where you can check, I can forward you the timer email if you're interested
[10:07:06] also, the timer was working previously, so if the problem is the env var, that's changed recently
[10:07:28] _joe_: will do, I still want to try to run the presync again after the deployment window is over
[10:07:38] <_joe_> ack
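A minimal sketch of how the env-variable theory could be checked, assuming the unit involved is train-presync.service (named later in this log) and that the plugin path resolves through Helm 3's standard HELM_* variables; the drop-in value at the end is illustrative only, not the actual WMF configuration:

    # Compare the environment the systemd unit gets with an interactive shell.
    systemctl show -p Environment,EnvironmentFiles,User train-presync.service
    helm env           # prints HELM_PLUGINS, HELM_DATA_HOME, etc. for the current shell
    helm plugin list   # the "diff" plugin that helmfile relies on should appear here
    # If the plugin only resolves via a variable set in interactive profiles, a
    # drop-in could pin it for the timer's service (path shown is hypothetical):
    #   [Service]
    #   Environment="HELM_PLUGINS=/usr/share/helm/plugins"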
[10:48:18] _joe_: sync-apaches took 2m 22s this time, that's much better
[10:48:32] <_joe_> that's what I expected tbh
[10:48:45] <_joe_> basically earlier you synced a whole new version, that takes its time
[10:49:14] but the k8s image building took 10m 40s
[10:49:40] whereas in the slower sync earlier, it took 45s
[10:50:14] I added the second sync output to https://phabricator.wikimedia.org/P42428
[10:50:18] <_joe_> uhm interesting, I guess we went over the bump and rebuilt all layers
[10:58:15] CRITICAL - degraded: The following units failed: geoip_update_main.service < That failed last night, is it safe to re-run?
[10:58:22] (on puppetmaster1001)
[10:59:28] claime: I think that this is due to an expired licence. Olja has sent a message to Data Engineering (via Slack) that she's looking into it.
[11:00:13] btullis: It seems due to that, yes: the logs spit a 403 Invalid product ID or subscription expired
[11:00:14] Maybe a task and a tag and acknowledge for now, and I can add a comment?
[11:00:36] sgtm
[11:00:45] Thanks ever so much.
[11:06:12] Not acknowledging the general checksystemd state alarm though
[11:10:08] btullis: I'll reset the failed state so we get an alert if another systemd service starts failing
[11:10:19] I'll switch my ack to a sticky one
[11:11:42] well not sticky, but to a persistent comment at least
[11:12:43] Perfect, thanks.
[11:15:16] if that's expected to last a few days it might be easier to just disable the timer via puppet
[11:16:29] Ack, thanks volans. I'll check with Olja later today and see when we can expect it.
[11:16:59] +1
[11:17:04] thx
[11:35:57] _joe_: About the train-presync service, can I reset the state so we get an alert if it fails again?
[11:36:47] And honestly, same question for most of what's in https://alerts.wikimedia.org/?q=%40state%3Dactive&q=severity%3Dcritical
[11:37:48] claime: I'll pick up the stat1004 alert. Thanks.
[11:38:11] And the an-worker1148 power supply.
[11:38:34] Thanks btullis, appreciated
[11:38:46] btullis: unrelated, could you check my ping in -operations please?
[11:39:19] volans: Ack. Thanks.
[11:39:43] you mean it's ok to puppet-merge?
[11:41:55] Confirmed it's ok in the other channel. Apologies for being vague here.
[11:42:11] no prob, thx
[11:50:57] <_joe_> claime: go on with train-presync
[11:51:17] ack
[12:22:56] moritzm: package_builder_Clean_up_build_directory.service is failing because /var/cache/pbuilder/build/cow.99944/sys/devices/virtual/tty/tty0/active is being held, I suppose it's a chroot since there's a tmpfs mount for /var/cache/pbuilder/build/cow.99944/dev/shm
[12:23:48] It's been failing for 22 days trying to clean what's older than 14 days (it still does, but if it's not a problem that it doesn't clean everything, it shouldn't crit imo)
[12:31:31] yeah, I'll make a patch to make it not fail all the time, there's no reason this should alert
[12:34:41] cool thanks <3
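A rough sketch of the kind of change being suggested, assuming the cleanup boils down to a find over /var/cache/pbuilder/build with the 14-day retention mentioned above; the real script is managed in puppet and may differ, and the -xdev and "|| true" parts are the illustrative additions:

    # Remove build directories older than 14 days, but don't let a leftover
    # chroot mount (e.g. the tmpfs on .../cow.99944/dev/shm) fail the unit.
    find /var/cache/pbuilder/build -xdev -mindepth 1 -maxdepth 1 -mtime +14 \
        -exec rm -rf {} + || true
    # Trade-off: "|| true" swallows every error, so a genuinely broken cleanup
    # would no longer alert; logging the failure instead is another option.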
[14:19:52] what is the process for renaming hosts? is it https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging?
[14:20:46] urandom: correct
[14:21:11] it also requires a task for dcops for relabeling, but that should already be in that list
[14:21:51] volans: yup, last item
[16:40:03] moritzm: there's a pending puppet merge from you, is it ok to merge?
[16:40:11] "buster updates"
[16:41:03] oh, sorry, merging now
[16:41:22] or you merge along, either way
[16:47:21] 👍 merged :)
[16:47:50] thx
[18:34:08] volans: just wanted to let you know I deleted "weird" IPs from netbox, phab1001-vcs.eqiad.wmnet, phab2001-vcs.codfw.wmnet, marked with role VIP.. all that stuff is gone now. I thought you might like it because.. special cases
[18:34:28] that was possible since phab1001 has been gone since yesterday
[18:34:51] also ran the sync
[19:48:40] go vote for one of the sound logos. you might have to hear it for the next decades :p (I voted: https://commons.wikimedia.org/wiki/File:Wikimedia_Sound_Logo_Finalist_OZ85.wav diggi di da)
[19:49:11] you just reorder the sounds on the wiki page itself https://commons.wikimedia.org/wiki/Commons:Sound_Logo_Vote
[20:33:19] mutante: great, thanks for the heads up!
[20:36:29] :)
[22:54:07] jhathaway: pm?
[22:54:24] ?
[22:56:08] jhathaway: see your PMs. You should have a window with my nick.