[07:53:13] Reminder that today at 14:00UTC we will upgrade codfw row B switches - https://phabricator.wikimedia.org/T327991 [08:16:03] 🌐🎉🌐🎉🌐 [08:31:02] Good morning, I have a systemd timer question. Is it possible to allow a non root / unix group to start a systemd-timer? Trying so I get: [08:31:04] hashar@deploy1002:~$ systemctl start train-presync.timer [08:31:04] Failed to start train-presync.timer: Access denied [08:31:57] I have retrieved the ExecStart from the service (train-presync.service) and ran it manually [09:14:24] hashar: a sudoers rule in data.yam [09:14:27] *data.yaml [09:18:33] hashar: I can run it if you need me to in the meantime [09:19:15] Oh you ran it manually from the exec, ok [10:16:00] volans: claime: thanks! :] [12:24:59] Feb 21 12:17:19 acmechief1001 systemctl[7535]: systemd-timer-mail-wrapper: error: the following arguments are required: -s/--subject [12:24:59] Feb 21 12:17:19 acmechief1001 systemd[1]: reload-acme-chief-backend.service: Main process exited, code=exited, status=2/INVALIDARGUMENT [12:25:06] jbond: that looks related to 5ad5de58c3 [12:26:37] vgutierrez: ack looking can yuo send me a copy of the email you got? [12:26:41] ill revert for now [12:26:52] jbond: email? [12:27:05] jbond: I got an icinga critical message [12:27:16] oh never mind i guess its something elses ill jump on the box and check [12:27:30] (the change refrenced changed the email message) [12:27:49] it looks like reload-acme-chief-backend is now deprecated [12:28:40] hmm that's using systemd::timer::job [12:29:43] vgutierrez: the error is with the systemd wrapper which sends emails [12:30:46] hmm we're getting now some errors to root@ [12:30:58] Failed to reload acme-chief.service: Access denied [12:31:12] vgutierrez: i think there must have been a race condition, where the systemd timer got updated and ran before the new wrapper script was deployed [12:31:25] that error was me running it manualy [12:31:30] ack [12:31:41] "RECOVERY - Check systemd state on acmechief1001 is OK: OK - running" [12:31:54] i have ran it manually now (with the correct permissions) and it looks good to me [12:32:20] btw, vgutierrez thanks for flagging it out, with the ongoing noisy alerting I would have missed it [12:32:54] no problem [13:37:45] head's up codfw row B upgrade in 20-ish minutes [13:39:12] ack [13:40:46] vgutierrez: I see that the traffic table on https://phabricator.wikimedia.org/T327991 is empty, is it expected? [13:41:53] jynus: does anything need to be done for backup[2005,2008] and dbprov2002 ? [13:41:59] nope [13:42:03] cool [13:42:26] as in, I checked and they are currently idle [13:42:49] elukey: I see that depool needs to be done for the ores servers, should I do it? Is it ok that I do it now? [13:43:12] XioNoX: definitely yes! [13:43:23] otherwise I can do it [13:43:31] I'm on it, don't worry [13:44:02] (done) [13:46:46] godog: for the o11y hosts, I think only kafka-logging[2002,2004] is left to do, should I do it? [13:47:30] XioNoX: thank you, last time I think h.erron did it, but I don't think he's around, I'll do it [13:47:55] thx! [13:48:15] sure np [13:49:00] inflatador, ryankemper: the search platform table is empty, just want to make sure there are no actions needed [13:49:53] (cc gehel because of timezones) [13:50:03] damn, sorry for that, lemme have a look [13:50:06] XioNoX: looking [13:51:12] other than that I think we're good [13:51:20] we should depool them in LVS. Want me to do it right now? 
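On the train-presync timer question at the top: volans' pointer is the usual route, a sudoers privilege attached to the relevant admin group in modules/admin/data/data.yaml. A minimal sketch of what that ends up enabling, assuming the privilege whitelists exactly this systemctl invocation (the privilege line in the comment is an illustration, not the real entry):

```bash
# Assumed data.yaml privilege for the deployers' admin group (illustrative only):
#   privileges: ['ALL = NOPASSWD: /usr/bin/systemctl start train-presync.timer']
# With that in place, a non-root member of the group can run:
sudo systemctl start train-presync.timer

# These need no extra privileges and help confirm what the timer drives:
systemctl status train-presync.timer
systemctl cat train-presync.service   # shows the ExecStart that was run by hand above
```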
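And on the ores depool just above: node-level (LVS) pooling is its own confctl object type, separate from the DNS discovery records that come up later in the log. A sketch, assuming the hosts are pooled via conftool like other LVS-backed services; the ores200[34] names are taken from later in the scroll:

```bash
# depool the two codfw ores hosts ahead of the row B switch work...
confctl select 'name=ores200[34].codfw.wmnet' set/pooled=no

# ...and the inverse once the switches are back up
confctl select 'name=ores200[34].codfw.wmnet' set/pooled=yes
```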
[13:51:51] XioNoX: so considering that codfw is depooled the only thing missing is doh2002 [13:52:36] XioNoX: I'll take care of it now [13:52:54] I can't remember if we need to depool appservers too akosiaris [13:53:29] gehel: sure if you don't mind [13:53:35] you will be faster than me :) [13:54:21] The discovery.datacenter depool does not touch mediawiki services [13:54:29] vgutierrez: thanks! in theory BFD should "depool" it automatically in less than a second, but cleaner that way [13:55:52] XioNoX: done [13:55:58] thanks! [13:56:19] claime: geoip DNS depool of codfw should have done its magic and send appserver-ro and appservers-rw traffic to eqiad [13:56:26] claime: same for api-(ro|rw) [13:56:29] Ah you did a DNS depool, ack [13:56:34] <3 [13:56:35] I didn't [13:57:15] hmmm [13:57:24] nobody depooled codfw so far, should I' [13:57:32] yeah, I think you should [13:57:51] hi, I'm just getting online, anything I can help with? [13:57:53] we depooled the a/a services with the cookbook this morning, but service by service and not touching mediawiki [13:58:16] A global depool wasn't done [13:58:48] https://gerrit.wikimedia.org/r/c/operations/dns/+/890822 [13:59:21] +1 [13:59:55] thank you [14:00:42] running authdns-update right now [14:01:40] done [14:01:59] vgutierrez: so better wait 10min from now, right? so clients are redirected to other sites? [14:02:29] XioNoX: yes [14:02:41] cool [14:03:00] everybody, other than that is there anything left in the depools? [14:05:37] I take that as a no :) [14:05:55] +1 I think we are go [14:05:55] waiting for de dns propagation I guess [14:06:22] let me check the traffic shift [14:06:27] claime: FTR, https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=codfw&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All shows that appservers in codfw are idling since 09:16 UTC this morning [14:07:11] same for api ones [14:07:28] Hmm I'll check sth [14:07:33] interesting, I thought it wasn't a depool? [14:07:40] (the one before, I mean) [14:08:33] jynus: https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1&from=now-30m&to=now&viewPanel=28 traffic is going down [14:08:33] Well yeah that's why I'm surprised [14:09:09] vgutierrez: nice [14:09:34] Ah, I know [14:09:50] it was run *before* I merged the mw A/A exclusion [14:10:04] 2023-02-21 09:14:22,099 j.ayme 1838609 [INFO] Setting pooled=False for tags: {'dnsdisc': '(appservers-ro)', 'name': 'codfw'} [14:10:31] Which makes me think maybe I shouldn't exclude them statically, and only when it's run for dc switching [14:15:32] alright, let's go? [14:16:17] I updated topic in case we have some rogue deployer [14:16:27] (we shouldn't) [14:16:55] go go go [14:17:23] 🚀 [14:18:18] keeping an eye on the console of the master node [14:18:40] if it's like last time it should be 15min or hard downtime [14:26:31] alright, back on the prompt [14:28:43] the QFXs are up, EXs still booting [14:29:44] everything joined the VC [14:30:14] got alerts for etcd replication [14:30:16] I see some machines now responding to ping again (ores200[34]) [14:30:23] conf2005 [14:30:30] will ack [14:30:33] thank you jynus [14:30:56] should recover soon for what it looks [14:31:20] "RECOVERY - Router interfaces on cr1-codfw is OK" [14:31:23] yeah seems like it [14:31:27] all interfaces ar eup [14:31:38] devices look healthy [14:31:46] so now, if there is something unconnected, it shouldn't be, right? 
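For reference, the conftool side of the morning depool (the 09:14 log line quoted above) boils down to discovery-object edits like these; a sketch assuming the standard confctl selectors, with appservers-ro as the example record:

```bash
# check the current state of the discovery record for codfw
confctl --object-type discovery select 'dnsdisc=appservers-ro,name=codfw' get

# depool codfw for that record (this is what produced "Setting pooled=False" above)
confctl --object-type discovery select 'dnsdisc=appservers-ro,name=codfw' set/pooled=false
```

The edge/geoip side was handled separately with the operations/dns change plus authdns-update, which is why the ten-minute wait for clients to move comes up right after.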
[14:32:12] (ovbiously at tcp layer, application may struggle a bit, etc) [14:32:43] yeah everything should show back up in monitoring [14:33:20] *nod* we're good to proceed with repools ? e.g. prometheus, thanos, etc [14:33:28] checking for things that may have stopped retrying [14:35:07] etcd replication autoresolved as expected [14:35:20] godog: on a network PoV, yes [14:35:39] let's give it a few minutes for things like db replication to catch up [14:35:46] it may take some minuts [14:35:51] XioNoX: cheers, I'll repool the o11y things I know are safe to [14:35:56] (at least before pooling mw) [14:36:18] yeah, stateless things should be ok to repool indeed [14:37:18] We're holding on repooling the rest until k8s upgrade is done [14:37:32] that is true, too [14:37:45] gehel: you can repool your services at your convenience [14:38:10] jbond: ^ same for puppetmaster2003 [14:38:41] moritzm: ^ same for urldownloader (f needed) [14:38:44] XioNoX: o/ I'd need to run https://gerrit.wikimedia.org/r/c/operations/homer/public/+/890834 in a bit, is it safe or would you run towards me with a big hammer if I try to do it? :D [14:40:16] elukey: 1 nit then lgtm [14:40:51] anyone else see something weird or red that shouldn't be? [14:41:26] no unexpected errors now as far as I can see [14:41:31] elukey: I repooled the two ores boxes [14:41:41] XioNoX: ack thanks [14:42:32] XioNoX: thanks! [14:42:48] jbond: have time to talk about https://gerrit.wikimedia.org/r/c/operations/puppet/+/890513 ? [14:43:00] I see a raid issue on analytics1068, but that is unrelated [14:43:12] the ML machines affected by this are all back and modulo some churn are all ok [14:43:14] XioNoX: ack, we'll just keep the current urldownloader active, they're all equal [14:43:25] XioNoX: ack [14:43:48] gehel: keep in mind we haven't repooled the updater that's on wikikube [14:43:50] and "Check unit status of httpbb_hourly_appserver" but I think that happens when there are depool? [14:44:09] jynus: yeah i think so [14:44:23] PROCS CRITICAL: 0 processes with command name 'bird' on doh2002 [14:44:30] claime: Oh, I forgot that one. We depooled the whole service for that right? [14:44:37] gehel: yuo [14:44:39] yup* [14:44:43] jynus: checking [14:44:47] jynus: that's from the depool [14:44:50] some potential timeout on individual services, but I see nothing weird [14:44:52] ganeti/codfw is all fine and healthy [14:44:54] sukhe: you can repool doh2002 [14:45:01] yep done [14:45:09] should be resolving soon [14:45:20] ok, will remove the topic if everybody agrees? [14:46:43] jynus: +1 [14:47:19] vgutierrez, claime should we repool DNS too or wait for the k8s maintenance? [14:47:26] done [14:47:33] great job, XioNoX [14:47:45] jayme: ^^^ [14:47:46] +1 [14:48:05] hmm let me restart pybal first [14:48:11] it's thanks to everybody that nothing bad happened :) [14:48:13] it was needed with row A maintenance [14:48:30] XioNoX: I think you can repool DNS, all A/A are depooled, but I'd rather have the input of the one actually running the maintenance [14:48:45] XioNoX: please do not repool the services yet, DNS should be fine though [14:49:10] XioNoX: agreed! pretty uneventful [14:49:57] Yeah that went smoothly [15:06:53] While people are around, I created the task for eqiad row B (in 1 month) - https://phabricator.wikimedia.org/T330165 [15:13:35] Hm, did some versioning thing just change with our puppet install? I have a bit of code that worked until 10:00 UTC and no longer compiles. (The code looks wrong to me! 
But nevertheless) [15:14:10] The code in question is $mon_hosts.reduce({}), which I believe should actually be $mon_hosts.reduce() [15:14:26] andrewbogott: please pass the link to the job run or error [15:14:44] https://www.irccloud.com/pastebin/70QfIzUm/ [15:15:11] Pretty sure I know how to fix it, just wondering why it compiled until a few hours ago. [15:15:18] oh, it is a production run, I thought it was just a compiler run [15:15:30] Nope, a happy server suddenly unhappy [15:16:18] "The last Puppet run was at Tue Feb 21 10:42:12 UTC 2023 (273 minutes ago)" [15:17:39] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/dns/+/890847 [15:18:49] XioNoX: k8s maintenance is still ongoing, right? [15:19:34] vgutierrez: yeah, my understanding is that DNS could be repooled, but I'm fine either way. I guess safer to wait for the k8s maintenance [15:19:37] +1, text and upload look good [15:23:05] andrewbogott: blame tells dcaro is the person to speak to, merged a patch a few hours ago: https://gerrit.wikimedia.org/r/c/operations/puppet/+/824202 [15:23:52] I wonder if it is just something that the original patch date confused you (it is almost a year old) [15:23:57] oh, you're right. I always forget that the timestamp on a patch is unrealted to when it merged [15:23:59] yep [15:24:04] *unrelated [15:25:19] XioNoX: vgutierrez: Amir.1 is taking the opportunity of mw codfw being depooled to run some drift correction [15:25:22] just fyi [15:25:35] ack [15:26:12] andrewbogott: I like phab's blame if it helps, it is very clear: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/manifests/cloudceph/client/rbd_backy.pp [15:26:46] oh, nice! I should learn how to use that instead [15:38:06] that's git blame basically [15:38:34] yeah, I just pointed it it is a nice gui for it [15:46:11] fyi i use the following alias to show the commiter relative date (%cr) which is usefull for thisngs like this [15:46:19] https://github.com/b4ldr/profile/blob/master/.gitconfig#L14 [15:46:56] * jbond thinks it was stolen from d.caro [15:57:28] * dhinus adds that alias to ~/.gitconfig [15:59:52] andrewbogott: btw. we have to discuss the replica_cnf stuff with raymond when you have some time [16:02:01] vgutierrez, Amir1, claime, to remove any confusion, who should repool codfw (DNS) and when? [16:02:28] XioNoX: akosiaris will handle it [16:03:29] it's fine from my side, I have some left but for complicated reasons they need to be done later [16:04:20] claime: ok, thanks! wanted to make sure we were not doing the spidermen pointing at each other meme [16:04:35] XioNoX: sure, it's better to be clear :D [16:07:12] ack [16:17:19] (feel free to ping me since it's working hours for Traffic NA :) [16:57:54] repooling services, not dns for now [17:05:44] ok [17:08:16] service repooling done [17:14:50] should we pool DNS now? [17:15:59] A few minutes to fix something regarding mw-on-k8s, let hashar finish his scap, then I think we're g2g [17:16:11] ok, np, thanks [17:43:35] seems like we are ready? [17:44:12] Are the PyBal backends alerts due to DNS depool or something else? 
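On the reduce({}) puzzle: without seeing the paste it is hard to say which form the compiler objects to, and as the thread goes on to show the culprit was a freshly merged patch rather than a Puppet upgrade, but behaviour like this is quick to sanity-check locally first. A hedged sketch (the $mon_hosts contents are made up, and this only exercises the reduce-with-initial-hash pattern, not the real rbd_backy.pp code):

```bash
puppet apply -e '
  $mon_hosts = ["cloudcephmon1001", "cloudcephmon1002"]
  # reduce with an explicit empty-hash memo, the pattern under discussion
  $out = $mon_hosts.reduce({}) |$memo, $host| { $memo + { $host => true } }
  notice($out)
'
```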
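The alias jbond links is essentially a pretty-printed git log; the useful piece here is %cr (committer relative date) alongside %ar (author relative date), since it answers "when did this actually land?" rather than "when was it written?". A standalone version without the alias:

```bash
# when did changes to this file land, as opposed to when they were authored?
git log -5 --pretty=format:'%h  authored %ar  committed %cr  %an  %s' -- \
    modules/profile/manifests/cloudceph/client/rbd_backy.pp
```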
[17:44:19] PYBAL CRITICAL - CRITICAL - eventstreams-internal_4992: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2024.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled [17:46:02] yep [17:46:12] let me confirm [17:50:06] claime: [17:50:12] confctl select name='^kubernetes2.*' get|grep '"inactive"' [17:50:18] this list, is this expected? [17:50:23] lemme check [17:51:33] yeah, it's depooled in eqiad too [17:51:39] all thumbor right, in your check too? [17:51:59] we need to remember to depool restbase async from eqiad once codfw is DNS repooled [17:52:12] thumbor codfw looks fine [17:52:22] yeah, it's thumbor on k8s [17:52:32] it's normal that it's not pooled iirc [17:52:52] hnowlan ^ ? [17:53:06] ok, if you are good with the inactive stuff above, that's what the pybal checks are so we should be good to repool but I will wait a bit, just in case someone tells us otherwise [17:53:22] sukhe: akosiaris says go in -serviceops [17:53:29] thumbor inactive is normal [17:53:33] cool! [17:53:35] going for it [17:53:38] ack [17:54:20] I'll depool restbase-async from eqiad once you're done [17:55:30] all done [17:55:35] all yours :) [17:55:38] thanks [17:55:54] Yes, thumbor is depooled on purpose right now. There is an open bug for causing issues to swift [17:56:29] https://phabricator.wikimedia.org/T328033 [17:57:01] akosiaris: thanks, good to know [17:57:04] ok depooling restbase-async [17:57:09] from eqiad [17:57:16] claime: yep, unpooled thumbor is normal [18:03:24] restbase-async depooled, I think we're all good now [18:03:33] thanks, alL! [18:06:42] Nice [19:46:41] jelto, jbond, ok if I merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/890843? I get test my race condition every night at 00:05 :) [19:48:29] andrewbogott: i afraid not i want morit.zm to give that a review first just to make sure its all sound as it affects all condifured systemd.timeres [19:48:51] ill ping and them and get it merged tomorrow thogh [19:48:54] ok! There's always tomorrow [19:49:20] We could add those other lines conditionally on having 'after' set, then it would only touch two timers. [19:49:53] true but i think that they are usefull in the genral case so would prefer to get it merged as is if we can [19:50:16] 'k [21:28:29] Our SWEs have sudo access on our existing airflow hosts (an-airflow1001-1004) but not on a newly-provisioned host (1005). Does anyone know where I might see/change this permissions (guessing Puppet)? [21:30:44] yeah, it sounds like the users are all in the right groups, but the group doesn't have access to the new host [21:30:51] inflatador: yep -- see https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/admin/README.md and https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/admin/data/data.yaml [21:30:52] those are configured in hiera, let's see [21:31:55] I see from manifests/site.pp that they all have different roles, e.g. an-airflow1001 is `role(search::airflow)` but an-airflow1005 is `role(analytics_cluster::airflow::search)` -- first of all, is that intended? [21:31:59] interestingly --- yeah, what rzl said [21:32:20] (I'm guessing it is, but just to double-check) [21:32:58] ebernhardson ^^ do you think we need to tweak the role here? 
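Back on the kubernetes2* check at 17:50: the inactive entries can be narrowed down per service instead of eyeballing the grep output. A sketch, assuming the nodes carry a service=thumbor tag in conftool the way the discussion suggests (the exact tag layout is an assumption):

```bash
# everything inactive on kubernetes2*, i.e. what was run above
confctl select 'name=^kubernetes2.*' get | grep '"inactive"'

# the same question asked per service; if only thumbor shows up, it matches the
# intentional depool tracked in T328033
confctl select 'service=thumbor,name=^kubernetes2.*' get
```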
[21:33:21] if they're supposed to be different roles that's fine, it just tells us what hiera file to edit :) [21:33:36] I'm also wondering about uh [21:33:59] role::analytics_cluster::airflow::search includes profile::analytics::cluster::airflow which includes profile::airflow [21:34:03] I'm not sure, we're in one of those fun situations where we're migrating the airflow stuff while other, related migrations are happening at/around the same time ;) [21:34:10] OTOH, role::search::airflow includes profile::analytics::airflow [21:34:54] inflatador: okay, so, just to skip to the end so you have this when you're ready for it :) if indeed you want to keep the current role, take a look at hieradata/role/common/analytics_cluster/airflow/search.yaml [21:35:26] the profile::admin::groups section there controls which groups (from data.yaml) have admin rights on an-airflow1005 [21:35:51] compare to hieradata/role/common/search/airflow.yaml for an-airflow1001, and make them match if you want them to :) [21:37:41] Excellent! Will take a look. To be clear, I think they just need to ability to sudo into analytics-search user, not complete root but will verify [21:37:50] er..."the ability" [21:38:20] nod -- *that's* configured in data.yaml, so you can check there to see what rights are granted [21:38:28] you can always make a new admin group, give it exactly the sudo privs you want and apply that only to your new role. actually that is even nicer than full root, if you are up for it [21:38:30] and if you need a group with different rights, that's where you'd add it [21:40:09] or you can have it both ways. traditionally the groups ending in -roots have full root and the ones ending in -admins have only some sudo privs. you can have 2 or more groups on the same role. [21:43:01] Thanks! Still need to clarify with the analytics team, but I'm thinking we probably want to extend the privileges for airflow-search-admins so they can become the deploy user (analytics-search) [21:44:27] looks like you already have that: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#991 [21:45:05] err, different line [21:45:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#701 [21:46:26] oops yes, thanks [21:46:59] that's analytics-search-users rather than airflow-search-admins but the result is the same at least [21:47:09] (rather than mine which was just wrong-wrong) [21:48:26] np , thanks for drilling down on this. Don't want to yolo permissions ;) [21:50:49] btw inflatador, have you used the puppet compiler before? it will be useful for writing the patch [21:51:51] cdanis PCC? Yeah, I use it but LMK if there's something specific I need to focus on [21:52:07] nope, nothing beyond the obvious :) [23:06:51] would be nice if we had a global variable like we have $CACHES, $DEPLOYMENT_HOSTS etc.. but for $MONITORING_HOSTS. (in modules/base/templates/firewall/defs.erb). making firewall rules to allow the monitoring hosts to connect is a common thing and I would like to outsource having to think about which hosts exactly are in that at any given time [23:49:55] turns out monitoring_hosts isn't the one I need, it's more like prometheus_all_nodes for this. both in hierdata/common.yaml. but that was kind of the point that I should not even have to know and look them up [23:51:28] now if I do.. 
it also means the puppet role breaks in cloud until I add "fake prometheus hosts" in the project hiera... or maybe not, I'm not sure
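Going back to the an-airflow1005 permissions thread: putting rzl's and mutante's pointers together, the fix is most likely a small hiera change plus a compiler run. A sketch of how to crib it from the working role (the file paths are the ones quoted in the discussion; the YAML in the comment is illustrative, not the real diff):

```bash
# see what the working airflow role grants, then mirror it in the new role's hiera file
grep -n -A5 'profile::admin::groups' \
    hieradata/role/common/search/airflow.yaml \
    hieradata/role/common/analytics_cluster/airflow/search.yaml

# the eventual change would look roughly like this in the second file:
#   profile::admin::groups:
#     - analytics-search-users   # per data.yaml, grants sudo -u analytics-search
# then sanity-check with PCC against an-airflow1005 before merging
```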
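And on the $MONITORING_HOSTS idea that closes the log: before adding a new global to defs.erb it is worth surveying how existing rules already consume prometheus_all_nodes, since that turned out to be the list actually needed. A quick, non-authoritative look from an operations/puppet checkout (just greps):

```bash
# where the list is defined and which profiles already reference it
git grep -n 'prometheus_all_nodes' hieradata/common.yaml | head
git grep -ln 'prometheus_all_nodes' modules/profile | head

# and the existing firewall globals a new define would sit next to
grep -n 'CACHES\|DEPLOYMENT_HOSTS' modules/base/templates/firewall/defs.erb
```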