[08:26:38] _joe_: it seems that you added a new group to ops_members in puppet 2752f863a96, that needs to be added to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openldap/files/cross-validate-accounts.py#305 too to prevent the cross-validate-accounts script from reporting them (see email from 4h ago to root@) [08:26:54] <_joe_> sigh yes [08:27:24] <_joe_> I noticed, I wanted to finish my refactor of another python script and I intend to fix it [08:27:42] moritzm: do you think we could get that list automated somehow or would that invalidate the very thing it's checking? [08:27:55] <_joe_> volans: the latter, I already looked into it [08:28:02] <_joe_> last time I forgot :D [08:28:23] lol [08:28:41] we could at least have CI check that though [08:28:56] * volans hides :-P [08:34:05] Hey SREs, in case you missed it, we'll have more trains as an experiment soon (https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/RBTZLASVJTU3RMHEKNSCTAOF76ZXEIUG/). One interesting question for SRE is spelled out in https://phabricator.wikimedia.org/T303758 feel free to chime in [09:54:25] Happy St Patrick's Day SREino's!! [09:54:52] <_joe_> topranks|off: ahah I was worried you were unwell [09:55:05] <_joe_> happy st.patrick day :) [09:55:12] ☘️ [09:56:19] Thankfully I got sick the week before the festivities. Although tbh I avoid Dublin City on this day always. [10:27:49] enjoy St Patrick! [10:29:31] oh, is wiki.willy the right person to tag in eqiad hardware jobs (e.g. host decommissioning) like one does pa.paul in codfw?
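The CI check volans floats here — making sure every group added to ops_members in puppet also appears in the hard-coded known-groups list in cross-validate-accounts.py — amounts to a simple set difference. A minimal sketch (the function name and group names are hypothetical, not the real script's API or data):

```python
def groups_missing_from_known_list(puppet_groups, known_groups):
    """Return groups defined in puppet that the validation script does
    not yet know about, so CI can fail before the cron emails root@."""
    return sorted(set(puppet_groups) - set(known_groups))

# Hypothetical data, not the real group lists.
puppet_groups = ["ops", "analytics-admins", "new-sre-team"]
known_groups = ["ops", "analytics-admins"]
print(groups_missing_from_known_list(puppet_groups, known_groups))
# -> ['new-sre-team']
```

As _joe_ notes, fully automating the list would defeat the cross-check, but a CI gate like this would at least fail the puppet change loudly instead of waiting for the next cron email.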
[10:30:12] Emperor: ops-eqiad is the tag to add [10:30:31] we have all ops-$DC tags [10:30:40] sorry, dcops-$DC [10:31:20] * volans undo last sorry, was right the first time [10:33:41] volans: sure, but the decom checklist says "reassign task from service owner to DC ops team member and site project (ops-sitename) depending on site of server" (nb and); so the last decom task I did in codfw I assigned to pa.paul as well as the ops-codfw tag [10:34:38] Emperor: eqiad are cmjohnson1 and jclark-ctr... so hard to say who to pick, they both can do it [10:35:00] hence the preference for the specific DC tag so they can auto-assign, but if in doubt ask in #wikimedia-dcops [12:42:00] Amir1: you asked a few days ago if it was possible to use multiple pcc Hosts selections e.g. [12:42:03] O:mariadb::core_multiinstance,O:mariadb::misc::analytics::backup,O:mariadb::misc::multiinstance [12:42:13] well it's now possible https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/771483 :) [12:42:23] e.g. https://puppet-compiler.wmflabs.org/pcc-worker1001/34394/ [13:22:02] <_joe_> jbond: did you make any progress on running pcc locally? [13:22:39] <_joe_> I was thinking we could provide a docker-compose recipe, and maintain a puppetdb-daily image with puppetdb already populated [13:23:06] <_joe_> so that anyone can just pull the newest image, and run the puppet compiler on their local code [13:23:21] <_joe_> I am asking because I remember you and dcaro discussing it [13:25:18] jbond: Thanks for this! If you have some ideas off the top of your head about T281249#7785245, that'd be really appreciated [13:25:18] T281249: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 [13:28:01] _joe_: we made some progress towards that effort, in that in the puppet-diffs project we have split out the puppetdb and pcc functions.
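The change jbond links makes the pcc Hosts field accept multiple comma-separated selectors. Conceptually, that is just splitting the field into independent selections before resolving each one; a sketch (the function name is made up, and the assumption that entries split cleanly on commas is mine, not the patch's actual parser):

```python
def split_pcc_hosts(selection):
    """Split a pcc Hosts field such as 'O:role::a,O:role::b'
    into its individual selectors, dropping empty entries."""
    return [part.strip() for part in selection.split(",") if part.strip()]

print(split_pcc_hosts(
    "O:mariadb::core_multiinstance,O:mariadb::misc::multiinstance"))
# -> ['O:mariadb::core_multiinstance', 'O:mariadb::misc::multiinstance']
```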
i was potentially thinking of having a public puppetdb instance that some local container could talk to, however i think that connections from `puppet master compile` are hard-coded to use client auth and that's where things stalled. [13:29:13] it's also worth noting that dca.ro added some scripts under utils (from the puppet-compiler repo) which allow you to set up a local environment to run pcc, but of course fails on anything that requires puppetdb [13:29:39] Amir1: yes will take a look in a sec [13:31:09] thanks [13:31:10] no rush [13:31:41] the other thing in relation to the puppetdb-daily is should we support cloud hosts in the same image and if so which projects. also the populate puppet script takes quite some time (in puppet-diffs we try to populate the db with all hosts now) [13:32:20] (we could improve the performance of the populate script, just haven't yet) [14:19:52] XioNoX: https://phabricator.wikimedia.org/T304001#7785674 :) [14:19:59] i can try and make a patch if that would be helpful [14:21:00] do we have our puppet CA and/or WM internal CA certs posted publicly anywhere? [14:21:19] ottomata: it's ok, I'll do it [14:23:08] inflatador: yes, in the puppet repo, e.g. modules/profile/files/puppet/ca.production.pem [14:23:14] okay thanks [14:23:34] just started, slightly lost, would have found my way but i'll just see yours :) [14:23:47] akosiaris awesome, thanks [14:26:54] ottomata: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/771612 I did it in a way that will keep the list updated, at the cost of adding the POPs prometheus hosts as well [14:27:05] anyway, it's all going away [14:29:39] oh huh cool, thank you! [14:29:50] XioNoX: is there an apply step? [14:30:50] ottomata: in a rush? ;) [14:30:55] (deployed) [14:31:14] haha, well i just want to do my stuff before meeting times start for the day [14:31:15] THANK YOU!
[14:31:54] Stupid puppet question, if I may: profile::swift::stats_reporter is used in 2 places (modules/profile/manifests/swift/proxy.pp and modules/profile/manifests/thanos/swift/frontend.pp) and in both cases, the swift_cluster parameter is passed. Why, then, does modules/profile/manifests/swift/stats_reporter.pp also have a lookup in its definition (String $swift_cluster = lookup('profile::swift::cluster'),)? Is this so that one could in the [14:31:54] future use profile::swift::stats_reporter _without_ specifying $swift_cluster and have it looked up, or is it duplication? [14:33:27] it seems odd (and confusing) to have swift/proxy.pp lookup profile::swift::cluster and pass it to swift::stats_reporter only for the latter to also be looking it up; so I feel sure I'm missing something... [14:37:09] (it's not marked as Optional) [14:40:38] Emperor: I don't see a reason either tbh. It may just be that type hints were added later on and it did indeed have a sane default at the beginning [15:17:39] inflatador: https://apt.wikimedia.org/wikimedia/pool/main/w/wmf-certificates/ may be a better option; it includes the puppet ca and the pki.discovery.wmnet root CA [15:26:23] <_joe_> yeah inflatador sorry I see my manager gave you outdated advice. What jbond said is the way to go nowadays if you need to have our internal PKIs available somewhere. [15:26:37] * _joe_ runs before alex can catch him [15:26:59] <_joe_> (he is way faster than me, I should note) [15:27:55] I want to see an SRE footrace now, at our eventual next offsite :) [15:28:37] maybe after dinner and drinks, to make it more interesting!
[15:32:40] * Emperor has a strict no-running policy ;p [15:35:00] akosiaris: OK, so I took the lookup out, and now I get CI errors of the form "modules/profile/manifests/swift/stats_reporter.pp:3 wmf-style: Parameter 'swift_cluster' of class 'profile::swift::stats_reporter' has no call to lookup" [15:37:18] Which is odd because I think that means puppet (or the CI) is considering all the parameters to stats_reporter as in effect optional, because there's always the fallback to lookup ? [15:37:23] <-- hopelessly confused [15:40:37] if profiles must always have their parameters set via lookup, why are we parameterising them when calling them? [15:41:51] <_joe_> bblack: after drinks, I might stand a chance [15:42:05] <_joe_> oh wait we have a bunch of irish and british people, nevermind. [15:42:23] e.g. if stats_reporter _must_ lookup('profile::swift::cluster'), then why does modules/profile/manifests/swift/proxy.pp call 'profile::swift::stats_reporter' and pass a swift_cluster parameter? [15:43:15] <_joe_> Emperor: you're assuming what you read is optimized or even makes sense. [15:43:16] surely it doesn't make sense to effectively define the swift_cluster parameter to profile::swift::stats_reporter (at least) twice? [15:43:21] <_joe_> that's the error. [15:43:54] <_joe_> Emperor: sorry, I meant to answer your last inquiry but I'm lagged with work :/ [15:43:55] never assume the person before you actually knew what they were doing and did it perfectly, even if it was yourself :) [15:44:25] or in my case, especially if it's myself! [15:46:34] I am at that point where I think I know even less how this is all meant to work than I did when I started :( [15:48:57] that's a good start! when you reach the point of knowing absolutely nothing, you will have achieved enlightenment :) [15:49:43] _joe_: no worries.
I will resist sending you another mail in the meantime :) [15:50:10] <_joe_> Emperor: I'll try to find some time tomorrow morning when we can pair for a sec [15:50:31] _joe_: I would appreciate that, but I know you're busy! [16:02:07] Emperor: ah yeah, it's a profile, that's why, and our wmf-style check asks that we always do explicit lookups so that implicit lookups can never happen. See bullet point 1 of https://wikitech.wikimedia.org/wiki/Puppet_coding#Profiles [16:03:23] now the big question is why is a class including just 2 classes a profile class to begin with, especially if the only users of it don't just include/require it but rather specify all parameters [16:04:17] the reason btw we don't want implicit hiera lookups is that they are a mess to debug. [16:04:49] you end up with surprising values passed to parameters in ways you did not expect [16:45:25] volans: So I am reimaging kubernetes10{18..22} and I am met (for all of them) with a dreaded Unable to verify that the host is inside the Debian installer, please verify manually with: sudo install_console kubernetes1018.eqiad.wmnet [16:45:27] message [16:45:35] respectively per host ofc [16:46:05] running that, asks for root pass and I log in to the box [16:46:13] and is it the old OS? [16:46:14] but it has not been reimaged at all [16:46:17] yes the old OS [16:46:28] ok so that happens for 2 possible reasons [16:46:56] 1) the host did not PXE boot even if we set the next boot to be PXE and checked that the change was applied according to IPMI (it has happened) [16:47:10] 2) the host failed PXE boot and then fell back to booting from the existing OS on disk [16:47:14] let me check the row [16:47:45] akosiaris: all of them same issue?
[16:47:56] I think so, let me double check [16:49:13] volans: yup, confirmed [16:49:35] all of them even display the motd about when puppet was disabled and a "Host reimage" message [16:49:36] weird, let me check some logs [16:49:41] they did reboot however [16:49:51] XioNoX: did anything change for the cloudvirt host pxe issues that might justify this? ^^^ [16:50:09] possible failed pxe, I'm starting to debug now [16:51:12] volans: not that I'm aware of [16:51:35] it's across all 4 rows btw [16:51:40] we got ETOOMANY pxe failures today, seems that something happened [16:51:54] 2 boxes in b and 1 box per a, c, d [16:54:08] let me dig a bit... [16:55:06] ahh i also have one in e [16:55:08] dumpsdata1006 [16:55:11] so lots of pxe fails [16:55:18] i see it hit install1003 and the syslog shows it [16:55:21] but it never makes it back to the host [16:55:26] and host times out [16:57:50] volans: so it's all hosts not just cloud [16:57:53] if that helps for troubleshooting [16:58:00] well, not 'all', let me rephrase [16:58:02] not just wmcs hosts [16:58:05] not sure [16:58:21] there have been other reimages [16:58:27] now mine is in row e though [16:58:29] before [16:58:32] but i imaged a system in row f before [16:58:40] so yeah [17:00:11] so, I can see DHCPACK messages for all 5 kubernetes hosts in /var/log/messages, so that part must be working [17:00:32] there's an open tftp service alert for install1003, not sure if related [17:00:52] there we go ^ [17:01:07] XioNoX: did you leave your debug atftp by any chance? [17:01:20] volans: nop [17:01:26] ok checking [17:02:32] install1003 wmf-auto-restart: INFO: 2022-03-17 17:02:16,249 : Could not query the PID of atftpd: 1 [17:02:43] perfect... atftpd became init :P [17:02:54] I thought that was emacs's job [17:03:16] should I try a restart ? [17:03:22] I've just restarted it [17:03:43] ok, I guess I restart all 5 of my cookbooks, right ? [17:04:09] herron: thanks!
that saved us quite some time [17:04:16] if they failed yes, and please do it with 1~2 minutes apart from each other [17:04:23] akosiaris: np! [17:04:36] wait, pxe fixed? [17:04:56] coordinate between each other to avoid all reimages at the same time please [17:05:11] they might fail the puppet run on alert1001 if too many at the same time [17:05:17] ok folks i got 1 host to reimage but it's going to fail a lot due to partman issues [17:05:31] so if you want me to stall for 30min to let your normal installs go i can [17:05:55] seems best, since most installs should just go, and i know mine will have recipe issues to troubleshoot [17:05:58] volans: what's the threshold? [17:06:05] I mean is 5 or 6 a lot ? [17:06:31] cause I got 5 and rob has 1, it sure doesn't sound like a lot of reimages [17:06:58] akosiaris: puppet run on alert1001 takes ~1 minute and the run-puppet-agent run by the reimage has some more retries [17:07:01] * volans checking [17:09:02] we run it with --attempts 30 [17:10:21] each attempt has a sleep 10 [17:10:58] akosiaris: I usually suggest having a 1~2 minute delay between the starts of the reimages [17:11:15] then you can pile many [17:11:23] ok, that's what I did I guess [17:11:26] <_joe_> volans: you should add poolcounter support to reimaging [17:11:31] yes, that ^ [17:11:35] lol [17:11:43] <_joe_> we have python-poolcounter :) [17:11:46] last puppet run on alert1001 started 17:48:11 - 17:50:35 [17:11:59] previous one 17:18:36 - 17:21:11 [17:12:43] _joe_: you know I have plans to add locks for cookbooks integrated into spicerack... time permitting [17:13:12] I don't like the poolcounter approach because it's dc-local, but we can discuss details once we have time to work on it [17:15:45] akosiaris: you know the fun part?
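The spacing volans recommends — starting each reimage 1~2 minutes after the previous one so the puppet runs on alert1001 (each taking ~1 minute) don't pile up — is just a fixed-offset schedule. A sketch, with a hypothetical helper that is not part of the actual reimage cookbook:

```python
def reimage_start_offsets(n_hosts, delay_s=90):
    """Seconds after 'now' at which each of n_hosts reimages should
    start, spaced delay_s apart.  Per the chat, the cookbook's own
    retry loop (--attempts 30, sleep 10 each) also gives every host
    a ~300 s budget for the final puppet-agent run to succeed."""
    return [i * delay_s for i in range(n_hosts)]

print(reimage_start_offsets(5))
# -> [0, 90, 180, 270, 360]
```

A proper fix, as discussed, would be a shared lock (poolcounter or a spicerack-level lock) rather than relying on operators to stagger starts by hand.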
atftpd was ok from systemctl PoV so the systemd alert was not firing [17:16:09] yes, cause ExecStart is calling /etc/init.d/atftpd [17:16:14] ew [17:21:13] volans: seems i still have the same issue [17:21:30] dumpsdata1006 spinning, install1003 sending it dhcp response [17:21:32] but not hitting the server? [17:23:56] robh: it's in the new rows, let me have a look [17:24:03] it's a new row yes [17:24:15] robh: do you have the IP or hostname it sends the dhcp response to? [17:24:24] I don't see incoming dhcp requests on tcpdump [17:24:24] Mar 17 17:24:13 install1003 dhcpd[20555]: DHCPOFFER on 10.64.130.3 to e4:3d:1a:ae:59:c8 via 10.64.130.1 [17:24:24] Mar 17 17:24:17 install1003 dhcpd[20555]: DHCPDISCOVER from e4:3d:1a:ae:59:c8 via 10.64.130.1 [17:25:16] that is the mac of the host checked via idrac [17:25:18] robh: yeah I know what's going on, one sec [17:25:32] cool! [17:26:16] sorry i just have 'fight with new raid controller' as my high priority item and the installer tests are one of those things hehe [17:26:26] hence my pestering [17:28:19] robh: try again [17:28:51] trying [17:34:57] <_joe_> jayme / elukey / akosiaris the new bullseye k8s hosts fail to download the mediawiki images in eqiad it seems [17:35:02] <_joe_> with "auth required" [17:35:22] needs puppet run on registry [17:35:48] there is this rant in the nginx config... [17:35:56] XioNoX: works [17:35:57] <_joe_> ahhhh lol [17:36:02] <_joe_> jayme: by me or you? [17:36:16] fsero IIRC :-) [17:36:25] we did nothing wrong [17:36:30] robh: nice :) [17:36:36] <_joe_> ahah [17:37:13] <_joe_> akosiaris / elukey please do not reimage nodes during deployment windows then, or if you do so run puppet on the registry before adding the node to the cluster [17:37:54] this is ofc. only relevant for *new* nodes. Not for reimaging [17:38:01] <_joe_> yes [17:38:49] <_joe_> jayme: maybe we need to make this clear in the docs? [17:40:08] I guess that makes sense...
[17:40:53] _joe_: I'm receiving this as an instruction :-p [17:41:28] <_joe_> jayme: if you prefer to deal with yet another stuck deployment instead, :P [17:42:14] fyi all i just pushed out a change which is adding some junk to shell logins for bash. fixing now [17:42:24] nono, please go ahead :) [17:43:15] <_joe_> jayme: did you run puppet on the registries by any chance? [17:43:24] nope [17:43:45] <_joe_> ack, doing so [17:50:29] hm...guess what. There is no easy way of preventing this as the nodes join the cluster automatically on reimage [18:13:50] what are all those declare -x I see when I log in on deploy1002? [18:13:55] I was about to cordon those nodes [18:15:22] akosiaris: fix is in place in puppet just needs a puppet run to fix [18:15:40] jbond: cool, thanks [18:16:55] i think puppet should finish rolling this out by 18:30 so please ping if people are still seeing it after that [18:18:40] it's already ok on deploy1002 [18:19:21] kubernetes1018-22 cordoned now, will do a dry-run tomorrow and pool them into service [18:22:47] ack! [18:24:22] ahh cool [18:24:31] i just saw it on a host and wondered what's up