[08:26:38] <volans>	 _joe_: it seems that you added a new group to ops_members in puppet 2752f863a96, that needs to be added to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openldap/files/cross-validate-accounts.py#305 too to prevent the cross-validate-accounts script to report them (see email from 4h ago to root@)
[08:26:54] <_joe_>	 sigh yes
[08:27:24] <_joe_>	 I noticed, I wanted to finish my refactor of another python script and I intend to fix it
[08:27:42] <volans>	 moritzm: do you think we could get that list automated somehow or that would invalidate the very thing it's checking?
[08:27:55] <_joe_>	 volans: the latter, I already looked into it
[08:28:02] <_joe_>	 last time I forgot :D
[08:28:23] <volans>	 lol
[08:28:41] <volans>	 we could at least have CI check that though
[08:28:56] * volans hides :-P
[08:34:05] <akosiaris>	 Hey SREs, in case you missed it, we 'll have more trains as an experiment soon (https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/RBTZLASVJTU3RMHEKNSCTAOF76ZXEIUG/). One interesting question for SRE is spelled out in https://phabricator.wikimedia.org/T303758 feel free to chime in
[09:54:25] <topranks>	 Happy St Patrick's Day SREino's!!  
[09:54:52] <_joe_>	 topranks|off: ahah I was worried you were unwell
[09:55:05] <_joe_>	 happy st.patrick day :)
[09:55:12] <btullis>	 ☘️
[09:56:19] <topranks|off>	 Thankfully I got sick the week before the festivities.  Although tbh I avoid Dublin City on this day always.
[10:27:49] <volans>	 enjoy St Patrick!
[10:29:31] <Emperor>	 oh, is wiki.willy the right person to tag in eqiad hardware jobs (e.g. host decommissioning) like one does pa.paul in codfw?
[10:30:12] <volans>	 Emperor: ops-eqiad is the tag to add
[10:30:31] <volans>	 we have all ops-$DC tags
[10:30:40] <volans>	 sorry, dcops-$DC
[10:31:20] * volans undo last sorry, was right the first time
[10:33:41] <Emperor>	 volans: sure, but the decom checklist says "reassign task from service owner to DC ops team member and site project (ops-sitename) depending on site of server" (nb and); so the last decom task I did in codfw I assigned to pa.paul as well as the ops-codfw tag
[10:34:38] <volans>	 Emperor: eqiad are cmjohnson1 and jclark-ctr... so hard to say who to pick, they both can do it
[10:35:00] <volans>	 hence the preference of the specific DC tag so they can auto-assing, but if in doubt ask in #wikimedia-dcops
[12:42:00] <jbond>	 Amir1: you asked a few days ago if it was possible to use multiple pcc Hosts selections e.g. 
[12:42:03] <jbond>	 O:mariadb::core_multiinstance,O:mariadb::misc::analytics::backup,O:mariadb::misc::multiinstance
[12:42:13] <jbond>	 well its now possible https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/771483 :)
[12:42:23] <jbond>	 e.g. https://puppet-compiler.wmflabs.org/pcc-worker1001/34394/
[13:22:02] <_joe_>	 jbond: did you make any progress on running pcc locally?
[13:22:39] <_joe_>	 I was thinking we could provide a docker-compose recipe, and maintain a puppetdb-daily image with puppetdb already populated
[13:23:06] <_joe_>	 so that anyone can just pull the newest image, and run the puppet compiler on their local code
[13:23:21] <_joe_>	 I am asking because I remember you and dcaro discussing it
[13:25:18] <Amir1>	 jbond: Thanks for this!  If you have some ideas from top of your head about T281249#7785245, that'd be really appreciated 
[13:25:18] <stashbot>	 T281249: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249
[13:28:01] <jbond>	 _joe_: we made some progress towards that effort. inthat in the puppet-diffs project we havce split out the puppetdb and pcc functions.  i was potentially thinking of having a public puppetdb instance that some local container could talk too, however i think that client auth connections from `puppet master compile` are hard coded to use client auth and thats where things stalled.
[13:29:13] <jbond>	 its also worth noting that dca.ro added some scripts under utils (from the puppet-compile repo) which allows you to set up a local envrionemtn to run pcc, but of course fails on anything that requires puppet db 
[13:29:39] <jbond>	 Amir1: yes will takje a look in a sec
[13:31:09] <Amir1>	 thanks
[13:31:10] <Amir1>	 no rush
[13:31:41] <jbond>	 the other thing in relation to the puppetdb-daily is should we support cloud hosts in the same image and if so which projects.  also the populate puppet script take quite some time (in puppet-diffs we try to  populate the db all hosts now)
[13:32:20] <jbond>	 (we could improve the preformance of the populate script, just havn't yet)
[14:19:52] <ottomata>	 XioNoX: https://phabricator.wikimedia.org/T304001#7785674 :)
[14:19:59] <ottomata>	 i can try and make a patch if that would be helpful
[14:21:00] <inflatador>	 do we have our puppet CA/and or WM internal CA certs posted publicly anywhere?
[14:21:19] <XioNoX>	 ottomata: it's ok, I'll do it
[14:23:08] <akosiaris>	 inflatador: yes, in the puppet repo, e.g. modules/profile/files/puppet/ca.production.pem
[14:23:14] <ottomata>	 okay thanks
[14:23:34] <ottomata>	 just started, slightly lost, would have found my way but i'll just see yours :)
[14:23:47] <inflatador>	 akosiaris awesome, thanks
[14:26:54] <XioNoX>	 ottomata: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/771612 I did it in a way that will keep the list updated, at the cost of adding the POPs prometheus hosts as well
[14:27:05] <XioNoX>	 anyway, it's all going away
[14:29:39] <ottomata>	 oh huh cool, thank you!
[14:29:50] <ottomata>	 XioNoX:  is there an apply step? 
[14:30:50] <XioNoX>	 ottomata: in a rush? ;)
[14:30:55] <XioNoX>	 (deployed)
[14:31:14] <ottomata>	 haha, well i just want to do my stuff before meeting times start for the day
[14:31:15] <ottomata>	 THANK YOU!
[14:31:54] <Emperor>	 Stupid puppet question, if I may: profile::swift::stats_reporter is used in 2 places (modules/profile/manifests/swift/proxy.pp and modules/profile/manifests/thanos/swift/frontend.pp) and in both cases, the swift_cluster parameter is passed. Why, then, does modules/profile/manifests/swift/stats_reporter.pp also have a lookup in its definition (String $swift_cluster = lookup('profile::swift::cluster'),)? Is this so that one could in the
[14:31:54] <Emperor>	 future use profile::swift::stats_reporter _without_ specifying $swift_cluster and have it looked up, or is it duplication?
[14:33:27] <Emperor>	 it seems odd (and confusing) to have swift/proxy.pp lookup profile::swift::cluster and pass it to swift::stats_reporter only for the latter to also be looking it up; so I feel sure I'm missing something...
[14:37:09] <Emperor>	 (it's not marked as Optional)
[14:40:38] <akosiaris>	 Emperor: I don't see a reason either tbh. It just may be that type hints were added later on and it was indeed at the beginning having a sane default
[15:17:39] <jbond>	 inflatador: https://apt.wikimedia.org/wikimedia/pool/main/w/wmf-certificates/ may be a better option it includes the puppet ca and the pki.discover.wmnet root CA
[15:26:23] <_joe_>	 yeah inflatador sorry I see my manager gave you outdated advice. What jbond said is the way to go nowadays if you need to have our internal PKIs available somewhere.
[15:26:37] * _joe_ runs before alex can catch him
[15:26:59] <_joe_>	 (he is way faster than me, I should note)
[15:27:55] <bblack>	 I want to see an SRE footrace now, at our eventual next offsite :)
[15:28:37] <bblack>	 maybe after dinner and drinks, to make it more interesting!
[15:32:40] * Emperor has a strict no-running policy ;p
[15:35:00] <Emperor>	 akosiaris: OK, so I took the lookup out, and now I get CI errors of the form "modules/profile/manifests/swift/stats_reporter.pp:3 wmf-style: Parameter 'swift_cluster' of class 'profile::swift::stats_reporter' has no call to lookup"
[15:37:18] <Emperor>	 Which is odd because I think that means puppet (or the CI) is considering all the parameters to stats_reporter as in effect optional, because there's always the fallback to lookup ? 
[15:37:23] <Emperor>	 <-- hopelessly confused
[15:40:37] <Emperor>	 if profiles must always have their parameters set via lookup, why are we parameterising them when calling them?
[15:41:51] <_joe_>	 bblack: after drinks, I might stand a chance
[15:42:05] <_joe_>	 oh wait we have a bunch of irish and british people, nevermind.
[15:42:23] <Emperor>	 e.g. if stats_reporter _must_ lookup('profile::swift::cluster'), then why does modules/profile/manifests/swift/proxy.pp call 'profile::siwft::stats_reporter' and pass a swift_cluster parameter?
[15:43:15] <_joe_>	 Emperor: you're assuming what you read is optimized or even makes sense.
[15:43:16] <Emperor>	 surely it doesn't make sense to effectively define the swift_cluster parameter to profile::swift:;stats_reporter (at least) twice?
[15:43:21] <_joe_>	 that's the error.
[15:43:54] <_joe_>	 Emperor: sorry, I meant to answer to your last inquiry but I'm lagged with work :/
[15:43:55] <bblack>	 never assume the person before you actually knew what they were doing and did it perfectly, even if it was yourself :)
[15:44:25] <bblack>	 or in my case, especially if it's myself!
[15:46:34] <Emperor>	 I am at that point where I think I know even less how this is all meant to work than I did when I started :(
[15:48:57] <bblack>	 that's a good start! when you reach the point of knowing absolutely nothing, you will have achieved enlightenment :)
[15:49:43] <Emperor>	 _joe_: no worries. I will resist sending you another mail in the mean time :)
[15:50:10] <_joe_>	 Emperor: I'll try to find some time tomorrow morning when we can pair for a sec
[15:50:31] <Emperor>	 _joe_: I would appreciate that, but I know you're busy!
[16:02:07] <akosiaris>	 Emperor: ah yeah, it's a profile, that's why and ouf wmf-style asks that we always do explicit lookups so that implicit lookups can never happen. See bullet point 1 of https://wikitech.wikimedia.org/wiki/Puppet_coding#Profiles
[16:03:23] <akosiaris>	 now the big question is why is a class including just 2 classes a profile class to begin with, especially if the only users of it don't just include/require it but rather specify all parameters
[16:04:17] <akosiaris>	 the reason btw we don't want implicit hiera lookups is that they are a mess to debug. 
[16:04:49] <akosiaris>	 you end up with surprising values passed to parameters in way you did not expect
[16:45:25] <akosiaris>	 volans: So I am reimaging kubernetes10{18..22} and I am met (for all of them) with a dreaded Unable to verify that the host is inside the Debian installer, please verify manually with: sudo install_console kubernetes1018.eqiad.wmnet
[16:45:27] <akosiaris>	 message
[16:45:35] <akosiaris>	 respectively per host ofc
[16:46:05] <akosiaris>	 running that, asks for root pass and I login in the box
[16:46:13] <volans>	 and is the old OS?
[16:46:14] <akosiaris>	 but it has not been reimaged at all
[16:46:17] <akosiaris>	 yes the old OS
[16:46:28] <volans>	 ok so that happends for 2 possible reasons
[16:46:56] <volans>	 1) the host did not PXE boot even if we set the next boot to be PXE and checked that the change was applied according to IPMI (happened)
[16:47:10] <volans>	 2) the host failed PXE boot and then fallback to boot from existing OS on disk
[16:47:14] <volans>	 let me check the row
[16:47:45] <volans>	 akosiaris: all of them same issue?
[16:47:56] <akosiaris>	 I think so, let me double check
[16:49:13] <akosiaris>	 volans: yup, confirmed
[16:49:35] <akosiaris>	 all of them even display the motd about when puppet was disabled and a "Host reimage" message
[16:49:36] <volans>	 weird, let me check some logs
[16:49:41] <akosiaris>	 they did reboot however
[16:49:51] <volans>	 XioNoX: did anything change for the cloudvirt host pxe issues that might justify this? ^^^
[16:50:09] <volans>	 possible failed pxe, I'm starting to debug now
[16:51:12] <XioNoX>	 volans: not that I'm aware of
[16:51:35] <akosiaris>	 it's across all 4 rows btw
[16:51:40] <volans>	 we got ETOOMANY pxe failures today, seems that something happened
[16:51:54] <akosiaris>	 2 boxes in b and 1 box per a, c, d 
[16:54:08] <volans>	 let me dig a bit...
[16:55:06] <robh>	 ahh i also have one in e
[16:55:08] <robh>	 dumpsdata1006
[16:55:11] <robh>	 so lots of pxe fails
[16:55:18] <robh>	 i see it hit install1003 and the syslog shows it
[16:55:21] <robh>	 but it never makes it back to the host
[16:55:26] <robh>	 and host times out
[16:57:50] <robh>	 volans: so its all hosts not just cloud
[16:57:53] <robh>	 if that helps for troubleshooting
[16:58:00] <robh>	 well, not 'all' but rephrase
[16:58:02] <robh>	 not just wmcs hosts
[16:58:05] <volans>	 not sure
[16:58:21] <volans>	 there have been other reimages
[16:58:27] <robh>	 now mine is in row e though
[16:58:29] <volans>	 before
[16:58:32] <robh>	 but i imaged a system in row f before
[16:58:40] <robh>	 so yeah
[17:00:11] <akosiaris>	 so, I can see DHCPACK messages for all 5 kubernetes hosts in /var/log/messages, so that part must be working
[17:00:32] <herron>	 there's an open tftp service alert for install1003, not sure if related
[17:00:52] <akosiaris>	 there we go ^ 
[17:01:07] <volans>	 XioNoX: did you leave your debug atftp by any chance?
[17:01:20] <XioNoX>	 volans: nop
[17:01:26] <volans>	 ok checking
[17:02:32] <akosiaris>	 install1003 wmf-auto-restart: INFO: 2022-03-17 17:02:16,249 : Could not query the PID of atftpd: 1
[17:02:43] <akosiaris>	 perfect... atftpd became init :P
[17:02:54] <akosiaris>	 I thought that was emacs's job 
[17:03:16] <akosiaris>	 should I try a restart ? 
[17:03:22] <volans>	 I've just restarted it
[17:03:43] <akosiaris>	 ok, I guess I restart all 5 of my cookbooks, right ?
[17:04:09] <akosiaris>	 herron: thanks! that saved us quite some time
[17:04:16] <volans>	 if they failed yes, and please do it with 1~2 minutes apart from each other
[17:04:23] <herron>	 akosiaris: np!
[17:04:36] <robh>	 wait, pxe fixed?
[17:04:56] <volans>	 coordinate between each other to avoid all reimages at the same time please
[17:05:11] <volans>	 they might fail the puppet run on alert1001 if too many at the same time
[17:05:17] <robh>	 ok folks i got 1 host to reimage but its going to fail a lot due to partman issues
[17:05:31] <robh>	 so if you want me to stall for 30min to let your normal installs go i can
[17:05:55] <robh>	 seems best, since most installs should just go, and i know mine will have recipe issues to troubleshoot
[17:05:58] <akosiaris>	 volans: what's the threshold?
[17:06:05] <akosiaris>	 I mean is 5 or 6 a lot ?
[17:06:31] <akosiaris>	 cause I got 5 and rob has 1, it sure doesn't sound like a lot of reimages
[17:06:58] <volans>	 akosiaris: puppet run on alert1001 takes ~1 minute and the run-puppet-agent run by the reimage has some more retries
[17:07:01] * volans checking
[17:09:02] <volans>	 we run it with --attempts 30
[17:10:21] <volans>	 each attempt has a sleep 10
[17:10:58] <volans>	 akosiaris: I usually suggest to have 1~2 minute delay between the start of the reimage
[17:11:15] <volans>	 then you can pile many
[17:11:23] <akosiaris>	 ok, that's what I did I guess
[17:11:26] <_joe_>	 volans: you should add poolcounter support to reimaging
[17:11:31] <akosiaris>	 yes, that ^
[17:11:35] <akosiaris>	 lol
[17:11:43] <_joe_>	 we have python-poolcounter :)
[17:11:46] <volans>	 last puppet run on alert1001 start 17:48:11 - 17:50:35
[17:11:59] <volans>	 previous one 17:18:36 - 17:21:11
[17:12:43] <volans>	 _joe_: you know I have plans to add locks for cookbooks integrated into spicerack... having time
[17:13:12] <volans>	 I don't like the poolcounter approach because it's dc-local, but we can discuss details once we have time to work on it
[17:15:45] <volans>	 akosiaris: you know the fun part? atftpd was ok from sysctl PoV so the systemd alert was not firing
[17:16:09] <akosiaris>	 yes, cause ExecStart is calling /etc/init.d/atftpd
[17:16:14] <akosiaris>	 ew
[17:21:13] <robh>	 volans: seems i still have same issue
[17:21:30] <robh>	 dumpsdata1006 spinning, install1003 sending it dhcp response
[17:21:32] <robh>	 but not hitting the server?
[17:23:56] <XioNoX>	 robh: it's in the new rows, let me have a look
[17:24:03] <robh>	 its new row yes
[17:24:15] <XioNoX>	 robh: do you have the IP or hostname it sends the dhcp response to?
[17:24:24] <volans>	 I don't see incoming dhcp trquests on tcpdum
[17:24:24] <robh>	 Mar 17 17:24:13 install1003 dhcpd[20555]: DHCPOFFER on 10.64.130.3 to e4:3d:1a:ae:59:c8 via 10.64.130.1
[17:24:24] <robh>	 Mar 17 17:24:17 install1003 dhcpd[20555]: DHCPDISCOVER from e4:3d:1a:ae:59:c8 via 10.64.130.1
[17:25:16] <robh>	 that is the mac of the host checked via idrac
[17:25:18] <XioNoX>	 robh: yeah I know what's going on, one sec
[17:25:32] <robh>	 cool!
[17:26:16] <robh>	 sorry i just have 'fight with new raid controller' as my high priority item and the installer tests are one of those things hehe
[17:26:26] <robh>	 hence my pestering
[17:28:19] <XioNoX>	 robh: try again
[17:28:51] <robh>	 trying
[17:34:57] <_joe_>	 jayme / elukey / akosiaris the new bullseye k8s hosts fail to download the mediawiki images in eqiad it seems
[17:35:02] <_joe_>	 with "auth required"
[17:35:22] <jayme>	 needs puppet run on registry
[17:35:48] <jayme>	 there is this rant in the nginx config...
[17:35:56] <robh>	 XioNoX: works
[17:35:57] <_joe_>	 ahhhh lol
[17:36:02] <_joe_>	 jayme: by me or you?
[17:36:16] <jayme>	 fsero IIRC :-)
[17:36:25] <jayme>	 we did nothing wrong 
[17:36:30] <XioNoX>	 robh: nice :)
[17:36:36] <_joe_>	 ahah
[17:37:13] <_joe_>	 akosiaris / elukey please do not reimage nodes during deployment windows then, or if you do so run puppet on the registry before adding the node to the cluster
[17:37:54] <jayme>	 this is ofc. only relevant for *new* nodes. Not for reimaging
[17:38:01] <_joe_>	 yes
[17:38:49] <_joe_>	 jayme: maybe we need to make this clear in the docs?
[17:40:08] <jayme>	 I guess that makes sense...
[17:40:53] <jayme>	 _joe_: I'm receiving this as an instruction :-p
[17:41:28] <_joe_>	 jayme: if you prefer to deal with yet another stuck deployment instead, :P
[17:42:14] <jbond>	 fyi all i just pushed out a change which is adding some junk to shell logins fo bash.  fixing now
[17:42:24] <jayme>	 nono, please go ahead :)
[17:43:15] <_joe_>	 jayme: did you run puppet on the registries by any chance?
[17:43:24] <jayme>	 nope
[17:43:45] <_joe_>	 ack, doing so
[17:50:29] <jayme>	 hm...guess what. There is no easy way of preventing this as the nodes join the cluster automatically on reimage
[18:13:50] <akosiaris>	 what's all those declare -x I see when I login on deploy1002?
[18:13:55] <akosiaris>	 I was about to cordon those nodes
[18:15:22] <jbond>	 akosiaris: fix is in place in puppet just needs a pupet run to fix
[18:15:40] <akosiaris>	 jbond: cool, thanks
[18:16:55] <jbond>	 i think puppet shouuld finish rolling this out by 18:30 so please ping if people are still seeing it after that
[18:18:40] <akosiaris>	 it's already ok on deploy1002
[18:19:21] <akosiaris>	 kubernetes1018-22 cordoned now, will do a dry-run tomorrow and pool them into service
[18:22:47] <elukey>	 ack!
[18:24:22] <robh>	 ahh cool
[18:24:31] <robh>	 i just saw it on a host and wondered whats up