[04:54:18] Somebody in the Chinese Wikipedia community managed to build a modified Wikipedia Android app which sends fake SNIs (a.k.a. domain fronting) to evade GFW's SNI RST. I want to ask whether such measurements are supported or not
[07:05:16] hello folks, I am going to move kafka-logging1001 to pki in a few
[07:17:47] Kafka restarted on 1001, it seems to be recovering as expected. If you see clients misbehaving please let me know :)
[07:18:20] (in theory we should have all clients covered and already connecting to kafka logging with the wmf bundle containing root pki cert + puppet root ca cert)
[07:31:19] --
[07:31:47] if anybody has a minute: https://phabricator.wikimedia.org/T319261#8282012
[07:32:04] I think that doing a roll restart of the eventgate-logging-external pods should be enough
[07:32:57] <_joe_> elukey: wait, eventgate doesn't refresh the schemas at runtime??
[07:33:28] _joe_: not eventgate-logging-external, IIRC there were some issues with it
[07:33:42] <_joe_> ok roll restart then
[07:33:47] <_joe_> it surely doesn't harm
[07:33:53] <_joe_> do you need me to do it?
[07:33:58] nono will do it
[07:34:00] thanks :)
[07:38:52] mmm nope, the roll restart didn't work
[07:39:09] <_joe_> brb
[07:41:49] then it is bundled with the chart
[07:44:36] err sorry, in the docker image
[07:49:34] <_joe_> yeah I was supposing that was the case
[07:49:47] <_joe_> so I guess a commit is needed
[07:50:02] <_joe_> is there a bug about the issues with this?
[07:50:23] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/625966/1/.pipeline/blubber.yaml
[07:50:30] yeah this is the last one
[07:51:10] <_joe_> oh I see
[07:58:24] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067 should work in theory
[08:52:19] (going afk for some errands, bbl)
[09:03:35] <_joe_> btullis: ^^ it's a UBN on eventgate-logging-external, can you take a look?
[09:04:11] I'm already on it, but I have a gerrit access issue. I can't +2 this change: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067
[09:04:37] Could someone from the Gerrit Managers group please assist? https://gerrit.wikimedia.org/r/admin/groups/93b1e277b72d0e0a883afbc0a87948dd6dd0d7b7,members
[09:05:05] The procedure I'm following is this: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate-wikimedia_repository_change
[09:05:57] looking
[09:06:08] ...but the permissions on that repo currently won't let me merge the change and proceed, unless I'm much mistaken. Thanks taavi.
[09:08:07] btullis: +2'd. and please file a task or something to figure out a better owner than gerrit-managers for that repo
[09:08:24] taavi: Many thanks. Will do.
[09:08:40] <_joe_> sigh, thanks taavi
[09:08:57] <_joe_> btullis: you can blame the fact that andrew is a gerrit manager for that, I guess :P
[09:26:46] ahoyhoy, I'll be doing the rest of the sessionstore upgrades this morning
[09:30:39] <_joe_> hnowlan: cool
[09:30:54] _joe_: I try not to apportion blame too much, unless it's to myself :-)
[09:31:28] deployment-charts CR ready for eventgate: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838107
[09:31:31] <_joe_> btullis: well I wasn't even blaming andrew for once
[09:32:15] <_joe_> +1'd
[09:57:10] This is now deployed and I believe that it's fixed. Awaiting confirmation. Thanks all.
[10:39:45] did anybody recently do a pbuilder build with network access (one that worked) on build2001? I can't seem to get what used to work fine on good old deneb working here
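For the pbuilder question above: a minimal sketch of enabling network access in a stock pbuilder/cowbuilder setup. This assumes the plain pbuilderrc mechanism rather than whatever wrapper build2001 uses, and the package filename is only an illustration:

    # in ~/.pbuilderrc (or the config file your build wrapper points at):
    # allow network access inside the build chroot
    USENETWORK=yes

    # example build picking up that config (filename is made up)
    sudo pbuilder build --configfile ~/.pbuilderrc mypackage_1.0-1.dsc

If the build host uses a site-specific wrapper, the equivalent knob may live elsewhere, so treat this as a starting point only.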
[10:41:35] <_joe_> jayme: nope
[10:46:52] _joe_: did you manage to tie the session losses to a particular deployment yesterday?
[10:51:23] <_joe_> hnowlan: no, tbh
[10:51:31] <_joe_> I didn't spend enough time on it though
[10:52:58] <_joe_> but yeah, nothing of the sort
[10:53:35] <_joe_> so we need to tie one of these session losses to an http request
[10:53:42] <_joe_> I suspect it's some bot
[10:54:00] <_joe_> would you care to open a task?
[10:54:51] yep, will do
[10:59:14] volans: hey, I'm getting a weird cookbook exception
[10:59:19] when reimaging a host
[10:59:23] https://www.irccloud.com/pastebin/wtrbtIkJ/
[10:59:47] figured I'd let you know before anything else
[11:02:28] arturo: it's saying that a config file for the DHCP already exists; that usually means a reimage is already in progress, or it was killed midway in a bad way (kill -9 or double ctrl+c) without allowing it to clean up the file
[11:02:53] the manual cleanup should be on install1003?
[11:02:56] if you're sure there is no one else doing the same reimage you can safely delete the file from the install host and restart the cookbook
[11:03:23] codfw yes, correct, in /etc/dhcp/automation/..
[11:04:42] done!
[11:04:49] https://www.irccloud.com/pastebin/dJjqmg35/
[11:06:30] thanks volans, the reimage is now proceeding
[11:08:41] great
[11:16:17] I'm trying a reimage and it's looping on "Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb raised: Nagios_host resource with title sessionstore2001 not found yet"
[11:16:41] retried the reimage after the last one timed out and I'm seeing the same. I think I've seen this happen to others before, any ideas?
[11:19:33] hnowlan: in the past that happened when we had issues with puppetdb being slow to replicate to codfw
[11:19:36] let me check
[11:20:28] thanks
[11:20:50] cc jbond for additional ideas
[11:21:03] latency is not ideal, but not too horrible either
[11:21:04] https://grafana.wikimedia.org/d/000000477/puppetdb?orgId=1&viewPanel=7
[11:24:01] mmmh I don't see any mention of sessionstore2001 in the puppetdb logs after 9:34 UTC on puppetdb2002
[11:24:43] and just the 2 deactivates on puppetdb1002 at 9:40 and 10:56... weird
[11:25:03] jbond: do you know if anything happened on puppet that maybe doesn't store the report anymore on NOOP? did we change anything related?
[11:25:07] on the puppet side
[11:25:34] I think robh was possibly having the same issue yesterday
[11:25:44] (along with other, unrelated ones)
[11:26:11] volans: just looking now. sessionstore2001 has not run puppet since the re-image, it's getting an error
[11:26:14] Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=puppet]
[11:26:18] Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=puppet]
[11:26:53] does it have the puppet CA?
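For reference, the stale-DHCP cleanup volans describes above (11:02-11:04), turned into commands. This is only a sketch: the exact snippet path under /etc/dhcp/automation/ depends on the host being reimaged, and cloudnet1005 here is just an example hostname from this log.

    # on the install host named in the cookbook error, after confirming nobody
    # else is reimaging the same host:
    sudo find /etc/dhcp/automation/ -name '*cloudnet1005*'
    # remove whatever stale snippet find printed; this path is made up:
    sudo rm /etc/dhcp/automation/some-subdir/cloudnet1005.conf
    # then re-run the same sre.hosts.reimage cookbook invocation from the cumin host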
[11:27:08] looking now
[11:27:40] I got a similar thing to what hnowlan is reporting:
[11:27:42] [27/50, retrying in 81.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title cloudnet1005 not found yet
[11:29:49] I wonder if the base image has changed in some way. arturo, hnowlan: which OS are you installing?
[11:29:58] bullseye
[11:30:13] I have 2 instances of the behavior now, same for the reimage of cloudnet1006
[11:31:46] ack
[11:32:06] volans: buster
[11:33:36] jbond: is Puppet_Internal_CA.pem supposed to already exist on the d-i image? I don't recall when that's added
[11:34:11] volans: Puppet_Internal_CA.pem is not used by puppet; it, afaik, uses /var/lib/puppet/ssl/certs/ca.pem
[11:34:21] fyi I have checked using openssl and seem to get connected
[11:34:25] openssl s_client -CAfile /var/lib/puppet/ssl/certs/ca.pem -cert /var/lib/puppet/ssl/certs/sessionstore2001.codfw.wmnet.pem -key /var/lib/puppet/ssl/private_keys/sessionstore2001.codfw.wmnet.pem puppet:8140
[11:35:22] right, it uses the other path and that exists
[11:36:49] let me know if I can be of any help, debugging, etc. I tried accessing the console of cloudnet1005/1006, but no root password has been set
[11:37:14] you can access via install_console from cumin/puppetmaster hosts
[11:37:52] that's what I tried, but there is no root_password set yet for the host
[11:38:10] https://www.irccloud.com/pastebin/GOVQSVCh/
[11:38:55] I'm in
[11:38:55] $ sudo install_console cloudnet1005.eqiad.wmnet
[11:39:14] -_-
[11:40:14] ok I was trying via `sudo install_console cloudnet1005.mgmt.eqiad.wmnet` and then `console com2`
[11:41:04] looks like sessionstore2001 is unblocked, but was that a manual action?
[11:41:17] I bet it was john :D
[11:41:22] I'm running puppet manually on sessionstore
[11:41:36] but I still haven't worked out the underlying issue
[11:42:25] jbond: what's the recommended manual step to unblock it?
[11:43:02] arturo: manually remove the ssl dir, i.e. on the server itself run
[11:43:11] rm -rf /var/lib/puppet/ssl
[11:43:25] then on the puppetmaster run sudo puppet cert clean $server, e.g.
[11:43:32] sudo puppet cert clean cloudnet1005.eqiad.wmnet
[11:43:47] then run puppet on the agent with wait, e.g.
[11:43:52] puppet agent -t -w 1
[11:43:57] then sign on the puppetmaster
[11:44:05] sudo puppet cert sign cloudnet1005.eqiad.wmnet
[11:44:10] right, the usual dance then
[11:44:26] thanks!
[11:44:34] if the cookbook is still polling then we should run the noop again, but it's a bit messy
[11:44:39] yes and no problem
[11:44:43] please don't just run puppet and assume that the reimage was successful
[11:44:50] the reimage does a lot of things after the first puppet run
[11:45:05] volans: indeed, I'll want to rebuild sessionstore properly
[11:45:17] the cloudnet hosts are still configured as insetup so I think they should be ok?
[11:45:22] volans: ok! then I'll let the cookbook time out
[11:46:38] the cookbook just told me the first puppet run failed on sessionstore2001, but there's a run still underway
[11:47:08] yeah, it probably timed out while waiting for the lock from the manual run
[11:47:08] should I retry?
[11:47:18] wait for the manual one to complete
[11:48:30] cool
[11:52:22] hnowlan: feel free to retry when things are done, however I don't think the issue is fixed
[11:52:34] but I think/hope I'll be able to reproduce it on sretest
[11:52:46] fyi all, I'm going to rebuild sretest :P
[11:53:16] go for it! :D
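For reference, the manual cert recovery jbond walks through above (11:43-11:44), collected into one sequence. cloudnet1005.eqiad.wmnet is just the example host from the log, and as volans notes this only unblocks puppet; it does not replace a proper reimage.

    # on the broken agent: throw away its SSL state
    sudo rm -rf /var/lib/puppet/ssl

    # on the puppetmaster: clean the old cert for the host
    sudo puppet cert clean cloudnet1005.eqiad.wmnet

    # back on the agent: request a new cert and poll until it is signed
    # (-w 1 is short for --waitforcert 1)
    sudo puppet agent -t -w 1

    # on the puppetmaster: sign the new request
    sudo puppet cert sign cloudnet1005.eqiad.wmnet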
[11:54:01] jbond: cool, thanks
[11:54:11] I'll also stand by
[12:00:19] arturo: hnowlan: just an fyi that I'm going to grab some lunch and will pick this back up when I'm back
[12:00:29] 👍
[12:01:36] I'm tailing the logs of the sretest reimage to see if I can spot anything
[12:03:35] jbond: cool, thanks!
[12:03:58] as it stands, if this is successful can I continue with this host or do we need to do a full reimage?
[12:04:35] it's currently checking icinga service health and I don't want to pool the host if we're gonna have to reimage
[12:05:06] the reimage doesn't repool it automatically, so it's safe to let it finish
[12:05:43] I guess it depends on what the issue is; a clean reimage is always better, but if the cookbook is able to complete with the manual unblock that would also be ok for me
[12:10:56] seems like it has been successful (despite an expected icinga fail), the manual step worked
[12:11:00] and ofc it's not reproing on sretest1001
[12:11:13] went through and it's doing the puppet run
[12:11:27] cc jbond (for when you're back)
[12:34:32] ack, thanks. looking at cloudnet1006 which is still in a borked state
[12:35:18] both cloudnet1005/1006 are in the same state
[12:35:48] on 1005 the ssl dir was deleted though
[12:36:11] arturo: I think you can try rebuilding 1005, it's possible this was a transient issue as sretest and sessionstore have all rebuilt successfully
[12:36:47] volans: yes, sorry, I started the manual cleanup dance but stopped. I'll do a full reimage now
[12:38:35] ahh, I had hoped to keep 1006 to test :(
[12:42:51] oh, sorry!
[12:43:01] jbond: feel free to take it over. I can stop the reimage, no problem
[12:43:57] arturo: I imagine it's already gone too far, I'll just wait to see if it comes about again
[12:44:12] jbond: ok... sorry!
[12:44:38] no problem
[13:46:07] just as a heads-up: vgutierrez and I will be upgrading to ATS9 on all cp hosts in eqsin, esams, eqiad today. no impact expected and the caches should be preserved. see T309651
[13:46:09] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[13:46:34] please let us know if you see something broken :)
[13:49:31] depooling codfw sessionstore for a minute
[13:57:31] jbond, volans: am I safe to (and is there anything useful I can do if I) reimage more sessionstore hosts?
[13:58:09] hnowlan: please go ahead. we were unable to see any issues or recreate it, so the only thing is: if you hit an error give me a ping so I can look further
[13:58:20] jbond: will do, thanks
[13:58:27] cheers
[14:06:29] bblack: same applies for your reimage, for the puppet cert issue
[14:06:53] some host hit that again this morning, but then succeeded when trying to repro it shortly afterwards
[14:07:03] ping us if you hit it again
[15:10:39] reimaging sessionstore2002 worked without issue
[15:47:02] hnowlan: thanks for the update, now we'll never know :/
[18:18:12] having lookups in ./hieradata itself, foo: '%{lookup('profile::something::else')}'. Are they just fine in some use cases or do we hate them in general as anti-patterns?
[18:21:59] mutante: I have used it quite freely (see hieradata/role/common/durum.yaml) but now I am interested in knowing if that was right or wrong too :)
[18:22:28] ack, thx. it's more like a survey :)
[18:22:41] there is also %{alias()}
[18:23:06] good point
[18:27:18] sukhe: I see, in the middle of an Icinga check_http command, heh. it's kind of nice though that it's not even done in puppet, let alone in the Icinga config itself, to generate all the check commands
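On the hieradata-interpolation question being discussed here: one way to sanity-check what such an interpolated key resolves to for a given host is puppet's own lookup CLI. The key and node below are made up for illustration; depending on the setup you may also need --compile for full compilation context. Worth remembering that "%{alias('some::key')}" preserves the looked-up value's data type (it must be the entire value), while "%{lookup('some::key')}" always interpolates into a string.

    # run on a puppetmaster; --explain shows which hiera layer and interpolation won
    sudo puppet lookup profile::durum::check_cmd --node durum1001.eqiad.wmnet --explain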
[18:38:03] mutante: yeah, I think that's because of how we structured the bird module, which is where the check_cmd comes from
[18:45:24] ACK, makes sense
[19:17:20] for those of you who have been curious about jupyter + pyspark in our analytics environment, a little sample for you at https://phabricator.wikimedia.org/F35546836
[19:38:10] cdanis: nice! yesterday I was thinking that it would be nice to have, either on-demand or always-on, an easy view with a bunch of top_n/sum_n stats over 1m windows that can be easily aggregated over larger windows (yes, even if the data is partial it can still give quite some signal if N is not too small) from the live data (either raw or sampled)
[19:40:04] and potentially we could even alert on some of them...
[19:40:08] thoughts?
[20:08:49] volans: I think that's quite reasonable, although really streaming druid/turnilo is the 'right' way to do that
[20:09:59] indeed, that's one of the options I had in mind
[20:10:01] basically I think how much effort to put into that depends on when we think T314981 might be done
[20:10:02] T314981: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981
[20:10:12] yes
[20:12:10] also you reminded me to file some more tasks re: turnilo :)
[20:12:16] :D
[21:20:01] I'm staring at some deep stupid puzzles with some puppet circular dependency issues
[21:20:25] it seems like something has fundamentally changed since I last looked hard at some of this stuff
[21:20:32] but I can't put my finger on what changed where
[21:20:47] [in our puppet repo and a lot of the very basic/wide dependencies]
[21:21:03] but here's a good concrete example of the behavior change
[21:21:19] way back whenever, it used to be the case that even if some higher-level service things were failing to puppetize on the first run
[21:21:33] it would still make progress on a lot of other basic things, like installing base packages and creating user accounts, etc
[21:22:00] now it seems to consistently skip all kinds of basics due to "failed dependencies" that don't make any natural sense
[21:22:28] like:
[21:22:30] /Stage[main]/Base::Standard_packages/Package[colordiff] -> Skipping because of failed dependencies
[21:22:47] ^ this happened when the actual triggering failure was way off in bird/anycast-healthchecker stuff
[21:23:19] somehow the loop involves things like Class['Apt'], Exec['apt-get update'], the git package, etc
[21:23:27] it's very perplexing
[21:23:48] has anyone run into the edges of this or does it sound familiar?
[21:24:49] do you have a puppetboard link to quickly look? I might end up 301ing you anyway towards jb..ond
[21:25:46] https://phabricator.wikimedia.org/F35546970
[21:26:06] ^ I did a dot2png on the dep cycle (I reverted the change that created the cycle, which was just a dep between two Service[foo]
[21:26:09] )
[21:27:08] volans: when I get a dependency cycle, it doesn't get far enough to report
[21:27:27] Error: Found 1 dependency cycle:
[21:27:27] (Exec[apt-get update] => Class[Apt] => Class[Profile::Apt] => Class[Base::Standard_packages] => Package[git] => Exec[git_clone_/srv/authdns/git] => Git::Clone[/srv/authdns/git] => Exec[authdns-local-update] => Package[gdnsd] => Systemd::Service[gdnsd] => Service[gdnsd] => Service[pdns-recursor] => Service[anycast-healthchecker] => Systemd::Service[anycast-healthchecker] =>
[21:27:31] right
[21:27:33] Class[Bird::Anycast_healthchecker] => Class[Bird] => Apt::Package_from_component[bird2] => Apt::Repository[repository_bird2] => File[/etc/apt/sources.list.d/repository_bird2.list] => Exec[apt-get update])\nCycle graph written to /var/lib/puppet/state/graphs/cycles.dot.
[21:27:37] ^ that's what I got on the run that made the graph
[21:27:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/838266/1/modules/profile/manifests/dns/recursor.pp
[21:27:59] ^ that's the new dep I was trying to add, which completes the circle
[21:28:44] but again, it seems like something more fundamental is afoot. Lots of very base/system-level puppetization isn't being applied (skipped for failed dependencies) in ways that don't make sense, even without this new cycle
[21:28:57] https://puppetboard.wikimedia.org/report/dns4003.wikimedia.org/66eb726bd638c5912ac4fa3d3c9a87b06c6feb50
[21:29:09] ^ that's a failed run without the cycle-inducing change that's trying to fix things
[21:29:38] but how this infects base::standard_packages, making everything bail out and skip, is puzzling
[21:30:36] something to do with the bird module's use of apt::repository it seems, but I dunno
[21:30:51] my hunch is that it's because the bird module requires a component
[21:30:56] so a change in apt config files
[21:31:02] that requires an apt-get update
[21:31:35] yes
[21:31:40] that all seems sane
[21:32:06] but then why does Exec['apt-get update'] seem to "require" all of Class['Apt']? it's just one of the things executed within it
[21:32:31] and why does base::standard_packages require git, as opposed to just installing it?
[21:32:43] I'm sure there are answers to those questions, but they're opaque to me at present
[21:32:58] err sorry, I think I said that backwards
[21:33:25] it might be related to ensure_packages in some way
[21:33:50] yeah I think so
[21:34:05] so the arrows point backwards for my brain, making it extra confusing
[21:35:52] but so apt::repository does a notify to apt-get update (makes sense)
[21:36:27] yeah sorry, too late here to have a proper look right now
[21:36:33] yeah, me too I think
[21:37:00] I'll poke jb tomorrow, maybe he has some idea. It can't have been *that* long since we reimaged one of these from scratch
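For reference, when puppet reports a dependency cycle it writes the graph mentioned in the error above to the state directory; the "dot2png" step bblack refers to is roughly the following (needs graphviz installed, output path is just an example):

    # render the cycle graph puppet wrote during the failed run
    dot -Tpng /var/lib/puppet/state/graphs/cycles.dot -o /tmp/cycles.png

Running the agent with --graph should also dump the full resource/relationship graphs under the same state/graphs directory, which can help when the cycle error itself is hard to read.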