[04:54:18] Somebody in the Chinese Wikipedia community managed to build a modified Wikipedia Android app which sends fake SNIs (a.k.a. domain fronting) to evade GFW's SNI RST. I want to ask whether such measurements are supported or not
[07:05:16] hello folks, I am going to move kafka-logging1001 to pki in a few
[07:17:47] Kafka restarted on 1001, it seems to be recovering as expected. If you see clients misbehaving please let me know :)
[07:18:20] (in theory we should have all clients covered and already connecting to kafka logging with the wmf bundle containing root pki cert + puppet root ca cert)
[07:31:19] --
[07:31:47] if anybody has a minute: https://phabricator.wikimedia.org/T319261#8282012
[07:32:04] I think that doing a roll restart of the eventgate-logging-external pods should be enough
[07:32:57] <_joe_> elukey: wait, eventgate doesn't refresh the schemas at runtime??
[07:33:28] _joe_: not eventgate-logging-external, IIRC there were some issues with it
[07:33:42] <_joe_> ok roll restart then
[07:33:47] <_joe_> it surely doesn't harm
[07:33:53] <_joe_> do you need me to do it?
[07:33:58] nono will do it
[07:34:00] thanks :)
[07:38:52] mmm nope, the roll restart didn't work
[07:39:09] <_joe_> brb
[07:41:49] then it is bundled with the chart
[07:44:36] err sorry, in the docker image
[07:49:34] <_joe_> yeah I was supposing that was the case
[07:49:47] <_joe_> so I guess a commit is needed
[07:50:02] <_joe_> is there a bug about the issues with this?
[07:50:23] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/625966/1/.pipeline/blubber.yaml
[07:50:30] yeah this is the last one
[07:51:10] <_joe_> oh I see
[07:58:24] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067 should work in theory
[08:52:19] (going afk for some errands, bbl)
[09:03:35] <_joe_> btullis: ^^ it's a UBN on eventgate-logging-external, can you take a look?
[09:04:11] I'm already on it, but I have a gerrit access issue. I can't +2 this change: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067
[09:04:37] Could someone from the Gerrit Managers group please assist? https://gerrit.wikimedia.org/r/admin/groups/93b1e277b72d0e0a883afbc0a87948dd6dd0d7b7,members
[09:05:05] The procedure I'm following is this: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate-wikimedia_repository_change
[09:05:57] looking
[09:06:08] ...but the permissions on that repo currently won't let me merge the change and proceed, unless I'm much mistaken. Thanks taavi.
[09:08:07] btullis: +2'd. and please file a task or something to figure out a better owner than gerrit-managers for that repo
[09:08:24] taavi: Many thanks. Will do.
[09:08:40] <_joe_> sigh, thanks taavi
[09:08:57] <_joe_> btullis: you can blame the fact that andrew is a gerrit manager for that, I guess :P
[09:26:46] ahoyhoy, I'll be doing the rest of the sessionstore upgrades this morning
[09:30:39] <_joe_> hnowlan: cool
[09:30:54] _joe_: I try not to apportion blame too much, unless it's to myself :-)
[09:31:28] deployment-charts CR ready for eventgate: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838107
[09:31:31] <_joe_> btullis: well I wasn't even blaming andrew for once
[09:32:15] <_joe_> +1'd
[09:57:10] This is now deployed and I believe that it's fixed. Awaiting confirmation. Thanks all.
[10:39:45] did anybody recently do a pbuilder build with network access (one that worked) on build2001? I can't seem to get what used to work fine on good old deneb working here
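For the pbuilder question above: a minimal sketch of enabling network access in a stock pbuilder/cowbuilder setup. This assumes the plain pbuilderrc mechanism rather than whatever wrapper build2001 uses, and the package filename is only an illustration:

    # in ~/.pbuilderrc (or the config file your build wrapper points at):
    # allow network access inside the build chroot
    USENETWORK=yes

    # example build picking up that config (filename is made up)
    sudo pbuilder build --configfile ~/.pbuilderrc mypackage_1.0-1.dsc

If the build host uses a site-specific wrapper, the equivalent knob may live elsewhere, so treat this as a starting point only.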
[10:41:35] <_joe_> jayme: nope
[10:46:52] _joe_: did you manage to tie the session losses to a particular deployment yesterday?
[10:51:23] <_joe_> hnowlan: no, tbh
[10:51:31] <_joe_> I didn't spend enough time on it though
[10:52:58] <_joe_> but yeah, nothing of the sort
[10:53:35] <_joe_> so we need to tie one of these session losses to an http request
[10:53:42] <_joe_> I suspect it's some bot
[10:54:00] <_joe_> would you care to open a task?
[10:54:51] yep, will do
[10:59:14] volans: hey, I'm getting a weird cookbook exception
[10:59:19] when reimaging a host
[10:59:23] https://www.irccloud.com/pastebin/wtrbtIkJ/
[10:59:47] figured I'd let you know before anything else
[11:02:28] arturo: it's saying that a config file for the DHCP already exists; that usually means a reimage is already in progress, or it was killed midway in a bad way (kill -9 or double ctrl+c) without allowing it to clean up the file
[11:02:53] the manual cleanup should be on install1003?
[11:02:56] if you're sure there is no one else doing the same reimage you can safely delete the file from the install host and restart the cookbook
[11:03:23] codfw yes, correct, in /etc/dhcp/automation/..
[11:04:42] done!
[11:04:49] https://www.irccloud.com/pastebin/dJjqmg35/
[11:06:30] thanks volans, the reimage is now proceeding
[11:08:41] great
[11:16:17] I'm trying a reimage and it's looping on "Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb raised: Nagios_host resource with title sessionstore2001 not found yet"
[11:16:41] retried the reimage after the last one timed out and I'm seeing the same. I think I've seen this happen to others before, any ideas?
[11:19:33] hnowlan: in the past that happened when we had issues with puppetdb being slow to replicate to codfw
[11:19:36] let me check
[11:20:28] thanks
[11:20:50] cc jbond for additional ideas
[11:21:03] latency is not ideal, but not too horrible either
[11:21:04] https://grafana.wikimedia.org/d/000000477/puppetdb?orgId=1&viewPanel=7
[11:24:01] mmmh I don't see any mention of sessionstore2001 in the puppetdb logs after 9:34 UTC on puppetdb2002
[11:24:43] and just the 2 deactivates on puppetdb1002 at 9:40 and 10:56... weird
[11:25:03] jbond: do you know if anything happened on puppet that maybe doesn't store the report anymore on NOOP? did we change anything related?
[11:25:07] on the puppet side
[11:25:34] I think robh was possibly having the same issue yesterday
[11:25:44] (along with other, unrelated ones)
[11:26:11] volans: just looking now. sessionstore2001 has not run puppet since the re-image, it's getting an error
[11:26:14] Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=puppet]
[11:26:18] Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked): [certificate revoked for /CN=puppet]
[11:26:53] does it have the puppet CA?
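For reference, the stale-DHCP cleanup volans describes above (11:02-11:04), turned into commands. This is only a sketch: the exact snippet path under /etc/dhcp/automation/ depends on the host being reimaged, and cloudnet1005 here is just an example hostname from this log.

    # on the install host named in the cookbook error, after confirming nobody
    # else is reimaging the same host:
    sudo find /etc/dhcp/automation/ -name '*cloudnet1005*'
    # remove whatever stale snippet find printed; this path is made up:
    sudo rm /etc/dhcp/automation/some-subdir/cloudnet1005.conf
    # then re-run the same sre.hosts.reimage cookbook invocation from the cumin host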
[11:27:08] looking now
[11:27:40] I got a similar thing to what hnowlan is reporting:
[11:27:42] [27/50, retrying in 81.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title cloudnet1005 not found yet
[11:29:49] I wonder if the base image has changed in some way. arturo, hnowlan: which OS are you installing?
[11:29:58] bullseye
[11:30:13] I have 2 instances of the behavior now, same for the reimage of cloudnet1006
[11:31:46] ack
[11:32:06] volans: buster
[11:33:36] jbond: is Puppet_Internal_CA.pem supposed to already exist on the d-i image? I don't recall when that's added
[11:34:11] volans: Puppet_Internal_CA.pem is not used by puppet; it, afaik, uses /var/lib/puppet/ssl/certs/ca.pem
[11:34:21] fyi I have checked using openssl and seem to get connected
[11:34:25] openssl s_client -CAfile /var/lib/puppet/ssl/certs/ca.pem -cert /var/lib/puppet/ssl/certs/sessionstore2001.codfw.wmnet.pem -key /var/lib/puppet/ssl/private_keys/sessionstore2001.codfw.wmnet.pem puppet:8140
[11:35:22] right, it uses the other path and that exists
[11:36:49] let me know if I can be of any help, debugging, etc. I tried accessing the console of cloudnet1005/1006, but no root password has been set
[11:37:14] you can access via install_console from cumin/puppetmaster hosts
[11:37:52] that's what I tried, but there is no root_password set yet for the host
[11:38:10] https://www.irccloud.com/pastebin/GOVQSVCh/
[11:38:55] I'm in
[11:38:55] $ sudo install_console cloudnet1005.eqiad.wmnet
[11:39:14] -_-
[11:40:14] ok I was trying via `sudo install_console cloudnet1005.mgmt.eqiad.wmnet` and then `console com2`
[11:41:04] looks like sessionstore2001 is unblocked, but was that a manual action?
[11:41:17] I bet it was john :D
[11:41:22] I'm running puppet manually on sessionstore
[11:41:36] but I still haven't worked out the underlying issue
[11:42:25] jbond: what's the recommended manual step to unblock it?
[11:43:02] arturo: manually remove the ssl dir, i.e. on the server itself run
[11:43:11] rm -rf /var/lib/puppet/ssl
[11:43:25] then on the puppetmaster run sudo puppet cert clean $server, e.g.
[11:43:32] sudo puppet cert clean cloudnet1005.eqiad.wmnet
[11:43:47] then run puppet on the agent with wait, e.g.
[11:43:52] puppet agent -t -w 1
[11:43:57] then sign on the puppetmaster
[11:44:05] sudo puppet cert sign cloudnet1005.eqiad.wmnet
[11:44:10] right, the usual dance then
[11:44:26] thanks!
[11:44:34] if the cookbook is still polling then we should run the noop again, but it's a bit messy
[11:44:39] yes and no problem
[11:44:43] please don't just run puppet and assume that the reimage was successful
[11:44:50] the reimage does a lot of things after the first puppet run
[11:45:05] volans: indeed, I'll want to rebuild sessionstore properly
[11:45:17] the cloudnet hosts are still configured as insetup so I think they should be ok?
[11:45:22] volans: ok! then I'll let the cookbook time out
[11:46:38] the cookbook just told me the first puppet run failed on sessionstore2001, but there's a run still underway
[11:47:08] yeah, it probably timed out while waiting for the lock from the manual run
[11:47:08] should I retry?
[11:47:18] wait for the manual one to complete
[11:48:30] cool
[11:52:22] hnowlan: feel free to retry when things are done, however I don't think the issue is fixed
[11:52:34] but I think/hope I'll be able to reproduce it on sretest
[11:52:46] fyi all, I'm going to rebuild sretest :P
[11:53:16] go for it! :D
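For reference, the manual cert recovery jbond walks through above (11:43-11:44), collected into one sequence. cloudnet1005.eqiad.wmnet is just the example host from the log, and as volans notes this only unblocks puppet; it does not replace a proper reimage.

    # on the broken agent: throw away its SSL state
    sudo rm -rf /var/lib/puppet/ssl

    # on the puppetmaster: clean the old cert for the host
    sudo puppet cert clean cloudnet1005.eqiad.wmnet

    # back on the agent: request a new cert and poll until it is signed
    # (-w 1 is short for --waitforcert 1)
    sudo puppet agent -t -w 1

    # on the puppetmaster: sign the new request
    sudo puppet cert sign cloudnet1005.eqiad.wmnet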
[11:54:01] jbond: cool, thanks
[11:54:11] I'll also stand by
[12:00:19] arturo: hnowlan: just an fyi that I'm going to grab some lunch and will pick this back up when I'm back
[12:00:29] 👍
[12:01:36] I'm tailing the logs of the sretest reimage to see if I can spot anything
[12:03:35] jbond: cool, thanks!
[12:03:58] as it stands, if this is successful can I continue with this host or do we need to do a full reimage?
[12:04:35] it's currently checking icinga service health and I don't want to pool the host if we're gonna have to reimage
[12:05:06] the reimage doesn't repool it automatically, so it's safe to let it finish
[12:05:43] I guess it depends on what the issue is; a clean reimage is always better, but if the cookbook is able to complete with the manual unblock that would also be ok for me
[12:10:56] seems like it has been successful (despite an expected icinga fail), the manual step worked
[12:11:00] and ofc it's not reproing on sretest1001
[12:11:13] went through and it's doing the puppet run
[12:11:27] cc jbond (for when you're back)
[12:34:32] ack, thanks. looking at cloudnet1006 which is still in a borked state
[12:35:18] both cloudnet1005/1006 are in the same state
[12:35:48] on 1005 the ssl dir was deleted though
[12:36:11] arturo: I think you can try rebuilding 1005, it's possible this was a transient issue as sretest and sessionstore have all rebuilt successfully
[12:36:47] volans: yes, sorry, I started the manual cleanup dance but stopped. I'll do a full reimage now
[12:38:35] ahh, I had hoped to keep 1006 to test :(
[12:42:51] oh, sorry!
[12:43:01] jbond: feel free to take it over. I can stop the reimage, no problem
[12:43:57] arturo: I imagine it's already gone too far, I'll just wait to see if it comes about again
[12:44:12] jbond: ok... sorry!
[12:44:38] no problem
[13:46:07] just as a heads-up: vgutierrez and I will be upgrading to ATS9 on all cp hosts in eqsin, esams, eqiad today. no impact expected and the caches should be preserved. see T309651
[13:46:09] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651
[13:46:34] please let us know if you see something broken :)
[13:49:31] depooling codfw sessionstore for a minute
[13:57:31] jbond, volans: am I safe to (and is there anything useful I can do if I) reimage more sessionstore hosts?
[13:58:09] hnowlan: please go ahead. we were unable to see any issues or recreate it, so the only thing is: if you hit an error give me a ping so I can look further
[13:58:20] jbond: will do, thanks
[13:58:27] cheers
[14:06:29] bblack: same applies for your reimage, for the puppet cert issue
[14:06:53] some host hit that again this morning, but then succeeded when trying to repro it shortly afterwards
[14:07:03] ping us if you hit it again
[15:10:39] reimaging sessionstore2002 worked without issue
[15:47:02] hnowlan: thanks for the update, now we'll never know :/
[18:18:12] having lookups in ./hieradata itself, foo: '%{lookup('profile::something::else')}'. Are they just fine in some use cases or do we hate them in general as anti-patterns?
[18:21:59] mutante: I have used it quite freely (see hieradata/role/common/durum.yaml) but now I am interested in knowing if that was right or wrong too :)
[18:22:28] ack, thx. it's more like a survey :)
[18:22:41] there is also %{alias()}
[18:23:06] good point
[18:27:18] sukhe: I see, in the middle of an Icinga check_http command, heh. it's kind of nice though that it's not even done in puppet, let alone in the Icinga config itself, to generate all the check commands
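On the hieradata-interpolation question being discussed here: one way to sanity-check what such an interpolated key resolves to for a given host is puppet's own lookup CLI. The key and node below are made up for illustration; depending on the setup you may also need --compile for full compilation context. Worth remembering that "%{alias('some::key')}" preserves the looked-up value's data type (it must be the entire value), while "%{lookup('some::key')}" always interpolates into a string.

    # run on a puppetmaster; --explain shows which hiera layer and interpolation won
    sudo puppet lookup profile::durum::check_cmd --node durum1001.eqiad.wmnet --explain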
[18:38:03] mutante: yeah, I think that's because of how we structured the bird module, which is where the check_cmd comes from
[18:45:24] ACK, makes sense
[19:17:20] for those of you who have been curious about jupyter + pyspark in our analytics environment, a little sample for you at https://phabricator.wikimedia.org/F35546836
[19:38:10] cdanis: nice! yesterday I was thinking that it would be nice to have, either on-demand or always-on, an easy view with a bunch of top_n/sum_n stats over 1m windows that can be easily aggregated over larger windows (yes, even if the data is partial it can still give quite some signal if N is not too small) from the live data (either raw or sampled)
[19:40:04] and potentially we could even alert on some of them...
[19:40:08] thoughts?
[20:08:49] volans: I think that's quite reasonable, although really streaming druid/turnilo is the 'right' way to do that
[20:09:59] indeed, that's one of the options I had in mind
[20:10:01] basically I think how much effort to put into that depends on when we think T314981 might be done
[20:10:02] T314981: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981
[20:10:12] yes
[20:12:10] also you reminded me to file some more tasks re: turnilo :)
[20:12:16] :D
[21:20:01] I'm staring at some deep stupid puzzles with some puppet circular dependency issues
[21:20:25] it seems like something has fundamentally changed since I last looked hard at some of this stuff
[21:20:32] but I can't put my finger on what changed where
[21:20:47] [in our puppet repo and a lot of the very basic/wide dependencies]
[21:21:03] but here's a good concrete example of the behavior change
[21:21:19] way back whenever, it used to be the case that even if some higher-level service things were failing to puppetize on the first run
[21:21:33] it would still make progress on a lot of other basic things, like installing base packages and creating user accounts, etc
[21:22:00] now it seems to consistently skip all kinds of basics due to "failed dependencies" that don't make any natural sense
[21:22:28] like:
[21:22:30] /Stage[main]/Base::Standard_packages/Package[colordiff] -> Skipping because of failed dependencies
[21:22:47] ^ this happened when the actual triggering failure was way off in bird/anycast-healthchecker stuff
[21:23:19] somehow the loop involves things like Class['Apt'], Exec['apt-get update'], the git package, etc
[21:23:27] it's very perplexing
[21:23:48] has anyone run into the edges of this or does it sound familiar?
[21:24:49] do you have a puppetboard link to quickly look? I might end up 301ing you anyway towards jb..ond
[21:25:46] https://phabricator.wikimedia.org/F35546970
[21:26:06] ^ I did a dot2png on the dep cycle (I reverted the change that created the cycle, which was just a dep between two Service[foo]
[21:26:09] )
[21:27:08] volans: when I get a dependency cycle, it doesn't get far enough to report
[21:27:27] Error: Found 1 dependency cycle:
[21:27:27] (Exec[apt-get update] => Class[Apt] => Class[Profile::Apt] => Class[Base::Standard_packages] => Package[git] => Exec[git_clone_/srv/authdns/git] => Git::Clone[/srv/authdns/git] => Exec[authdns-local-update] => Package[gdnsd] => Systemd::Service[gdnsd] => Service[gdnsd] => Service[pdns-recursor] => Service[anycast-healthchecker] => Systemd::Service[anycast-healthchecker] =>
[21:27:31] right
[21:27:33] Class[Bird::Anycast_healthchecker] => Class[Bird] => Apt::Package_from_component[bird2] => Apt::Repository[repository_bird2] => File[/etc/apt/sources.list.d/repository_bird2.list] => Exec[apt-get update])\nCycle graph written to /var/lib/puppet/state/graphs/cycles.dot.
[21:27:37] ^ that's what I got on the run that made the graph
[21:27:51] https://gerrit.wikimedia.org/r/c/operations/puppet/+/838266/1/modules/profile/manifests/dns/recursor.pp
[21:27:59] ^ that's the new dep I was trying to add, which completes the circle
[21:28:44] but again, it seems like something more fundamental is afoot. Lots of very base/system-level puppetization isn't being applied (skipped for failed dependencies) in ways that don't make sense, even without this new cycle
[21:28:57] https://puppetboard.wikimedia.org/report/dns4003.wikimedia.org/66eb726bd638c5912ac4fa3d3c9a87b06c6feb50
[21:29:09] ^ that's a failed run without the cycle-inducing change that's trying to fix things
[21:29:38] but how this infects base::standard_packages, making everything bail out and skip, is puzzling
[21:30:36] something to do with the bird module's use of apt::repository it seems, but I dunno
[21:30:51] my hunch is that it's because the bird module requires a component
[21:30:56] so a change in apt config files
[21:31:02] that requires an apt-get update
[21:31:35] yes
[21:31:40] that all seems sane
[21:32:06] but then why does Exec['apt-get update'] seem to "require" all of Class['Apt']? it's just one of the things executed within it
[21:32:31] and why does base::standard_packages require git, as opposed to just installing it?
[21:32:43] I'm sure there are answers to those questions, but they're opaque to me at present
[21:32:58] err sorry, I think I said that backwards
[21:33:25] it might be related to ensure_packages in some way
[21:33:50] yeah I think so
[21:34:05] so the arrows point backwards for my brain, making it extra confusing
[21:35:52] but so apt::repository does a notify to apt-get update (makes sense)
[21:36:27] yeah sorry, too late here to have a proper look right now
[21:36:33] yeah, me too I think
[21:37:00] I'll poke jb tomorrow, maybe he has some idea. It can't have been *that* long since we reimaged one of these from scratch
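For reference, when puppet reports a dependency cycle it writes the graph mentioned in the error above to the state directory; the "dot2png" step bblack refers to is roughly the following (needs graphviz installed, output path is just an example):

    # render the cycle graph puppet wrote during the failed run
    dot -Tpng /var/lib/puppet/state/graphs/cycles.dot -o /tmp/cycles.png

Running the agent with --graph should also dump the full resource/relationship graphs under the same state/graphs directory, which can help when the cycle error itself is hard to read.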