[07:21:24] gehel, dcausse : if you have time, here's my proposal on the whole KafkaClient/abstracting prefixes away: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723214 - I went with the approach to abstract minimal set of functionalities that require a prefix, so that the main functions don't have to deal with that [07:21:49] zpapierski: thanks!, looking [07:21:53] I'm not sure if this is a better solution, but at least the atrocious substring is gone [07:22:18] then it's the right solution :) [07:22:25] don't worry about the build, it will fail since I didn't really work on the style yet [07:23:03] I'm also thinking of dropping TopicPartition from KafkaClient interface [07:23:43] but I'm not sure if that's such a good idea, class itself is useful, I'd probably replace it with something similar [07:25:06] yes it's a simple data object, I agree that it'd be better to hide the underlying kafka-client implementations but if the new data class is exactly the same there's little benefit [07:45:24] The KafkaClient being passed around as a method parameter feels a bit weird to me. That instance looks more like configuration to me and intuitively I would like to have it as an instance variable. [07:45:33] But that's all not very clear in my head. [07:45:41] Good enough for me as it is! [07:49:37] gehel: KafkaClient needs to be initiated and closed per function call, because it contains KafkaConsumer. In this situation, I can't really prepare it before time and leave as a configuration. I could rework that aspect, but it would make the code more complex. [07:50:16] No need ! Good enough ! [07:50:40] I need to take care of something, might be 5m late to the graph consultant meeting [08:45:12] errand [10:32:05] gehel: do you have an idea what that message about merge fail is about - https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/723214 ? [10:44:39] gehel: nvm, I see it's a global issue [12:06:04] lunch [13:15:13] zpapierski: it self fixed after ~ 40 minutes. Somehow CI managed to get too many ssh connections with Gerrit :\ [13:31:33] gehel: meeting? [14:34:57] I swear that spicerack build process has a mind of its own - I can't explain different results from prospector otherwise :) [15:24:27] \o [15:48:39] o/ [17:37:16] ryankemper: looked into pcc with andrew, andrew found that common.yaml might not actually be loaded. Looking through it seems plausible, can test with this patch: https://gerrit.wikimedia.org/r/c/labs/private/+/725049 [18:26:17] ebernhardson: thanks will test it. I wonder if that means we never needed the previous block [18:27:18] ryankemper: we had it set to null in the prod hiera, so it "works" even without loading that. But that should have let us see pcc against wcqs-beta-01 render the creds [18:27:29] which i thought i did check..hmm, running a new pcc to see if it gets loaded there :) [18:30:51] yup, the prod catalog here loads the oauth_settings appropriately: https://puppet-compiler.wmflabs.org/compiler1003/31402/wcqs-beta-01.wikidata-query.eqiad.wmflabs/index.html [18:31:09] so it *is* referenced, but maybe only against .wmflabs and not .wmnet? Hmm, have to find the hiera config's for them [18:31:16] ryankemper: you seem to be pairing with Erik already, should we cancel our pairing session? [18:31:53] gehel: yeah let's cancel [18:32:03] cool! [18:33:56] ebernhardson: any idea what would determine if it gets referenced for `.wmflabs` but not `.wmnet`? (just merged the patch btw so testing a new PCC run now) [18:34:38] ryankemper: puppet has a hiera.yaml on the puppetmaster that declares the load-order for variables. I'm pretty sure prod and labs have different files (found in modules/puppetmaster/files/*.hiera.yaml) [18:34:50] but i dont see how that gets to puppet-compiler, they aren't mentioned in it's catalog [18:35:52] and the labs.hiera.yaml explicitly has 'common.yaml' in it, which production.hiera.yaml doens't include. So i suppose that explains it [18:35:57] Seems like a footgun ;P [18:36:39] it does indeed [18:40:04] From here though, i don't see anything in the labs side that would load the new location i have in the patch [18:40:28] since we are trying to get both wcqs prod and wcqs-beta to see this variable in pcc [18:40:59] maybe we just add it twice and call it good enough, might be reasons they don't overlap that aren't immediately obvious [18:41:29] Yeah that sounds reasonable to me [18:44:18] hmm, something else is going on here that still isn't clear. labs/private has hieradata/profile/*, but 'profile' isn't mentioned anywhere in either hiera.yaml [18:47:22] magic via wmflib::expand_path. afaict that takes the request for `profile::query_service::oauth_secret_key` pop's the final bit and then opens profile/query_service.yaml [18:47:37] * ebernhardson feels like he probably looked that up before and forgot :P [18:48:13] but labs.hiera.yaml doesn't use wmflib::expand_path, only the prod side, so we probably need it wice [18:48:16] *twice [18:49:05] ebernhardson: https://gerrit.wikimedia.org/r/c/labs/private/+/725084 [18:49:10] we should only need the secret and not the settings right? [18:50:09] ryankemper: right [18:57:24] ebernhardson: https://puppet-compiler.wmflabs.org/compiler1003/31406/wcqs2001.codfw.wmnet/change.wcqs2001.codfw.wmnet.err [18:57:25] > Error: Evaluation Error: Error while evaluating a Resource Statement, Profile::Query_service::Blazegraph[wcqs-blazegraph]: parameter 'oauth_settings' references an unresolved type 'Wmflib::HTTPSUrl' (file: /srv/jenkins-workspace/puppet-compiler/31406/change/src/modules/profile/manifests/query_service/wcqs.pp, line: 39) on node wcqs2001.codfw.wmnet [18:58:36] new error at least [18:58:40] so will this not work until we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/724829? [18:58:53] oh wait pcc should take that into account since im running it on that patch, nvm [18:59:12] this part's a bit suspicious: [18:59:17] > Dropping the type specificity to Hash at the profile level should be acceptable as it will be checked as an OAuthSettings type when passed on. [19:07:12] ryankemper: hmm, i thought i fixed the Wmflib::HTTPSUrl, it should be Stdlib::HTTPSUrl [19:07:19] but i guess not :) [19:07:54] ah, will add that to the patch [19:12:29] ebernhardson: https://puppet-compiler.wmflabs.org/compiler1003/31411/wcqs2001.codfw.wmnet/index.html woot that fixed it [19:13:53] ryankemper: great! thats looking appropriate [19:14:43] ryankemper: although, i think it might still need profile::query_serivice::oauth set to true, not seeing the nginx config in there [19:15:34] ebernhardson: Yeah we need both those changes in one last puppet patch and we should be good at that point I think [19:16:28] ebernhardson: for the nginx config, are we envisioning just having like a `profile::query_service::redirect_url`? or can we construct it based off `@oauth_settings['oauth_index_url'] ` or something [19:19:12] ryankemper: thats https://gerrit.wikimedia.org/r/c/operations/puppet/+/724830 I put the url in the oauth settings, but then pass only the url to the part needed. It's kinda indirect, but that allows us to avoid manually checking if everything is defined by delegating to the OAuthSettings type [19:22:50] i suppose we can remove the optional's around oauth stuff there soon too, since it was only there to allow setting up the servers prior to having this all figured out [19:23:03] ebernhardson: meh so the `labs/private` stuff is correct but the actual `profile::query_service::oauth_consumer_secret` isn't getting set right now, so puppet is failing on wcqs [19:23:51] ryankemper: wcqs-beta or wcqs prod? [19:24:02] ebernhardson: wcqs prod [19:24:17] I think we just need an entry in a `hieradata/common/wcqs/public.yaml` or similar [19:24:22] ryankemper: hmm, so that would imply it's not finding it in the puppet private repo? [19:24:50] ebernhardson: oh right it's the secret not being found [19:25:12] ebernhardson: I see, I need to update `/srv/private` because it's still got the old way with the secret embedded in the oauth_settings I think [19:25:12] i suppose refer back to production.hiera.yaml [19:25:22] yea [19:26:47] ebernhardson: okay patching `/srv/private` such that `hieradata/role/common/wcqs/public.yaml` only contains `profile::query_service::oauth_consumer_secret: REDACTED` [19:28:52] okay that did it [19:31:16] great [19:46:11] ebernhardson: probably makes sense to flip oauth to `true` in https://gerrit.wikimedia.org/r/c/operations/puppet/+/724830/ as well right? (I just pushed a patch to that effect) [19:46:44] ryankemper: yea might as well do it all together. I'm probably creating a few more patches than necessary [19:51:56] https://puppet-compiler.wmflabs.org/compiler1001/31420/wcqs2001.codfw.wmnet/index.html looks good, merging [19:53:33] ebernhardson: interesting now wcqs can run puppet but wdqs can't, `Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Function lookup() did not find a value for the name 'profile::query_service::oauth_settings' (file: /etc/puppet/modules/profile/manifests/query_service/gui.pp, line: 1) on node wdqs1003.eqiad.wmnet` [19:54:02] ryankemper: ohh, hmm. [19:54:05] Ah we only set `profile::query_service::oauth_settings` for wcqs currently [19:54:23] Should we default it to `{}` in common or in a wdqs-specific hieradata? [19:55:25] ryankemper: i'd go with defaults i suppose, although i never know why :) [19:56:04] I like the idea of needing to explicitly override it to activate the oauth stuff [19:56:57] ah `hieradata/common.yaml` wouldn't work, it won't look up stuff with `::` apparently [19:57:11] oh I guess we should just set the default in the lookup command [19:59:52] yea that seems sane to me [20:04:27] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/725104 seem right? running pcc now [20:05:38] ryankemper: i think only in profile::query_service::gui, i guess it has to be a followup but the idea was to make oauth non-optional in the config for wcqs [20:07:14] i guess no reason it has to be another patch [20:07:50] ebernhardson: so should I also get rid of the default for wcqs? or do we want it but we just want to remove the Optional [20:07:50] the difference is that ::gui is used by both, so has to be optional. ::wcqs should only be used on wcqs and that should always have oauth [20:08:16] ryankemper: i think we remove the optional and the defaults for ::wcqs, so it always has to be configured [20:12:41] lgtm [20:14:01] can't actually get pcc to run, `urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='integration.wikimedia.org', port=443): Read timed out. (read timeout=10)` [20:14:23] not sure if that means `integration.wikimedia.org` is having problems or something's wrong with the patch [20:14:53] intermittent CI timeouts, try again in 5 min :) [20:21:31] ebernhardson: hmm, https://puppet-compiler.wmflabs.org/compiler1002/31422/wdqs2002.codfw.wmnet/prod.wdqs2002.codfw.wmnet.err does the default not solve the lookup problem? [20:21:40] > Error: Function lookup() did not find a value for the name 'profile::query_service::oauth_settings' (file: /srv/jenkins-workspace/puppet-compiler/31422/production/src/modules/profile/manifests/query_service/gui.pp, line: 1) on node wdqs2002.codfw.wmnet [20:23:17] the default should avoid that happening :S [20:25:36] my first thought was wrong PS maybe, but that run reports using b2123f02a2 which is PS5 as expected, [20:25:56] Yeah it's so bizarre [20:29:14] I tried asking in #sre to see if someone knows [20:29:39] ebernhardson: unrelatedly, we got `PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wcqs_443: Servers wcqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal` [20:30:06] I acked those, but I'm a bit confused what's going on there. Usually we get that error when pybal wants to depool a service failing its readiness checks, but it can't bc of the depool threshold [20:30:26] But at the time of the alert all 3 hosts were pooled so I'd think it should have been able to just depool `wcqs1003`, which I ended up doing manually without issue [20:30:32] suggests it doesn't like some nginx config i would guess, looking [20:30:41] (I also don't understand why `wcqs1003` is down in the first place but yeah) [20:31:10] readiness-probe returns 500, so pybal complaining is correct. As to why...looking :) [20:31:35] it's probably the first host to get the oauth config via puppet [20:32:27] ohhh...the readiness-probe just gets rewritten into a sparql query, that then requires auth. [20:32:42] Ah, so the probe is just gonna fail now [20:32:47] Hmm [20:32:55] so 1) auth probably doesn't work first try. big surprise :) 2) we have to find some way to skip the auth in this case [20:33:40] ebernhardson: as far as making wcqs happy for now, it prob makes sense to just flip `oauth` back to `false` while we figure out how we're skipping auth right? [20:33:47] ryankemper: yes i think so [20:35:31] Okay for now I'm reverting https://gerrit.wikimedia.org/r/c/operations/puppet/+/724830 which will fix the puppet runs on `wdqs` while I wait to hear back why `default_value` isn't doing what we expect, and will also end up flipping `oauth` back to `false` since that patch was where we set it to true as well [20:36:22] sounds good [20:40:37] well, auth_request can't be used in any kind of conditional. Perhaps the most direct would be to not rewrite readiness-probe through the typical path and instead give it it's own config block passing into blazegraph? [20:44:03] i dunno, thats signficant amounts of duplication though...hmm [20:45:22] the other option might be to have the /_check_auth location always return 200 for some class of requests. Not sure how to cleanly flag something for that though [20:49:43] Okay so the problem with the `default` was our misunderstanding PCC; what we were looking at was the current state of production, not how things would be after the change [20:50:37] So https://gerrit.wikimedia.org/r/c/operations/puppet/+/725110 was fine and I merged it, and then I just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/725110 to disable oauth on wcqs again [20:56:07] sounds good. I'm trying to learn a bit more about this and figure out what we need. I think there might be merit to having the config somehow flag /readiness-probe requests somehow, and then have auth ignored on those flags. But can't figure out yet how that can be done in a way that's only accessible internally [20:57:28] ack. yeah I'm a bit stumped on the 'only accessible internally' part as well [20:57:44] wcqs / wdqs are back in a good state, gonna go grab lunch, back in ~40 mins [20:57:48] kk [21:27:40] so stupid idea, can we match the query string? if $qs == 'expected' { return 200 } [21:28:01] i guess i should get an nginx install going again locally to test :P [21:54:06] ryankemper: the answer ends up being stupendously simple: https://gerrit.wikimedia.org/r/c/operations/puppet/+/725120 [22:22:24] i suppose it's noop because it requires the other patch that turns on oauth. Could squash them together? [22:38:42] Yeah I (manually) added the oauth true as well [22:38:44] Just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/725120 [23:07:08] so the next question would be why localhost:9999/oauth/check_auth returns 403 (forbidden) on wcqs-beta, but 404(not found) on prod... [23:13:18] i suppose an obvious difference, the command line reported by ps has mw-oauth-proxy-0.3.74.war as the final component. Suspect blazegraph requires a restart after puppet runs [23:16:34] yea, wcqs1001 had run puppet already and restarting now gives the 403 forbidden as expected [23:37:16] hmm, going to require a bit of debugging for oauth. tomorrow :) Also gotta check with Z on cross-server sessions and cache's should work [23:49:39] ack