[00:06:06] yup, looks happy from here [00:23:54] ebernhardson: okay we're in the `production` state and I'm rolling out the dns change now [00:24:14] the authdns update succeeded so just need to add entries to `modules/profile/files/configmaster/disc_desired_state.py` as the final step [00:25:25] Patch for that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/724545 [00:25:52] ryankemper: excellent [00:28:38] ryankemper: i dunno if i call it lucky or what, but i dropped commons-query.wikimedia.org pointing at text-lb.ulsfo into my /etc/hosts. It loads, and the douglas adams example even gives results :) [00:28:56] lucky because not all the servers actually loaded their data. I probably have to do it again, but this is great! [00:29:15] awesome! [00:29:27] do we have wcqs-specific examples? since it's a different dataset [00:29:50] yes, i think there is a wiki page on commons with a special template for each one that it loads from [00:30:15] makes sense [00:30:23] when in doubt, it's either a mediawiki extension or template magic :D [00:30:28] usually :) [00:31:43] ebernhardson: okay just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/724545 so (once puppet has auto-run wherever it needs to) that should be the last of the lvs work, fingers crossed [00:32:12] Looks to be https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples [00:33:16] neat [00:46:01] Stepping out to go exercise, puppet's ran on the puppetmasters (which are also the `configmasters` presumably because they have to talk to etcd) and the `disc_desired_state.py`'s updated as expected [00:46:10] sounds good, thanks a bunch! [00:47:07] likewise :) [05:33:52] dcausse: hi, I don't know if you are aware but 20% of wall time of all job runners is just spent pushing the cirrus search jobs to the queue (CirrusSearch\Updater::pushElasticaWriteJobs).
Not running the jobs, just trying to queue them [05:34:44] I'll check why eventbus is so slow at taking them, but can it be that these jobs have massive payloads? [05:52:13] I might be going crazy but ElasticaWrite::build returns self, while it should return JobSpecification instead [06:08:50] Amir1: looking, from my memory Updater::pushElasticaWriteJobs is pushing 3 messages (one per elastic cluster) they should not be massive and should only contain page metadata not the content [06:12:15] Amir1: do you have a flamegraph somewhere? [06:13:53] dcausse: I thought I linked it [06:13:54] https://performance.wikimedia.org/arclamp/svgs/daily/2021-09-24.excimer-wall.RunSingleJob.svgz [06:14:00] sorry [06:14:16] thanks! [06:25:31] the json payload is quite small, but this queue is at least 3x the edit rate [06:25:39] which is expected [06:28:05] but I'm surprised that writing to event-gate is much slower than elastic itself [06:28:41] well, I might be misinterpreting the flamegraph tho [06:41:20] looking at historical data the volume sent to this topic has not changed [06:45:15] Amir1: I can't say if it's "normal" but at a glance I don't see anything particularly wrong here nor something that changed in the past few weeks. I'm surprised as well that pushing to eventgate is that slow compared to writing to elastic (Elastica\Transport\Http 3% to 4% of the time) which needs to push the content as opposed to just the page metadata [06:47:11] yeah, it can be something wrong there, I will look into it, but debugging kafka and eventbus is not fun :D [06:47:28] Thanks for checking Cirrus-side! [06:47:52] Amir1: yw! let me know if you need something else :) [06:48:10] sure, I'll let you know if I make some progress [07:06:01] errand [07:41:04] errand 2 [08:35:33] zpapierski: are you back? [08:35:56] ping me when around, I'd like to discuss how we screen for a Graph Consultant with you and dcausse [09:14:09] sigh we're still getting OOMKilled by k8s sometimes...
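[Editor's note: an illustrative sketch of the per-cluster fan-out discussed above, where each edit queues one small, metadata-only job per Elasticsearch cluster, which is why the queue runs at roughly 3x the edit rate. The function and cluster names here are hypothetical; the real CirrusSearch code is PHP, so this only models the shape of the traffic, not the actual API.]

```python
# Hypothetical model of Updater::pushElasticaWriteJobs' fan-out: one edit
# produces one lightweight job per cluster, carrying page metadata only,
# never the page content. Cluster names are assumed for illustration.
CLUSTERS = ["eqiad", "codfw", "cloudelastic"]

def make_write_jobs(page_id: int, rev_id: int, clusters=CLUSTERS):
    """Return one small job spec per cluster (metadata only, no content)."""
    return [
        {"cluster": cluster, "page_id": page_id, "rev_id": rev_id}
        for cluster in clusters
    ]

# A single edit fans out into three queued messages, hence queue volume
# is roughly 3x the edit rate even though each payload is tiny.
jobs = make_write_jobs(page_id=42, rev_id=1001)
```

The point of the metadata-only payload is that pushing to the queue should be cheap relative to the actual elastic write, which does ship the content; that is what makes the flamegraph result above surprising.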
[09:25:08] gehel: I am now [09:25:17] dcausse: do we recover from that? [09:25:34] yes it does [09:25:36] meet.google.com/ons-crjs-sjf [09:25:40] and is it related to that metaspace issue we had in the past? [09:25:48] but it can't take a savepoint [09:26:04] I mean it's like "sometimes it works" [09:26:54] hmm, meet is doing an "infinite wait thing" [09:27:51] or maybe it's my internet today [09:28:09] gehel: one sec, need to do something about my router [10:09:29] lunch [10:40:09] relocation&lunch [12:51:20] how time flies. Meeting coming up, but I have time at 15:30 to continue that discussion on our Graph Consultant [12:51:36] dcausse, zpapierski: would that work for you? [12:52:02] gehel: yes, I have to run at 16:30 [12:52:10] gehel: I'm cool with that [12:52:18] I have another meeting at 16:00 [13:22:17] zpapierski: when you have a sec: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724738 [13:23:13] thanks! [13:23:19] yw :) [13:23:26] but I can't +2 there [13:24:10] gehel: I took a stab at refactoring kafka spicerack module to limit operations with site prefix, and I think slowly but surely I'll reach your version of code :D [13:24:27] or at least something similar [13:24:47] hm.. I think you need +2 in this repo, it's like the wds-deploy repo, you +2 you deploy [13:29:43] I think it slipped my mind in the past [13:29:59] that happens when I don't write things in my personal Trello log... [13:33:28] dcausse: https://meet.google.com/ons-crjs-sjf [14:00:31] break [14:23:01] going out, might be back later this afternoon [14:29:57] Trey314159: I'll be 2' late, need a quick break in between meetings [14:30:59] no worries [16:14:03] :o https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/ [16:16:05] zpapierski: dcausse ^ [18:37:08] ebernhardson: any guess what ticket corresponds to the wcqs oauth work? 
I found https://phabricator.wikimedia.org/T290300 which seems close but wasn't sure [18:37:32] (just want to include a link to the phab ticket in a comment above the new entry I'm adding to the private repo [18:37:36] ) [18:41:13] Ah nevermind it'd definitely be https://phabricator.wikimedia.org/T280006 [18:41:16] hmm, there's certainly something [18:41:27] yea seems appropriate [18:41:27] (just found it) [19:01:35] ebernhardson: okay `hieradata/role/common/wcqs/public.yaml` with the specified oauth secrets has been created in the `/srv/private` repo [19:02:10] as a side note, I wonder if https://github.com/wikimedia/labs-private/commit/e851339dcf1f55100d3b8aca75b265f17968dae7 is still necessary. my guess is yes [19:02:41] I was going to add a corresponding `hieradata/role/common/wcqs/public.yaml` to labs-private since usually we need to duplicate a dummy version of /srv/private changes, but turns out that doesn't apply to hieradata stuff (usually): https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet [19:03:02] ryankemper: awesome! I've been looking over the rest of it, i suspect at least two issues will crop up: 1) the redirect after auth success is hardcoded to wcqs-beta and 2) we will need to set profile::query_service::oauth true in hiera [19:03:06] I think maybe if I did add that, then we could get rid of https://github.com/wikimedia/labs-private/commit/e851339dcf1f55100d3b8aca75b265f17968dae7, but I don't see any benefit of doing so so I'll just leave it how it is [19:03:31] ebernhardson: okay I can work on `(2)`, where is the hardcoding of `(1)`?
[19:04:22] ryankemper: i dunno if it's entirely accurate, but i estimate that if PCC compiles without failing, the labs/private repo has all the needed hiera information [19:05:04] ryankemper: hardcoding is in puppet, modules/query_service/templates/nginx.erb the only hard part there is naming :) [19:05:18] (for the oauth success redirect) [19:05:51] so far i had @public_url, then @oauth_success_redirect_url, then i was pondering where it actually best fits in the hiera we have today [19:05:58] ebernhardson: do we want to do any branching logic or w/e to preserve the beta functionality, or just replace it with the link to the actual production service [19:06:02] but honestly most of that doesn't matter and we can just fit it in however :P [19:06:33] ryankemper: hmm, i was guessing we would stuff it in a variable and let beta keep its old value, since we agreed to keep beta running for a few more months [19:06:52] sounds good to me [19:13:43] i suppose one awkward part is right now we have the oauth_settings key that has both the secrets and the public info. Tempted to split, but the other way to think of this is the oauth success url doesn't really have much to do with oauth. We are telling it where to send the user after auth success [19:14:19] which also reminds me of something to test, does our oauth break links? In some auth impl's you follow a link to a specific thing, and then after auth get redirected to a blank homepage instead of, say, the query you were trying to run [19:32:04] i guess looking at this, the only actual secret value is oauth_consumer_secret, will separate the secret from the rest of the settings so that most of it can be in normal hiera with only the single secret value in private [19:32:27] nice, that sounds ideal [19:34:23] So will the `X-redirect-url` be a key in the `oauth_settings` hash basically?
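[Editor's note: the deep-link concern raised at [19:14:19] — does auth send the user back to what they originally requested, or dump them on the homepage? — can be sketched as follows. All names and endpoints below are hypothetical and not taken from the WCQS implementation; this just illustrates the pattern being tested for.]

```python
# A well-behaved auth flow carries the originally requested URL through the
# round-trip and redirects back to it on success, falling back to the
# homepage only when no return target was recorded. auth.example and the
# return_to parameter name are assumptions for this sketch.
from urllib.parse import parse_qs, urlencode, urlparse

def login_redirect(requested_url, auth_endpoint="https://auth.example/authorize"):
    """Send the user to auth, preserving the original URL as a return target."""
    return auth_endpoint + "?" + urlencode({"return_to": requested_url})

def success_redirect(callback_url, default="https://commons-query.wikimedia.org/"):
    """After auth success, prefer the recorded return target over the default."""
    params = parse_qs(urlparse(callback_url).query)
    return params.get("return_to", [default])[0]

# Round-trip: the user ends up back at the query they were trying to run,
# not on a blank homepage.
original = "https://commons-query.wikimedia.org/#my-query"
final = success_redirect(login_redirect(original))
```

An implementation that drops the `return_to` step is exactly the "auth breaks links" failure mode described above: every callback falls through to the default.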
[19:35:34] ryankemper: i'm thinking yes, although i'm not sure if we pass the whole oauth_settings down that far, probably just the url at some point. I guess just thinking if we already have all these settings together it seems this needs to go there too [19:35:47] but i was mildly annoyed that non-secrets then go in a secret thing :) [20:03:25] yeah that is a little awkward [20:04:21] but if we pull out `oauth_consumer_secret` doesn't that avoid the problem? since oauth_settings wouldn't have any secrets [20:04:33] or is the problem the "not sure if we pass the whole oauth_settings down that far" part [20:04:37] yup, that's what https://gerrit.wikimedia.org/r/c/operations/puppet/+/724829/1 tries to do [20:05:34] oh, that part about how far to pass is really unimportant and just semantics. There is a theory that if you need one value you don't pass the entire database, you just pass the one value. Similarly here if we only need a single value from $oauth_settings it seems odd to pass the whole thing [20:06:04] but we can still pull it from hiera that way, no biggie. Just minor structure things [20:07:14] understood, thanks for the explanation [20:08:48] sometimes i get off track..puppet isn't a typical programming language and sometimes trying to apply more general practices to puppet makes things more difficult...it's always hard to know when :) [20:10:29] doh, for some reason i put Wmflib in those patches :P [20:10:35] (instead of Stdlib::HTTPSUrl) [20:16:00] * ebernhardson will some day remember to attach phab tickets [20:38:56] ebernhardson: the oauth consumer key made it into https://gerrit.wikimedia.org/r/c/operations/puppet/+/724829/3/hieradata/role/common/wcqs/public.yaml :P [20:39:18] ebernhardson: we'll need to remove the other settings from `/srv/private` after we merge this puppet patch anyway, so might as well just generate a new oauth key?
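[Editor's note: a sketch of the settings split discussed above — only `oauth_consumer_secret` is actually secret, so it can live alone in the private repo while the rest of `oauth_settings` sits in public hiera, and a template gets passed just the one value it needs rather than the whole hash. Key names and all values below are illustrative, not the real hiera layout.]

```python
# Public side (analogous to hieradata in the public puppet repo): the
# consumer key is like a username and is safe to publish.
public_oauth_settings = {
    "oauth_consumer_key": "abc123",  # public, placeholder value
    "oauth_success_redirect_url": "https://commons-query.wikimedia.org/",
}

# Private side (analogous to /srv/private): only the one true secret.
private_oauth_settings = {
    "oauth_consumer_secret": "s3cret",  # dummy value for illustration
}

def render_redirect(oauth_success_redirect_url: str) -> str:
    """Pass just the single value the template needs, not the whole hash
    (the "don't pass the entire database" point made above)."""
    return f"return 302 {oauth_success_redirect_url};"

conf = render_redirect(public_oauth_settings["oauth_success_redirect_url"])
```

The narrow-parameter style keeps the secret's blast radius small: nothing that only needs the redirect url ever sees the consumer secret.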
[20:39:20] ryankemper: right, the consumer key is like the username, the consumer secret is like the password [20:39:25] oh [20:39:25] doh [20:41:05] bd808: could i double check, there's nothing secret about the oauth consumer key? [20:41:53] ebernhardson: correct. the consumer key is public data. The secret key is secret. :) [20:42:15] bd808: perfect, thanks :) [20:52:54] hmm pcc failing after merging https://gerrit.wikimedia.org/r/c/labs/private/+/724831/3 [20:53:02] not sure if a puppet run needs to occur somewhere...I imagine it must [20:53:16] not sure what hosts we do compilation on [20:53:39] should be the same hosts the facts upload targets [20:53:41] sec [20:54:28] should be any of the hosts hardcoded into COMPILERS var of modules/puppet_compiler/files/compiler-update-facts [23:14:12] hmm, i can't really see why pcc isn't finding the pseudo-secret