[09:26:56] ebernhardson: perhaps not but unsure, someone added a way to apply the logistic function on a tree score (hacking the xgboost "objective" param to something with logistic in the model prior to uploading) but unsure if this is affecting ranking or not... is the rank of sum(raw tree score) identical to the rank of sum(logistic(tree score))?
[09:29:35] wait the logistic is applied to the sum so should be fine I think
[09:29:47] https://github.com/o19s/elasticsearch-learning-to-rank/blob/main/src/main/java/com/o19s/es/ltr/ranker/dectree/NaiveAdditiveDecisionTree.java#L69
[10:26:30] dcausse: since we had that discussion regarding kafka log retention vs. compaction; turns out it’s not an explicit OR, by specifying cleanup.policy=delete,compact both strategies are applied: https://docs.confluent.io/platform/current/installation/configuration/topic-configs.html#cleanup-policy (since 0.10.1.0, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-71%3A+Enable+log+compaction+and+deletion+to+co-exist) - So I would propose this as topic configuration for page_rerender to reduce storage needs.
[10:38:38] pfischer: nice!
[10:40:03] errand + lunch
[13:38:59] inflatador: is T352083 related to your work on adding checks on LDF (T347355)?
[13:38:59] T352083: ProbeDown - https://phabricator.wikimedia.org/T352083
[13:39:00] T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355
[13:47:04] dcausse: I'm looking at T351942, do we have a reasonable way to know which statements are indexed? And thus, should I just close this ticket as not indexed?
[13:47:05] T351942: wbstatementquantity search keyword seems broken - https://phabricator.wikimedia.org/T351942
[13:47:48] gehel: looking
[13:49:17] til that we have wbstatementquantity...
[14:09:11] gehel: y
[14:09:56] back in ~90
[15:30:40] back
[17:03:20] so, how many times was i wrong?
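For the 10:26 Kafka proposal, the combined topic configuration would look roughly like this (topic name from the chat; the settings shown are a sketch of the idea, not a decided configuration):

```properties
# With both strategies combined (Kafka >= 0.10.1.0, per KIP-71), segments past
# retention are deleted AND the retained log is compacted, keeping only the
# latest record per key.
cleanup.policy=delete,compact

# Applied to an existing topic it would look something like (broker address is
# a placeholder; note the brackets required for comma-separated values):
#   kafka-configs.sh --bootstrap-server <broker> --entity-type topics \
#     --entity-name page_rerender --alter \
#     --add-config 'cleanup.policy=[delete,compact]'
```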
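The ranking question from 09:26 can be checked with a quick sketch: the logistic function is strictly increasing, so applying it to the *summed* tree score (as the linked NaiveAdditiveDecisionTree code does) cannot change the ordering. Applying it per tree before summing would be a different matter. The names and scores below are illustrative, not the plugin's actual code:

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def rank_order(scores):
    """Indices of documents, best score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical per-document summed tree scores: since logistic() is strictly
# increasing, applying it AFTER the sum never changes the ranking.
summed = [1.3, -0.7, 4.2, 0.0]
assert rank_order(summed) == rank_order([logistic(s) for s in summed])

# Applying logistic per tree BEFORE summing can flip the order of the sums,
# so it matters that it is applied to the sum, not to each tree score.
a = [10.0, -9.0]  # raw sum 1.0
b = [0.3, 0.3]    # raw sum 0.6
assert sum(a) > sum(b)
assert sum(map(logistic, a)) < sum(map(logistic, b))
```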
:)
[17:13:21] I’m not aware of any mistakes, it was very informative and just the right amount of detail for the available time frame.
[17:17:04] i was surprised to finish at :45 after, completely guessed at how much content to include :)
[17:17:11] good guess i suppose
[17:17:41] workout, back in ~40
[17:26:54] ebernhardson: definitely!
[18:00:20] back
[18:16:33] hmm, compare-clusters.py says testwiki_content is the same on relforge and eqiad. good sign
[18:17:13] dcausse pfischer do y'all feel comfortable changing prod rdf streaming updater over tomorrow during pairing? Don't need an answer right away, feel free to ping back tomorrow morning your time
[18:18:10] I need to get ol' compare-clusters into the rolling operation cookbook, thanks for the reminder ;)
[18:18:50] also, Yet Another LDF Monitoring patch is ready for review, just fixing a typo https://gerrit.wikimedia.org/r/c/operations/puppet/+/978118
[18:19:38] it also claims frwiki_content is aligned between the two clusters. It's a simple script so i'm pretty sure it works, but somehow i'm still a bit surprised things look to be working
[18:20:31] inflatador: as you want
[18:21:04] ebernhardson: indeed, seems surprising since we don't have page rerender for frwiki?
[18:21:59] dcausse sounds good...I'll prepare before our pairing session tomorrow
[18:23:35] dcausse: this isn't a full content compare, it checks that a given page_id either exists or doesn't exist on both sides, and compares the rev_id and the page title
[18:23:44] so it seems plausible those stay correct without rerenders
[18:23:56] oh that makes sense
[18:24:44] hmm, i suppose could add a few more fields to compare pretty easily. But indeed without rerender events it would only tell us what we already know
[18:33:41] expanded to also compare the opening_text. testwiki_content passes, frwiki_content passes, itwiki_content passes.
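The comparison described above (page existence on both sides, plus rev_id and title) could be sketched as below. This is an illustrative stand-in, not the actual compare-clusters.py; the input dicts represent documents fetched from the two clusters (e.g. relforge and eqiad):

```python
def compare_clusters(left, right, fields=("rev_id", "title")):
    """Compare two {page_id: doc} maps; return a list of discrepancy strings."""
    problems = []
    for page_id in sorted(left.keys() | right.keys()):
        if page_id not in left or page_id not in right:
            problems.append(f"{page_id}: exists on only one side")
            continue
        for field in fields:
            if left[page_id].get(field) != right[page_id].get(field):
                problems.append(f"{page_id}: {field} differs")
    return problems
```

Adding another field such as opening_text is then just a matter of extending `fields` — the "a little more involved than at first glance" part from the chat is presumably fetching the extra field from both clusters, which this sketch glosses over.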
Feeling more dubious :P
[18:36:16] lunch, back in time for pairing
[18:41:28] :)
[18:41:59] actually i did an incomplete job of adding the extra field, so that means nothing :P
[18:42:08] it is easy, but a little more involved than at first glance
[18:52:28] i'm reasonably certain it's now comparing the right thing, frwiki_content still passes on opening_text. rerunning against the main text field now but it hasn't found anything in a couple minutes of running
[18:52:47] i wonder how many of the refreshes turn into noops, and if we can track that in the pipeline
[19:01:43] more metrics would be nice indeed
[19:01:46] dinner
[19:20:18] * ebernhardson is mildly amused that ElasticsearchWriter null checks metricGroup *after* it invokes a method on it :P
[19:21:50] also somewhat annoyingly, the SinkWriterMetricGroup that is provided is highly specialized and can't emit arbitrary metrics.
[19:26:42] maybe we could do something extra silly like a custom RichSinkFunction that delegates to the real sink..
[19:28:06] * ebernhardson is so far highly disappointed with the builtin elastic sink :P
[19:28:10] back
[20:36:55] ebernhardson you were right, test patch shows the same failure
[20:40:30] ebernhardson ryankemper https://gerrit.wikimedia.org/r/c/operations/puppet/+/978134 is ready if y'all can take a look
[20:42:22] inflatador: I think we still need the `/bigdata/ldf` to be part of the replacement string right?
[20:45:11] ryankemper 99.9% sure you're right, let me check though
[20:46:55] yup, you're right. We also might need a valid TLS cert for that new CNAME too
[20:50:12] oh!
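Tracking how many refreshes turn into noops, as wondered at 18:52, could be as simple as diffing the freshly rendered document against what is already indexed before writing, and counting both outcomes. A minimal sketch with made-up names, not the actual updater pipeline:

```python
from collections import Counter

metrics = Counter()

def handle_rerender(indexed_doc, rendered_doc):
    """Count a rerender as a noop when it changes nothing we index.

    Returns the document to write, or None when the write can be skipped.
    """
    if indexed_doc == rendered_doc:
        metrics["rerender.noop"] += 1
        return None
    metrics["rerender.update"] += 1
    return rendered_doc
```

In a Flink job this counting would normally go through the metric group rather than a module-level Counter, which is exactly where the SinkWriterMetricGroup limitation complained about above gets in the way.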
yea that's a good call
[20:50:39] i wonder if the nginx will also need an appropriate virtualhost
[20:51:50] nope, looks like it's configured as a default and responds to any name
[20:55:42] * ebernhardson wonders if we should be forking and upstreaming changes to the sink, doing more awkward wrapping to get custom metrics is just meh
[20:55:54] mutante is suggesting we use the discovery.wmnet TLD...he's gonna help with CRs. Just FYI
[20:56:33] cool, it does seem reasonable to keep it all the same even if it doesn't add much here
[20:58:13] hi
[20:58:24] so i recommend you just _pretend_ you already have a codfw backend
[20:58:36] but do NOT worry about setting up LVS and all that.. way too complex for now
[20:58:49] but you can _still_ totally have a .discovery.wmnet name
[20:59:03] if you look in the DNS repo in the wmnet template
[20:59:12] there is a section DISCOVERY SERVICES
[20:59:21] and further down after that there is this line:
[20:59:26] ; misc web services with multiple backends but without geoip
[20:59:44] example looks like this:
[20:59:45] peopleweb 300 IN CNAME people1004.eqiad.wmnet.
[20:59:45] ;peopleweb 300 IN CNAME people2003.codfw.wmnet.
[20:59:54] see how the codfw one is simply commented out?
[21:00:19] that's how we switch over discovery names to the _other_ backend when we don't have geoDNS/LVS/etc
[21:00:47] this will make you more standards compliant.. while you don't have to do more just yet
[21:01:36] mutante cool, patch is up at https://gerrit.wikimedia.org/r/c/operations/dns/+/978142
[21:02:17] lgtm, +1
[21:02:24] now.. either way you need a TLS cert
[21:02:32] is this using envoy in the backend?
[21:03:10] Y
[21:03:50] at least I think so? We have nginx too. ebernhardson or ryankemper is envoy involved?
[21:04:29] ok, so you need to create the cert by first going to the puppetmaster
[21:04:34] where the private repo is
[21:04:36] modules/secret/secrets/certificates/certificate.manifests.d/
[21:04:55] see this: https://wikitech.wikimedia.org/wiki/Cergen#Cheatsheet
[21:05:07] you gotta create a yaml file there
[21:05:23] inflatador: yeah we have envoy on wdqs hosts
[21:05:26] it will be for foo.discovery.wmnet
[21:05:36] and in alt_names: you put the backend host names
[21:06:32] then you run the cergen command and get a private key and a public key
[21:06:59] finally you copy the public part to the public repo, operations/puppet
[21:07:07] and commit in both places
[21:08:42] the public files are in modules/profile/files/ssl/
[21:08:47] and need to be renamed to .crt
[21:09:03] put it next to these: ./modules/profile/files/ssl/planet.discovery.wmnet.crt
[21:09:15] ./modules/profile/files/ssl/wdqs-internal.discovery.wmnet.crt
[21:09:36] then also commit in "labs/private" and add a fake private key :p
[21:09:46] cool, I've done this but it was probably like the first month I got here...been awhile ;)
[21:09:53] remember labs/private is neither labs nor private :)
[21:10:06] I hope my instructions are still accurate
[21:10:15] and not like something changed since I did it last
[21:10:22] but I do see the files and wikitech page.. so..
[21:10:47] I've also looked at https://wikitech.wikimedia.org/wiki/PKI/Clients but it looks like it's for client certs
[21:11:11] ack, I think this is still "cergen" then
[21:11:21] or the repos would have been cleaned up
[21:12:21] agreed...starting the cergen fun ;)
[21:12:24] once you have the cert and if you use envoy, you will tell it the cert name in Hiera.
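Pulling the steps above together, the cergen manifest in certificate.manifests.d/ would look roughly like this. The certificate name, backend host, and field layout here are illustrative guesses; the Cergen cheatsheet linked above is the authoritative reference for the exact schema:

```yaml
# modules/secret/secrets/certificates/certificate.manifests.d/wdqs-ldf.discovery.wmnet.yaml
# (illustrative sketch, not a verified manifest)
wdqs-ldf.discovery.wmnet:
  authority: puppet_ca          # assumption: signed like the sibling discovery certs
  alt_names:
    - wdqs1015.eqiad.wmnet      # backend host name(s), per the advice above
```

After `cergen` is run, the public part gets copied to modules/profile/files/ssl/ in operations/puppet (renamed to .crt), with a fake private key committed to labs/private, as described in the chat.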
like this:
[21:12:27] hieradata/role/common/planet.yaml:profile::tlsproxy::envoy::global_cert_name: "planet.discovery.wmnet"
[21:12:43] * inflatador misses vault and terraform for PKI
[21:13:00] hieradata/role/common/wdqs/public.yaml:profile::tlsproxy::envoy::global_cert_name: "wdqs.discovery.wmnet"
[21:13:03] hieradata/role/common/wdqs/internal.yaml:profile::tlsproxy::envoy::global_cert_name: "wdqs-internal.discovery.wmnet"
[21:13:06] modules/role/manifests/wdqs/public.pp: include profile::tlsproxy::envoy # TLS termination
[21:13:09] modules/role/manifests/wdqs/internal.pp: include profile::tlsproxy::envoy # TLS termination
[21:15:12] mutante: one thing I'm not sure of wrt envoy, this ldf wdqs host functions both as a normal host and as the ldf host. So wouldn't there be two competing `global_cert_names`, `wdqs.discovery.wmnet` and `wdqs-internal.discovery.wmnet`?
[21:15:36] mutante: er, sorry that last part should be `wdqs.discovery.wmnet` versus `wdqs-ldf.discovery.wmnet`
[21:16:11] ryankemper: I don't see a problem with having 2 discovery names pointing to the same host. after all this allows you to change it in the future
[21:16:29] ryankemper: but what might be a problem is having 2 envoy configs on the same host.. hmmm
[21:16:42] yes the latter is what i'm wondering about
[21:16:52] last time I checked I couldn't :(
[21:16:58] but that's been a while
[21:17:04] I think it will be OK, so long as we have both domains in the TLS alt names
[21:17:08] that's basically why I have it stalled for aphlict
[21:17:25] if you use one cert with 2 alt names that could solve it
[21:19:18] I mean.. or you just use wdqs.discovery.wmnet in this:
[21:19:20] ip4 => ipresolve('wdqs1015.eqiad.wmnet', 4),
[21:19:31] if you know you never want to decouple the 2 things...
[21:19:44] then you don't need to make a new discovery name nor alias
[21:19:52] just use the existing discovery name then
[21:20:19] it's only that this means you always have to switch 2 things at once
[21:20:50] separate is that you want a codfw equivalent at some point
[22:02:21] ryankemper updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/978134 based on your comment if you wanna take another look
[22:12:32] ebernhardson: are the changes regarding the sink already in gitlab? I could try to generalize them so we can create an upstream patch.
[22:13:31] pfischer: not yet, i was looking over the implementation and trying to figure out how to get the right context in, but unfortunately from our side we can't replace enough classes without using reflection to call private methods
[22:14:13] and it would be an ugly hack even without the reflection, it seems like if we want maintainable we would have to add response metrics directly to the sink, perhaps extending what dcausse worked on for response error handling
[22:40:06] ryankemper looks like you provisioned a new wdqs discovery cert on Aug 31st. Do you remember if you had to distribute the cert anywhere after updating? public puppet patch is https://gerrit.wikimedia.org/r/c/operations/puppet/+/954123
[22:41:25] the puppetmaster wants me to run `puppet cert clean` and I just want to make sure it won't invalidate the existing certs for wdqs.discovery.wmnet everywhere
[22:43:34] Sorry forgot to say I stepped out for lunch
[22:43:48] Back now
[22:44:18] inflatador: sec, lemme look at what’s going on in srv/private
[22:44:52] welp, cergen said it wouldn't generate a new cert, but it looks like it actually did
[22:45:13] ryankemper np, take your time.
I'm going to stick around a little late today since I missed this morning
[22:47:38] inflatador: okay, yeah /srv/private looks good
[22:47:59] Is your puppet cert clean question still outstanding or was that addressed by cergen generating the new cert?
[22:49:44] I think we're OK...mainly wondering if you had to do anything besides committing the new key to the private repo/new cert to public repo, and running puppet on the wdqs hosts
[22:56:14] inflatador: yes there is one more step, changing https://gerrit.wikimedia.org/g/operations/puppet/+/6675171fd0f9e9ae6ec0668724206cafd0594d83/files/ssl/wdqs.discovery.wmnet.crt
[22:57:31] oh, actually you probably already did that based off the `new cert to public repo` part of your comment. but that implies we still need the `labs/private` patch
[22:57:59] I haven't done it yet, but I'll get a patch up
[23:03:09] ryankemper just saw an alert for envoy on wdqs2020 ... hoping I haven't borked SSL but I'm disabling puppet on all wdqs hosts just to be careful
[23:03:39] inflatador: let's hop on a call once you've run that disable cmd
[23:04:38] ryankemper ACK
[23:04:54] https://meet.google.com/svv-yttn-kum
[23:52:16] Alright, we got all that stuff sorted out. Disabled puppet everywhere, revoked the old cert, regenerated the cert and made a corresponding puppet patch (we didn't need a labs/private patch since we're updating an existing cert) and all is well again after running puppet
[23:53:45] {◕ ◡ ◕}