[08:20:41] ebernhardson: sorry for the naming, it sounded fun in the beginning...
[10:31:45] mpham: think it's enough? https://docs.google.com/document/d/1NPpv-uPw2oDv6JgmXQ3V4wKlxnRcyXNkX0PxO8EBth8/edit?usp=sharing
[10:32:14] btw, isn't CET peaceful at this time :)
[10:32:16] ?
[10:32:45] wait, you're actually one hour later, but still
[10:50:22] ejoseph: let me know when you're around
[11:19:19] lunch
[11:28:14] ejoseph: or ignore that, I just noticed you're out today and tomorrow
[11:37:39] jenkins-bot isn't super keen to add reviewers lately, https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/744772 didn't get them either
[13:02:21] errand + lunch
[15:52:05] \o
[15:53:55] o/
[15:54:00] o/
[17:13:43] the cookbooks need to vary a number of parameters between wcqs and wdqs instances, things like service names, data paths, etc. Rather than duplicating all of that in the cookbook, i was thinking of parsing (and adding to as necessary) /etc/query_service/vars.yaml from the target host to get all that data. Seem reasonable?
[17:14:13] the mutation topic would go there too
[17:14:30] is there some silly reason related to categories that makes it a pain?
[17:14:50] isn't vars.yaml controlled by scap?
[17:15:02] hmm, i thought it came from puppet? ok sec :P
[17:15:41] I vaguely remember something painful regarding this, vars.sh, vars.yaml and scap
[17:16:42] wikidata/query/deploy scap/config-files.yaml says it writes ldf-config.json and vars.sh, using vars.yaml as input to each
[17:17:15] yea, vars.yaml comes from query_service::common in puppet
[17:19:26] ok
[17:20:39] why would you need the topic to be there?
[17:20:55] the cookbooks currently have MUTATION_TOPIC, which is hardcoded to wdqs
[17:21:01] IIRC the offset handling is run from the cumin hosts
[17:21:33] right, but cookbooks can ssh, i was thinking to effectively `ssh foohost cat /etc/query_service/vars.yaml` and parse it
[17:21:38] ok
[17:22:16] similarly it does things like `service wdqs-blazegraph restart` which has to vary for wcqs hosts, and the data path is /srv/wdqs vs /srv/query_service, and other variances
[17:22:16] just out of curiosity, if that data is provided by puppet anyway, could it make sense to ship a config file to the cumin host that will be parsed by the cookbook?
[17:22:18] I think the cumin host can have some config pushed to it, but not sure how generic this can/needs to be
[17:22:53] we can ship config in /etc/spicerack/cookbooks with the FQDN of the cookbook (like sre.switchdc.services.yaml)
[17:23:05] it's a supported feature
[17:23:13] it would need per-host data about what's appropriate, or we need extra bits to map a host to a cluster and have per-cluster config written somewhere
[17:23:24] i was liking the per-host thing because it's nothing new to manage :)
[17:23:39] ah, it's per host, I was hoping per cluster
[17:25:06] well, per-(cluster, dc) pair, as long as staging is a separate cluster (i think it is)
[17:26:01] flink@staging is not read by anyone yet
[17:26:38] so it's one topic per DC at the moment
[17:26:48] ok, so no blazegraph on the other end. no variance to deal with :)
[17:27:42] yes, this flink staging is just to quickly test the pipeline, it's nowhere near a full end-to-end env
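(For concreteness, the per-host approach discussed above could look roughly like the following. This is a minimal Python sketch using plain ssh plus PyYAML rather than the real spicerack/cookbook API, and the vars.yaml key names are made up for illustration; as noted above, anything like a mutation topic would first have to be added to the file via puppet.)

```python
import subprocess
import yaml  # PyYAML


def load_query_service_vars(host: str) -> dict:
    """Fetch /etc/query_service/vars.yaml from the target host and parse it."""
    out = subprocess.run(
        ["ssh", host, "cat", "/etc/query_service/vars.yaml"],
        check=True, capture_output=True, text=True,
    ).stdout
    return yaml.safe_load(out)


# Hypothetical usage: derive service name, data path and mutation topic per host
# instead of hardcoding the wdqs values in the cookbook. The key names and
# defaults here are illustrative, not the actual contents of vars.yaml.
host_vars = load_query_service_vars("wcqs1001.eqiad.wmnet")
blazegraph_service = host_vars.get("blazegraph_service", "wcqs-blazegraph")
data_path = host_vars.get("data_dir", "/srv/query_service")
mutation_topic = host_vars.get("mutation_topic")  # would need to be shipped by puppet
```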
[17:49:15] not clear if the alert about update rates is meaningful, grafana shows a ~30 minute gap in data collection which just came back (at normal-ish levels)
[17:49:36] yeah... I was looking into graphite
[17:49:46] I wonder if it's related to a problem there
[17:50:29] it's using MediaWiki.CirrusSearch.${site}.updates.all.sent.rate
[17:51:34] same, there's a ~30 min gap in graphite too
[17:51:48] yea, poking at other dashboards, seeing similar missing metrics
[17:58:57] confirmed to be graphite related apparently
[18:05:10] ^ are those for historical metrics? because https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite happened
[18:06:24] ryankemper: the cirrus update rate alert fired around 9:45, suggesting few or no updates were successful. But it looks to be a new, similar incident with metrics collection
[18:07:44] which reminds me, i hadn't thought to re-tune the per-shard snapshot limits, so the smaller commonswiki_(content|general) indices still took a while. i'll be running the catchup routine on those in a bit and putting cirrus traffic back later today
[18:08:16] thanks for taking care of this!
[18:08:48] ebernhardson: wrt wcqs, wcqs is in `lvs_setup` currently. next up is `monitoring_setup` followed by `production`. Do we want it all the way back to `production` or is there oauth stuff to iron out first?
[18:09:23] ryankemper: i think for the health check to be happy it needs a few patches i put up, lemme find em
[18:09:45] There was https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/744772 which is now merged and needs to be deployed
[18:10:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/743499 is needed to make the logging configuration loadable by oauth
[18:11:25] And then also https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/743500, which is merged and runs the same deployment as the above.
[18:11:50] So, i think if we merge that puppet patch, run the CI deployment, and then sync that to wcqs, it should work
[18:11:57] (or at least, it worked on wcqs1001 when i hacked all that into place :P)
[18:14:25] there is still an unresolved problem with the nginx config: it "works", but when bypassing auth for metrics it seems to issue the /_check_auth request to /sparql with no query string (that then gets closed early and we get logs about EOF)
[18:15:03] i suspect it has something to do with https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/ but i haven't tracked down how exactly
[18:23:28] we're tagged on a train blocker? T297221
[18:23:29] T297221: Search backend error during sending 1 documents to the commonswiki_content_1617495209 index(s): primary shard is not active - https://phabricator.wikimedia.org/T297221
[18:25:39] cbogen_: those are unrelated to the train, they are due to the snapshot restore running
[18:25:46] i'll note that on the ticket
[18:26:18] thanks!
[18:42:09] looks like it finished copying from swift and started the primaries, it should stop complaining now. Still has to copy those primaries out into replicas
[19:48:06] can someone with admin in wikidata-query-deploy in gerrit add me here: https://gerrit.wikimedia.org/r/admin/groups/7213f78459bb36f26379f4af595a2ca62de45727,members
[19:48:26] if you have admin there should be an add button just above the members list
[19:49:39] ebernhardson: done
[19:50:12] gehel: thanks!
[19:51:32] ryankemper, ebernhardson, zpapierski: I've also added you as owners of that group
[19:51:44] woot
[19:57:41] ryankemper: i ran a deploy to all the wcqs instances, so that part is now done. The only part left for lvs should be the puppet patch from above, to make the log config file loadable
[19:58:30] ebernhardson: Merged that a few mins ago, didn't manually run puppet tho so prob hasn't propagated yet
[19:58:48] Will run puppet on w*qs*
[19:58:51] kk, i can run it on one of the instances manually to verify
[19:58:54] ok, that works too :)
[19:59:34] ebernhardson: or more specifically, running on wcqs1001 to test then the rest of the fleet... so what you said basically
[20:00:34] ebernhardson: I assume I'll need to kick over `wcqs-blazegraph.service`?
[20:00:54] ryankemper: yea, that should do it, then check localhost:80/readiness-probe
[20:02:13] :9999/oauth/check_auth gives a 403 which is promising (was 5xx before)
[20:02:35] this looks plausible to me
[20:02:58] Great
[20:03:12] * ebernhardson should really figure out this connection reset by peer thing, it's very verbose...
[20:04:25] Interestingly the readiness-probe seemed to work before restarting blazegraph
[20:04:44] It's blazegraph that loads from `logback.xml` right?
[20:04:45] yea, that one isn't strictly tied up in the oauth problems, it skips auth
[20:05:23] both blazegraph and mw-oauth now load logback.xml, can think of them as separate applications running inside the same server
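(The manual verification above boils down to a couple of HTTP checks after restarting the service. A small self-contained Python sketch, stdlib only; the URLs come from the conversation above, while the expected status codes are assumptions based on what was observed — 200 for the readiness probe, 403 for check_auth without a valid OAuth session.)

```python
import urllib.error
import urllib.request


def status(url: str) -> int:
    """Return the HTTP status code for a GET, without raising on 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Expected codes are assumptions from the observations above: the readiness
# probe skips auth and should be 200; check_auth without a session should be
# 403 rather than the earlier 5xx.
checks = {
    "http://localhost:80/readiness-probe": 200,
    "http://localhost:9999/oauth/check_auth": 403,
}
for url, expected in checks.items():
    code = status(url)
    print(f"{url}: {code} (expected {expected})", "OK" if code == expected else "FAIL")
```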
[20:32:51] not sure it means anything, but i wonder why nginx talks to jetty using http/1.0
[20:39:24] ebernhardson: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_http_version looks like that's the default
[20:47:57] hmm, i suppose it doesn't matter. keepalive and ntlm shouldn't be important
[21:11:06] so on closer look, the EOF errors appear to be because this is matching /_check_auth: rewrite ^ /bigdata/namespace/$aliased/sparql last;
[21:12:29] not sure how order of operations is decided there, _check_auth is declared earlier in the file but it doesn't seem to get evaluated before this
[22:47:10] i think https://gerrit.wikimedia.org/r/c/operations/puppet/+/744892 will fix it
[22:47:21] well, it makes 1001 happy at least :P
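(To make the ordering issue concrete: server-level rewrite directives are evaluated in the server rewrite phase, before any location — including an earlier-declared /_check_auth — is matched, which is consistent with the catch-all rewrite also capturing the auth_request subrequest. The nginx sketch below shows the shape of the problem and one possible way to scope the rewrite; it is illustrative only, not the actual wcqs config nor necessarily what the gerrit patch above does. The $aliased variable comes from the quoted config, the :9999 backend from the conversation; everything else, including the /sparql location, is assumed.)

```nginx
server {
    # Internal endpoint used by auth_request.
    location = /_check_auth {
        internal;
        proxy_pass http://127.0.0.1:9999/oauth/check_auth;
    }

    # Problematic shape: a server-level catch-all rewrite runs before location
    # matching, so even the auth_request subrequest to /_check_auth ends up
    # rewritten to the sparql endpoint with no query string (early close / EOF).
    # rewrite ^ /bigdata/namespace/$aliased/sparql last;

    # One possible fix: scope the rewrite to the location that needs it, so
    # /_check_auth is no longer hijacked in the server rewrite phase.
    location /sparql {
        auth_request /_check_auth;
        rewrite ^ /bigdata/namespace/$aliased/sparql break;
        proxy_pass http://127.0.0.1:9999;
    }
}
```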