[07:22:49] ryankemper: I don't think we'll need a wdqs host in the wmcs search horizon project next year
[10:02:25] lunch
[10:03:20] lunch
[13:22:13] o/
[13:44:40] o/
[14:00:40] dcausse: I started implementing the rules for the WDQS updater. IIUC the SubgraphRules have to be matched against the lists of Statements (aka triples) fetched for each (from/to) revision. But so far, applying those rules is only implemented for reading from a (hive?) table/dump in SubgraphRuleMapper.scala?
[14:01:36] (I was wondering if and how much code could be reused from SubgraphRuleMapper)
[14:32:38] pfischer: SubgraphRuleMapper is really optimized for spark (by applying rules via a spark query) so not sure we can re-use much of it
[14:39:31] perhaps there are some bits in org.wikidata.query.rdf.spark.transform.structureddata.dumps.SubgraphRuleMapper#filterFromValue (i.e. the logic that analyzes the bnode id) but unsure that is worth the effort and the necessary abstraction
[14:40:59] Yes, the closer I looked the more I thought so. Thank you.
[14:52:43] meh, there does not seem to be a simple method for converting a python datetime object into an int representing the millis since epoch
[14:54:07] For today's SRE staff mtg, I was going to mention that the new streaming updater is in prod. Any docs I should point to?
[14:59:07] inflatador: Uhm, nothing in particular, I did not prepare anything.
[15:00:24] dcausse: your_datetime.datetime() gives you a float, so * 1000 and then apply round/int/math.floor to it depending on what you need
[15:00:46] volans: ah thanks!
[15:00:47] sorry I meant .timestamp()
[15:00:56] https://docs.python.org/3/library/datetime.html#datetime.datetime.timestamp
[15:01:10] thanks! :)
[15:01:34] give it a read for the TZ-related cautions
[15:01:48] but should be doable fairly easily
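(Editor's aside: a minimal sketch of the epoch-milliseconds conversion discussed just above, assuming a timezone-aware datetime; `.timestamp()`, `int()`, and the TZ caveat come from the linked docs, everything else is illustrative.)

```python
from datetime import datetime, timezone

def to_epoch_millis(dt: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch.

    .timestamp() interprets naive datetimes as local time, which is the
    TZ-related caution mentioned above; passing an aware datetime avoids
    the ambiguity. Swap int() for round()/math.floor() as needed.
    """
    return int(dt.timestamp() * 1000)

# Example with an aware UTC datetime.
print(to_epoch_millis(datetime.now(timezone.utc)))
```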
[15:02:21] weekly triage/planning time
[15:03:00] oops
[15:23:30] ebernhardson: Is the saneitizer currently enabled for the codfw SUP? I can’t see any metrics:
[15:23:30] https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&var-k8sds=codfw%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops&var-site=eqiad&var-app=All&var-operator_name=rerender_fetch_failure_sink:_Writer&var-fetch_operator_name=fetch&var-fetch_operator_name=rerender_fetch&var-quartile=0.5&var-fetch_by=authority&var-top_k=1
[15:44:25] pfischer: hmm, it should be, i was seeing metrics last week
[15:44:39] pfischer: oh, you need to set the filter at the top
[15:45:22] pfischer: if you set app: flink-app-consumer-search it should show up
[15:46:26] maybe need to change the query a bit to remove that limitation
[16:01:02] \o
[16:19:33] workout, back in ~40
[17:22:02] dinner
[17:25:15] back
[17:38:51] sending 25% of eqiad writes to SUP
[17:39:36] nice
[17:40:05] set the timestamp to 17:00 (about 40min ago), will see if it manages to not do a giant backfill this time
[17:51:43] all looks pretty normal, lag is down to 0 now
[18:33:15] Awesome! Keep us posted
[18:33:20] Hitting lunch but will be back in ~1h
[19:48:38] sigh...the CI release for discolytics fails with an http access denied :S Now to remember how auth works here..
[19:48:46] when trying to push the tag into gitlab
[19:53:37] sorry, been back
[19:56:30] ebernhardson what dashboards are you watching as you add more writes? Guessing flink-app but LMK if anything else
[20:01:31] inflatador: flink-app and cirrus streaming updater
[20:01:55] and then `JobQueue Job` when changing the cirrus side (will ship this hour, hopefully)
[20:02:16] ACK, I think you or David linked that one earlier
[20:17:07] working on the docs rewrite for the search page. Possibly dumb question: why do we reindex? Is it just to pick up analyzer changes or saneitizer fixes, or are there other reasons?
[20:59:40] current search page is ~38 pages long in Google Docs...time to get out the brass scissors ;)
[21:09:34] didn't manage to deploy the cirrus writes changes today, too many patches in the deploy window. And things are so slow now we can only fit like 2 patches in the hour :S
[21:10:55] Uh, I wasn’t aware of that bottleneck. :-(
[21:11:26] if you want to deploy anything that's not part of the train it has to go through backport windows, those used to take ~5 min per patch, maybe 20 if it was an extension instead of a config change
[21:11:28] now it's 30m+
[21:11:34] (part of k8s)
[21:12:31] Okay. And what takes so long? Do they wait until all pods have restarted?
[21:14:46] basically, yes
[22:20:33] ebernhardson error rate creeping up according to envoy, should we take a look? https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_eqiad&viewPanel=16
[22:20:45] inflatador: yea checking
[22:21:17] will check the old layer 2 stuff as well
[22:21:32] but was wondering if SUP could be a factor
[22:21:49] ryankemper ^^
[22:21:56] inflatador: failed to join {elastic2092-production-search-omega-codfw}{rXT3SeWRSaWDf0ZJg16W5w}{MscHK06NRge1o6JMd5BTVQ}{10.192.16.228}{10.192.16.228:9500}{dimr}{hostname=elastic2092, rack=B2, row=B, fqdn=elastic2092.codfw.wmnet} with
[22:21:58] JoinRequest{sourceNode={elastic2090-production-search-omega-codfw}{25bGrv6vSKuazquM-nmu1Q}{TGz0IkxRQ9aOlN33VgjEdw}{10.192.0.97}{10.192.0.97:9500}{dir}{hostname=elastic2090, rack=A7, fqdn=elastic2090.codfw.wmnet, row=A}, minimumTerm=1202, optionalJoin=Optional[Join{term=1202, lastAcceptedTerm=1167, lastAcceptedVersion=478274, sourceNode={elastic2090-production-search-omega-codfw}{2...
[22:22:08] nodes are having trouble reaching each other
[22:22:33] 2092 can see 2090 at least, hmm
[22:23:16] Caused by: CoordinationStateRejectedException[join validation on cluster state with a different cluster uuid vDSVzYveRzaS0CP1t_y0zQ than local cluster uuid hLqPym4_SUOcKhM8GfQmZw, rejecting]
[22:23:19] I'm looking at the eqiad envoy
[22:23:30] ahh, i'm looking at the elasticsearch error logs :P
[22:24:01] so 2090 failing to join the cluster is a problem, but a different problem :P
[22:24:11] elastic2090 has a separate ticket, that has something to do with me accidentally assigning it to both secondary clusters
[22:24:28] I've fixed it but I guess there are some knock-on effects
[22:24:51] inflatador: for 2090 you'll have to delete the data dir probably
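(Editor's aside: a rough sketch of one way to confirm the cluster-UUID mismatch behind the join rejection above — Elasticsearch's root endpoint reports the cluster_uuid a node believes it belongs to, so comparing it across hosts shows which node carries stale state. The host list, scheme, and HTTP port below are assumptions; the log only shows the omega transport port 9500.)

```python
# Diagnostic sketch only: compare cluster_uuid across nodes of one cluster.
import json
from urllib.request import urlopen

HOSTS = ["elastic2090.codfw.wmnet", "elastic2092.codfw.wmnet"]  # example hosts
PORT = 9400  # assumed HTTP port for the omega cluster (9500 is transport)

for host in HOSTS:
    try:
        with urlopen(f"http://{host}:{PORT}/", timeout=5) as resp:
            info = json.load(resp)
        print(host, info.get("cluster_name"), info.get("cluster_uuid"))
    except OSError as exc:
        print(host, "unreachable:", exc)
```

A node reporting a different cluster_uuid than its peers (as 2090 does in the exception above) would match the "delete the data dir" suggestion at 22:24:51.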
[22:27:03] doubt it will make a difference, but I'm gonna depool the hosts from the last incident just in case
[22:27:05] from the cirrus side, it's seeing operation timeout, connection termination, upstream connect error, etc.
[22:27:50] https://logstash.wikimedia.org/goto/debfa1b408641527a15292f51d3ae1b4
[22:28:28] i have no way to know which host they were talking to, if it's localized or general
[22:29:23] the alert cleared immediately after I depooled those hosts...checking if that was a coincidence though
[22:30:21] poked over the cluster overview graphs, in the past we've had occurrences where 1 host max'd its cpu and stayed there, causing knock-on problems. But nothing obvious this time
[22:31:49] the errors in logstash are curiously bursty
[22:32:05] I believe our VIPs should be on the 'high-traffic' LVS hosts...possibly some got the VLAN fix and some didn't. Checking
[22:32:14] 22:18:00-22:18:30, 22:21:30-22:22:00, 22:27:30-22:29:00
[22:32:17] with no complaints between them
[22:34:07] they look to still be coming in, another burst
[22:34:21] alert is firing again
[22:36:09] from graphs, i'd say this started 21:47:20 or so
[22:36:32] at the same time the shard movement graph goes crazy
[22:36:58] it looks like every 2 minutes we have new unassigned shards? Is there a restart running?
[22:37:11] ebernhardson Y, ryankemper and I started one around that time
[22:37:12] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=now-3h&to=now&viewPanel=64
[22:37:26] hmm, maybe i'm only seeing bits of that and it's unrelated
[22:37:52] but lots of graphs start going wonky then :P
[22:38:18] i suppose i can pause SUP to rule that out
[22:39:40] consumer-search undeployed in eqiad
[22:40:43] i'm a bit suspicious of the restarts still...the errors coming from mediawiki in groups, instead of constantly, suggests there is something happening every couple of minutes to cause it
[22:41:40] could it be something with not properly draining the instances before shutting them down? Could this be the restart killing all live connections instead of finishing them before shutting down
[22:42:10] * ebernhardson doesn't know how to even get elastic to do that though, it always seems to drain cleanly :P
[22:42:48] we should probably stop the restarts before you roll back any of the SUP stuff
[22:43:10] ryankemper you around? Can you kill the eqiad restart? I can do it if not
[22:43:55] inflatador: I got it
[22:44:08] ryankemper too late, just killed it ;P
[22:44:33] sup isn't a big deal, cirrus is still doing writes so sup is basically just deployed but doing nothing useful
[22:46:18] so I think the cookbook got changed to restart envoy as well
[22:46:53] ahh, that seems like a possibility then
[22:47:49] actually i don't see anything in the code that does that
[22:47:57] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1023863 might have been mixing it up with this patch
[22:48:45] so far no errors since 22:44, but maybe too early to call yet
[22:50:38] unrelated, but i wonder if this suggests we should at some point move from nginx to envoy for tls termination, to get all the other things that come with envoy
[22:50:41] well, quasi-related
[22:51:05] in part i was thinking it would have been nice if we had per-server envoy metrics that showed where errors came from (maybe)
[22:51:15] We ran the same cookbook from codfw this morning, I wonder if we got any errors there
[22:51:27] I don't think the cookbooks have changed recently, but the TLS stuff has for sure
[22:52:03] logstash is still quiet, i'd say strong chance it was the restarts. But what about it :S
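(Editor's aside: an illustrative way to pin down the "was it the restarts?" question — cross-check the burst windows noted at 22:32 against per-node restart times from the cookbook run. All hostnames and timestamps below are hypothetical placeholders, not data from this incident.)

```python
# Illustrative only: do the error bursts line up with node restart times?
from datetime import datetime, timedelta

restarts = {  # node -> time its restart began (hypothetical values)
    "elastic1060": datetime(2024, 5, 6, 22, 17, 55),
    "elastic1061": datetime(2024, 5, 6, 22, 21, 20),
}
bursts = [  # error burst windows, shaped like the ones noted at 22:32
    (datetime(2024, 5, 6, 22, 18, 0), datetime(2024, 5, 6, 22, 18, 30)),
    (datetime(2024, 5, 6, 22, 21, 30), datetime(2024, 5, 6, 22, 22, 0)),
]
slop = timedelta(seconds=60)  # allow a restart slightly before the burst

for start, end in bursts:
    nearby = [n for n, t in restarts.items() if start - slop <= t <= end]
    print(f"{start:%H:%M:%S}-{end:%H:%M:%S}: {nearby or 'no restart nearby'}")
```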
[22:53:25] I'm curious if this has always been happening and we've just never cared previously because we weren't alerting on >.1% envoy requests failing
[22:53:38] quite possibly
[22:53:41] Still verifying the time, but it does look like we got more than .01% error rate in codfw during the restarts https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=search-https_codfw&from=now-12h&to=now&viewPanel=16
[22:54:41] could probably dig up a previous restart time and re-run the queries to verify, but on pondering it seems possible this always happened.
[22:55:56] Yeah, I'm voting for 'this always happened and we just never measured it'
[22:56:48] so then, what do we do about it? :P
[22:56:48] not sure of the next steps, but I guess we can worry about it tomorrow ;)
[22:56:56] sure, sounds reasonable
[23:03:52] I'm gonna take another look at the alert as well as what volume of errors we saw in the original incident that led to it
[23:07:06] > Upstream error rate (as recorded from Envoy) begins to rise from < 0.1% to above 1%
[23:07:11] (from https://wikitech.wikimedia.org/wiki/Incidents/2024-04-26_Search_unavailable_for_some_eqiad_users)
[23:07:45] I don't see value in alerting when we're >.1% error rate, that seems too low, I think we should bump it to .5% at least
[23:10:56] i dunno, i could go either way. This seems like a real problem. But we managed to not notice for potentially years, maybe that means it's too low of an error rate to matter? i dunno
[23:11:39] anyways i gotta run for kid activities, but will ponder
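(Editor's aside: a hedged sketch of the 22:54 idea — re-running the error-rate query over an earlier restart window — done against the Prometheus HTTP API directly. The Prometheus URL, the Envoy metric and label names, and the time window are assumptions here, not values taken from the dashboards above.)

```python
# Sketch only: pull a past window's upstream 5xx ratio from Prometheus.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM = "http://prometheus.example.org/ops"  # placeholder endpoint
QUERY = (
    'sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5",'
    'envoy_cluster_name="search-https_codfw"}[5m]))'
    ' / sum(rate(envoy_cluster_upstream_rq_total{envoy_cluster_name="search-https_codfw"}[5m]))'
)
params = urlencode({
    "query": QUERY,
    "start": "2024-05-01T14:00:00Z",  # hypothetical earlier restart window
    "end": "2024-05-01T16:00:00Z",
    "step": "30s",
})
with urlopen(f"{PROM}/api/v1/query_range?{params}") as resp:
    data = json.load(resp)
for series in data["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)
```

If earlier restart windows show the same bursts, that would support the "this always happened and we just never measured it" reading from 22:55.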