[10:50:33] dcausse: do you have 15mins
[10:56:39] ejoseph: sure
[11:03:24] ok
[11:18:24] lunch
[12:00:31] lunch
[13:03:49] greetings
[13:09:31] o/
[13:14:49] o/
[14:01:18] mpham: blazegraph meeting https://meet.google.com/yau-mkip-tqg
[14:53:42] new google doc is up for retro, please add to it if you have time! https://docs.google.com/document/d/1nqnrUBrrK7B_DpC9wivQuYwkPyAQhe7twG7R5Nd9TPQ/edit#heading=h.9bvnh6beyt6f
[15:01:44] \o
[15:07:44] can we shut down wcqs-beta-1 now? cloud sre needs to reimage the host it's running on
[15:23:22] o/
[15:27:31] i suspect we are done with wcqs-beta-1, it shouldn't be serving anything
[15:45:12] ^^ per conversation in https://phabricator.wikimedia.org/T304581#7802342 , I vote for option 2 as ebernhardson mentioned above. gehel ryankemper dcausse if you have any objections let me know. I'd like to update the ticket with a choice by my EoD
[15:46:19] Is there any reason to keep this VM now that we have beta 2 ?
[15:46:33] I think we could (and should) just delete it
[15:50:47] agreed I think we should delete it and repurpose it for something else
[16:52:47] not really puppet, but since it's puppet deploy window i wrote two cookbook patches: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/773562/1 and the parent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/773561/1
[16:55:41] just getting coffee, will join window shortly
[17:08:51] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/770978
[17:50:23] inflatador: one other thing to keep an eye on, wdqs2001 has been flapping the free allocators alert today but i suspect that will stop when you restart codfw: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=32
[17:50:41] ebernhardson thanks, will keep an eye out
[17:51:04] Also, I think I figured out why my audio was messed up. Apparently I was in another google meet at the same time
[18:11:37] had some weirdness with the wdqs cookbook run for eqiad, I think because the match in cumin (wdqs[1003-1008].eqiad.wmnet) does not reflect what's in the puppet for confctl/LVS https://github.com/wikimedia/puppet/blob/production/conftool-data/node/eqiad.yaml#L299
[18:13:36] However, all the hosts that were correct in both cumin and confctl appear to have restarted cleanly, still checking
[18:50:30] inflatador: ebernhardson: yeah RE https://phabricator.wikimedia.org/T304581#7802342 yeah let's just shut it down entirely IMO
[19:17:48] Just saw this bug come in: https://phabricator.wikimedia.org/T304646
[19:20:49] in theory this should be the 5xx rate for wcqs, it is showing a few more 5xx than typical: https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=text&var-origin=wcqs.discovery.wmnet&viewPanel=12&from=now-3h&to=now
[19:31:34] inflatador: we're in https://meet.google.com/eki-rafx-cxi
[19:39:03] mpham: the error rate is still really low, so not sure that it is significant
[19:39:42] we had a quick look into that ticket with Ryan / Brian, nothing obvious going on at the moment. Given the very limited info in the ticket, I doubt we can find anything more.
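
[editor's note] The cookbook weirdness at 18:11 comes down to the cumin host glob and the conftool-data entries drifting apart. As a rough illustration only (not a supported tool), here is a Python sketch that expands the same bracket-range glob with ClusterShell's NodeSet and diffs it against the conftool-data YAML. The dc → cluster → host layout of conftool-data/node/eqiad.yaml and the 'wdqs' cluster key are assumptions about that file, not something stated in the log.

```python
#!/usr/bin/env python3
"""Sketch: compare the hosts a cumin-style glob expands to against the
hosts registered for a cluster in conftool-data, to spot the kind of
mismatch described above. Assumes the dc -> cluster -> host -> [services]
layout for conftool-data and that the cluster key is 'wdqs'."""
import sys

import yaml
from ClusterShell.NodeSet import NodeSet  # understands wdqs[1003-1008]-style ranges

CUMIN_QUERY = "wdqs[1003-1008].eqiad.wmnet"
CONFTOOL_FILE = "conftool-data/node/eqiad.yaml"  # path inside a puppet checkout


def main():
    cumin_hosts = set(NodeSet(CUMIN_QUERY))
    with open(CONFTOOL_FILE) as f:
        data = yaml.safe_load(f)
    # hosts that conftool/LVS knows about for the wdqs cluster in eqiad
    confctl_hosts = set(data["eqiad"]["wdqs"].keys())

    only_cumin = sorted(cumin_hosts - confctl_hosts)
    only_confctl = sorted(confctl_hosts - cumin_hosts)
    if only_cumin:
        print("in cumin query but not in conftool-data:", ", ".join(only_cumin))
    if only_confctl:
        print("in conftool-data but not in cumin query:", ", ".join(only_confctl))
    return 1 if (only_cumin or only_confctl) else 0


if __name__ == "__main__":
    sys.exit(main())
```
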
[19:39:55] Maybe David or Erik have ideas on how to investigate more
[19:40:06] not in particular :)
[19:40:08] seems related to oauth given the last comment but unsure
[19:40:28] and the 500 seems to be coming from nginx, which would confirm
[19:41:19] seeing "/oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms" on wcqs1001
[19:42:31] hmm, that's quite odd. The /check_auth endpoint doesn't do any IO, it just checks the cookie against the secret and says yes or no
[19:46:50] ok. wanted to make sure it wasn't anything breaking/super serious. It sounds like there isn't that much detail; if there's anything else we can ask for, maybe the person who filed it can give us more info
[20:32:45] seeing a lot of 'eventgate-analytics.discovery.wmnet' errors in blazegraph
[20:33:35] Do we care about this? Looks like they've been logging for at least a few wks
[20:35:09] hmm, that would be the per-query event logging. That's the kinda data that andrea has been analyzing as part of deciding how to switch. I'm not sure how internally resilient it is to those errors, if that means dropping events or they just retry later or what
[20:37:07] I'm only looking at a single host ATM, 148 connection failures in the last 3 wks or so. Doesn't seem that terrible, but it's a bit off-putting when you see it in 'systemctl status bg' or whatever
[20:38:52] oh i wonder, i think we had this exact same problem a few months ago
[20:39:17] * ebernhardson tries to remember
[20:43:17] mostly failing :P It does have an internal queue that gets pushed though so seems plausible as long as it's getting through most of the time it shouldn't be losing anything.
[20:57:08] ah OK, I'll ignore for now
[21:19:43] out for school run
[22:01:19] back
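
[editor's note] On the /check_auth point at 19:42: the real wcqs handler isn't shown in this log, but the description given there (check the cookie against the secret, answer yes or no) amounts to pure computation with no I/O, which is why a TimeoutException from that path is surprising. A minimal Python sketch of that idea, assuming an HMAC-signed cookie format; the secret and cookie layout are made up for illustration.

```python
"""Sketch of a /check_auth-style check: validate a signed session cookie
against a server-side secret. Not the actual wcqs code; the cookie format
"<payload>.<hex hmac of payload>" is an assumption for illustration."""
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical; a real service reads this from config


def check_auth(cookie_value: str) -> bool:
    """Return True if the cookie's signature matches, False otherwise."""
    try:
        payload, signature = cookie_value.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # constant-time comparison; the answer is just yes or no, no I/O involved
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    payload = "user=example"
    good = payload + "." + hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    print(check_auth(good))                    # True
    print(check_auth(payload + ".deadbeef"))   # False
```
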
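[editor's note] On the eventgate-analytics connection failures (20:32–20:43): the point being made is that a client-side queue which keeps retrying makes occasional connection failures harmless, as long as events eventually get through. An illustrative Python sketch of that pattern follows; it is not the actual wdqs/Blazegraph event sender, and the endpoint URL, port, and payload shape are assumptions.

```python
"""Illustration of an 'internal queue that gets pushed' with retry on
connection failure. Endpoint and event shape are hypothetical."""
import json
import urllib.request
from collections import deque

EVENTGATE_URL = "https://eventgate-analytics.discovery.wmnet:4592/v1/events"  # hypothetical


class EventBuffer:
    def __init__(self, max_size=10000):
        # bounded queue: if the endpoint is down for a long time, oldest events drop
        self.queue = deque(maxlen=max_size)

    def add(self, event: dict):
        self.queue.append(event)

    def flush(self) -> bool:
        """Try to push everything; on a connection failure, keep the
        remaining events and retry on the next flush."""
        while self.queue:
            event = self.queue[0]
            try:
                req = urllib.request.Request(
                    EVENTGATE_URL,
                    data=json.dumps([event]).encode(),
                    headers={"Content-Type": "application/json"},
                )
                urllib.request.urlopen(req, timeout=5)
            except OSError:
                # e.g. connection refused/timeout, like the errors in the logs
                return False
            self.queue.popleft()  # only drop the event once it was accepted
        return True
```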