[09:58:43] lunch
[10:43:19] lunch
[12:21:04] dcausse: I release the parent pom
[12:33:58] gehel: thanks!
[13:00:56] greetings
[13:03:21] o/
[13:14:11] do we need to be concerned about these airflow alert emails, subject is like " inflatador: yes!
[13:15:20] Either it is a problem that needs to be addressed, or it is noise that needs to be reduced.
[13:19:30] Indeed. So what do you think: problem, or noise?
[13:31:04] I think it's a known issue but yes we need to address it at some point
[13:31:17] but we might need help from data-eng on this one
[13:38:04] it's related to Erik's last comment here: https://phabricator.wikimedia.org/T303831#8101961
[13:38:33] lots of new spotbugs warnings with the new parent pom
[13:39:05] and one that I can't seem to understand/fix nor silence: THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION
[13:46:21] ah, that comment sounds familiar
[14:21:53] do you think we should reach out to data-eng? Maybe wait until e-bernhardson gets in and ask his opinion?
[14:47:59] dcausse: about THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION, do you have a link to the code?
[14:48:25] dcausse: I just wrote T313813, do you have time to review / comment / correct?
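If the THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION warning does end up being treated as a false positive, SpotBugs supports silencing a single bug pattern project-wide via an exclude filter. A sketch (the filename and Maven wiring are illustrative; only the pattern name comes from the log):

```xml
<!-- e.g. spotbugs-exclude.xml, referenced from the spotbugs-maven-plugin's
     <excludeFilterFile> configuration (illustrative wiring) -->
<FindBugsFilter>
  <Match>
    <Bug pattern="THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION"/>
  </Match>
</FindBugsFilter>
```

A narrower alternative would be annotating just the offending method with `@SuppressFBWarnings("THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION")` from spotbugs-annotations.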
[14:48:26] T313813: API Gateway to provide authorization and capacity management for W[CD]QS - https://phabricator.wikimedia.org/T313813
[14:48:46] sure
[14:49:56] gehel: it's this method: https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master/eventutilities/src/main/java/org/wikimedia/eventutilities/core/event/JsonEventGenerator.java#168
[14:50:39] I wonder if it's catch (ExecutionException | UncheckedExecutionException e)
[14:52:12] or perhaps the lambda of the guava Cache
[14:53:36] perhaps because of Callable: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Callable.html#call--
[14:55:03] also not super clear to me why this is raised :/
[14:55:29] I think it prevents you from writing a method that "throws Exception"
[14:55:43] but here it's a contract I follow (Callable)
[14:55:55] seems like a false positive to me
[14:58:05] might be related: https://github.com/spotbugs/spotbugs/issues/2040
[14:59:37] trying with spotbugs 4.7.1
[15:00:48] or should we just ignore that rule globally?
[15:01:22] perhaps?
[15:01:45] SD / Search meeting starting: https://meet.google.com/rjx-uhsq-zqs
[15:01:53] ebernhardson: ^
[15:02:05] sec
[15:42:39] ejoseph I'm handing off note-taking to you, LMK if this is not cool
[15:43:58] It's cool for me
[15:57:06] dinner time!
[16:00:50] workout, back in ~30m
[16:02:09] * ebernhardson needs a better reindexing process than printing out the non _doc indices alphabetically and retrying the ones that come earlier in the alphabet than where the reindexer currently is
[16:14:29] 8.2k indices migrated at least, another 2k to go
[16:18:42] errand
[16:34:53] back
[16:44:14] I'm writing a simple python script to grab real-time ES info (active cluster members, OS, row/rack etc) from /_nodes , does anyone have something like that already?
Otherwise I'll whip something up
[16:50:31] inflatador: if this is for monitoring there's a python script in puppet IIRC
[16:51:16] dcausse it's for monitoring progress of the bullseye/elastic upgrades, but I'll check puppet
[16:51:28] hit me up if you have a link
[16:52:23] inflatador: probably thinking of modules/prometheus/files/usr/local/bin/prometheus-wmf-elasticsearch-exporter.py in puppet, but it doesn't happen to talk to _nodes
[16:52:44] inflatador: or maybe modules/elasticsearch/files/es-tool.py (should maybe be dropped as unused these days?)
[16:53:05] es-tool was pre-cookbooks/spicerack
[16:53:32] i think we had salt stack back then, i don't remember anyone being a huge fan though :P
[16:54:19] Yeah, saltstack was the first config mgmt tool I used...let's just say I moved on pretty quickly
[16:54:47] I had a friend who moved to Utah to work there, instant regrets
[16:55:43] perhaps curl + jq might be doable?
[16:57:20] that would work, but it's easier (for me at least) to format stuff in Python. I think I'll give it a shot
[16:57:29] then g-ehel can help me refactor ;P
[17:00:07] good point! :)
[17:10:05] * ebernhardson doesn't know how to feel about puppet templating using `elsif`. As if `else if` and `elif` weren't enough variants :P
[17:10:11] and elseif in php i guess
[17:14:16] It's just like people naming their kid Ashleigh or Linzee or whatever, just to be different ;)
[17:19:42] lol, i suppose so :)
[17:23:49] ryankemper did you see the bug from joe in mediawiki_security? We need to pull elastic2049 out of pybal permanently. Do you know if that's conftool, or do we have to make a puppet patch as well?
[17:41:12] update: decommissioned elastic2049 with confctl (ref https://wikitech.wikimedia.org/wiki/Conftool#Decommission_a_server ). I do believe we'll need a puppet patch as well
[17:41:20] In the meantime, lunch!
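The /_nodes poller discussed above might start out as a stdlib-only sketch like this one; the base URL, the choice of the `/_nodes/os` endpoint, and the `row`/`rack` attribute key names are assumptions to check against the actual cluster configuration:

```python
import json
import urllib.request


def summarize_nodes(nodes_doc):
    """Flatten an Elasticsearch /_nodes response into one row per node.

    The `row`/`rack` keys assume the cluster publishes those custom node
    attributes; adjust to whatever attribute names the cluster actually uses.
    """
    rows = []
    for node in nodes_doc.get("nodes", {}).values():
        attrs = node.get("attributes", {})
        os_info = node.get("os", {})
        rows.append({
            "name": node.get("name"),
            # older ES versions may lack os.pretty_name, so fall back to os.name
            "os": os_info.get("pretty_name") or os_info.get("name"),
            "row": attrs.get("row"),
            "rack": attrs.get("rack"),
        })
    return sorted(rows, key=lambda r: r["name"] or "")


def fetch_nodes(base_url="http://localhost:9200"):
    """Fetch node info (OS metric plus basic node attributes) from the cluster."""
    with urllib.request.urlopen(base_url + "/_nodes/os") as resp:
        return json.load(resp)
```

Against a live cluster, something like `for r in summarize_nodes(fetch_nodes()): print(r)` would give a quick progress view of which hosts are on which OS.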
Back in ~1 h
[17:52:35] dinner
[18:01:22] Back now
[18:01:36] inflatador: thanks for looking at 2049
[18:23:39] back
[18:23:48] ryankemper np
[18:29:31] Looks like Dominic is blocked on T307391. We already estimated, could we move this forward sooner rather than later? (cc: ryankemper, inflatador, ebernhardson)
[18:29:31] T307391: Enable CORS support for WCQS SPARQL endpoint access - https://phabricator.wikimedia.org/T307391
[18:30:35] gehel: i looked at it yesterday, i can add some random headers but not sure where is appropriate. Does the primary html need the header? is it the responses from wikidata (they come from different servers). Do we attach the header to everything regardless? (feels hacky, but probably works)
[18:30:44] err, the responses from wdqs
[18:31:21] I'm not sure :/
[18:32:08] i suppose i also got a bit confused because they say it works for wdqs, but i couldn't find where the header was being returned there (my initial thought was return it wherever query.wikidata.org does)
[18:33:28] oh i guess i wasn't paying enough attention before, i see now that query.wikidata.org/sparql returns the header. will have to see why commons-query.wikimedia.org/sparql doesn't
[18:33:58] hmm, no wcqs does return it there already. So must be something else
[18:34:45] is it failing in the auth flow prior to CORS? Because a known limitation of the existing auth is you cannot auth from an xmlhttprequest
[18:34:55] (the browser rejects it)
[18:36:03] i can add those notes to the ticket i guess, but the problem with authing through redirect bounces during xmlhttprequest is unrelated to cors iirc
[18:37:09] gehel ryankemper exact cmd I ran: ` confctl decommission --host elastic2049`
[18:37:45] `WARNING:conftool.announce:conftool action : set/pooled=inactive; selector: name=elastic2049`
[18:44:40] decom ticket: https://phabricator.wikimedia.org/T313842
[19:44:09] nginx is an eternal footgun, just like apache :P "There could be several add_header directives.
These directives are inherited from the previous level if and only if there are no add_header directives defined on the current level."
[19:44:49] add a caching header? now we drop your x-served-by header :P
[19:57:49] never a good sign when there's a separate module for what seems like basic web server functionality https://github.com/openresty/headers-more-nginx-module#readme
[20:07:49] joe also noted that wdqs1004 is critical and icinga, I don't have any alerts in my email, does anyone else?
[20:08:08] critical **in** icinga, that is. It's possible I'm accidentally filtering it
[20:08:58] hmm, not sure about that one. I have puppet disabled on all wcqs instances right now, but that's unrelated
[20:09:34] looks like it's tossing out 503s, at least according to icinga
[20:09:59] hmm, does pybal auto-depool it? if not we should
[20:10:24] logstash says 1004 is throttling all the requests
[20:10:48] ebernhardson I don't know how to read https://config-master.wikimedia.org/pybal/eqiad/wdqs . "enabled" doesn't tell me if it's passing health checks or not, does it?
[20:10:50] or well, the only recent logs are about throttling. maybe not the same :)
[20:11:13] inflatador: hmm, sadly i don't know either. looking
[20:11:33] i'm pretty sure that config-master isn't telling us though, that it's saying the state of config rather than the state after taking health-checks into account
[20:12:07] i suppose icinga should have something to say usually about health checks failing and kicking it out
[20:12:23] https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_wdqs_outage first hit in wikitech
[20:13:00] but icinga is *slow*...will take 10+ minutes to find anything in there
[20:13:15] looks like the alerts are LVS-centric rather than service-centric?
[20:13:39] they should be per-host, example message is "PROBLEM - PyBal backends health check on lvs2009 is CRITICAL"
[20:14:52] lol, loading icinga.wikimedia.org crashed the browser tab :P
[20:14:58] * ebernhardson tries another browser
[20:18:33] inflatador: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=wdqs1004 is rather not clear, but it doesn't seem like the health check is failing. I've run `sudo depool` on wdqs1004 now
[20:20:32] yeah, I don't get it. I'm also wondering why we didn't get notified on these alerts. I can't find the alerts from 2022-03-27, did you get those in your email? Wondering if I need to be added to a list or something
[20:21:23] inflatador: nope, no alerts coming through today, but i suppose i'm not sure if those alerts ever came into discovery-alerts
[20:22:50] querying blazegraph directly on 9999 (maybe with a badly formed request, not sure) says "Service load too high, please come back later :)"
[20:23:24] hmm
[20:23:38] using `localhost:9999/bigdata/namespace/wdq/sparql?query=%20ASK%7B%20%3Fx%20%3Fy%20%3Fz%20%7D` which should be the same as the readiness-probe
[20:24:03] should be well formed, same request on 1005 returns an xml document that says 'true'
[20:24:19] maybe restart blazegraph on 1004?
randomly guessing it seems stuck, perhaps same issue as the deadlock before
[20:24:30] or we can leave it depooled for someone who knows more to investigate
[20:24:36] (==david :P)
[20:25:18] hmm, looking if we have better details on how to see if it's the same deadlock
[20:25:28] oop, just restarted
[20:25:40] no worries, random rare deadlocks are a thing blazegraph does it seems :P
[20:27:22] scrolling back my email history for a search on `icinga alert` shows i also didn't get alerts for the 2022-03-27 ones
[20:27:41] yeah, I was checking SAL and they didn't fire there either
[20:29:04] 1004 looks happy now, repooling
[20:29:27] * ebernhardson realizes he's only used `sudo depool`, have to check how repool works :P
[20:29:34] ebernhardson: `sudo pool`
[20:29:41] hmm, I guess icinga doesn't actually make it into the SAL
[20:29:48] all updated, it should be back in now
[20:29:53] icinga alerts don't show in SAL
[20:30:29] ryankemper ACK, did you get any alerts for wdqs in your email today or during that March outage? I don't see any
[20:30:41] it seems like the health check should have failed and it should have been auto-depooled, i was querying basically a manual version of the readiness-probe endpoint which i think is what lvs is checking
[20:31:46] I think I remember seeing some emails for wdqs1004 over the weekend, lemme check
[20:32:15] yeah july 23rd: `** PROBLEM alert - wdqs1004/Query Service HTTP Port is CRITICAL **`
[20:32:17] double checked lvs config (in hieradata/common/service.yaml), lvs should be using the readiness-probe endpoint. hmm
[20:33:40] ryankemper interesting, what email address did those alerts go to?
[20:34:32] I have from `nagios@alert1001.wikimedia.org` to `rkemper@wikimedia.org`
[20:35:13] I don't remember having had to set anything up specifically to get those, but maybe it was an onboarding step somewhere that I don't remember?
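The manual readiness check described above (the ASK query against local blazegraph, and the "Service load too high" vs XML-`true` responses) can be sketched as a small stdlib helper; the base URL and namespace path are taken from the log, while the function names and the exact response-matching heuristic are illustrative:

```python
import urllib.parse
import urllib.request

# The same trivial ASK query the readiness-probe URL in the log encodes
ASK_QUERY = "ASK { ?x ?y ?z }"


def readiness_url(base="http://localhost:9999"):
    """Build a readiness-probe style URL against a local blazegraph."""
    return base + "/bigdata/namespace/wdq/sparql?query=" + urllib.parse.quote(ASK_QUERY)


def looks_ready(body):
    """Heuristic on the response body: a healthy instance answers with an XML
    document containing <boolean>true</boolean>, while an overloaded one
    returns the plain-text "Service load too high" throttling message."""
    return "Service load too high" not in body and "<boolean>true</boolean>" in body


def check(base="http://localhost:9999", timeout=5):
    """Fetch the probe URL and report readiness; network errors count as not ready."""
    try:
        with urllib.request.urlopen(readiness_url(base), timeout=timeout) as resp:
            return looks_ready(resp.read().decode("utf-8", "replace"))
    except OSError:
        return False
```

Running `check()` on a host like wdqs1005 should return True, while a stuck instance answering with the throttling message would return False.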
[20:37:12] not sure myself, I do remember some steps related to icinga
[20:37:18] created phab task for this BTW https://phabricator.wikimedia.org/T313855
[20:39:05] Not convinced we need a ticket, but I might be missing context
[20:39:29] We had one host down and not depooled, which is not good, but I wouldn't call that an outage
[20:39:53] Seems like we'd want just a general ticket to figure out the process to get nagios emails coming through properly if they're not
[20:40:13] and/or maybe porting to alertmanager alerts where we have multiple alternatives for firing (irc, email, etc)
[20:40:14] Yeah, I guess "outage" is a little too loaded of a term
[20:40:43] OK, fixed the title, and yeah we do need to figure out where that comes from
[20:41:04] anyway i'd have expected pybal to depool this, since it will depool a host failing health checks if there's not already too many depooled
[20:41:19] I'd guess in this case that the health check itself was passing, but need to check that
[20:41:46] does observability own alerting? If so, we might just ask them in IRC
[20:42:16] i suppose it depends what passing means, sadly curl hides status codes so i'm not sure what the http status code was in the response, but it was certainly an error message that i would have expected to be paired with a 5xx error code (probably 503)
[20:42:39] "Service load too high, please come back later :)"
[20:42:47] it was def a 503, I saw it in icinga
[20:43:17] inflatador: I'd ask either o11y or just ask in #wikimedia-sre, I'd bet some people outside of o11y know what's involved in getting the nagios emails
[20:44:17] per https://phabricator.wikimedia.org/T297907 it might be that the icinga emails are handled by victorops, altho that seems a bit weird
[20:45:20] inflatador: take a look at https://portal.victorops.com/dash/wikimedia#/routekeys , do you see SRE: email optin under the `icinga` routing key?
[20:47:12] ryankemper I do
[20:48:55] inflatador: hmm yeah I'd ask in #wikimedia-sre then if there's anything particular that needs to be done to get emails from nagios/icinga
[20:49:39] I found the one from Mar 27 in my email, it's possible the other ones are being eaten by a filter
[20:50:53] inflatador: ah yeah try clicking `All mail` and see if they show up
[20:51:34] looks like I only got alerts for May 9 and Mar 27
[20:51:50] they have " #page" in them, so I'm guessing those actually did page somebody?
[20:52:29] Please add a space or _ between # and p
[20:52:54] oops, sorry RhinosF1, I hope I didn't summon everyone ;(
[20:53:06] Probably not many
[20:53:14] But they might be a few
[20:54:57] ryankemper can you fwd one of the alerts you got for wdqs1004 to me? Just curious what they look like
[20:55:42] inflatador: observability own alerting ye
[20:55:48] Probably go.dog best
[20:56:15] inflatador: done
[20:56:47] inflatador: what was the march 27 one you saw? was it a similar format as what i just forwarded or did it actually come from `alert-manager`
[20:58:15] similar to yours, will fwd
[21:02:02] brb
[21:16:14] back
[21:17:24] inflatador: the email you got is to alerts@wikimedia.org
[21:17:38] i'm guessing #p.age emails do that
[21:17:45] but you're missing the general alert emails to your specific email
[21:36:12] heh, php 8.1 is getting `public readonly int $foo` as class variables. Because i guess final sounded too much like java :P
[22:17:33] ^ haha
[22:17:54] I do kind of like `readonly`, feels like it conveys a little more intuitively what it's doing
[22:39:24] yea it's not bad, and it probably has different exact semantics from final so the different name might prevent some confusion.
[22:40:10] also we have a *lot* of wikis that start with the letter s, approx 10% of all wikis
[22:40:27] mostly i'm noticing the reindex was on letter s when i started my day, and it's still there even though most of the wikis aren't *that* big :P
[22:44:22] also, jq is friggin magic. Extract a specific file from a pcc catalog: cat change.blahblah.pson | jq -r '.. | .content? | select(index("Access-Control"))' > wcqs.nginx
[22:46:45] i suppose it did manage to print the file twice, not sure why :P
[23:05:39] ^ Take note, kids! Erik calling something magic is high praise, indeed!
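For comparison, the jq one-liner above can be mimicked in Python; the `content` key and the "Access-Control" needle come from the log, while the catalog shape in the example is assumed. It also suggests one plausible reason the file printed twice: jq's `..` visits every node, so any catalog with two matching `content` values (say, the same file declared in two resources) emits both.

```python
def iter_matching_content(node, needle):
    """Recursive descent like jq's `.. | .content? | select(index(needle))`:
    yield every string stored under a "content" key, anywhere in the tree,
    that contains the needle. Matches are yielded once per occurrence, so
    duplicated content in the catalog produces duplicated output."""
    if isinstance(node, dict):
        content = node.get("content")
        if isinstance(content, str) and needle in content:
            yield content
        for value in node.values():
            yield from iter_matching_content(value, needle)
    elif isinstance(node, list):
        for value in node:
            yield from iter_matching_content(value, needle)
```

Feeding it `json.load(open("change.blahblah.pson"))` and printing the results would roughly reproduce the jq pipeline, duplicates and all.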