[07:35:47] We don't have much in our standup notes this week. Is it because nothing got done? Or because we forgot to update them?
[07:36:03] https://etherpad.wikimedia.org/p/search-standup
[07:52:04] gehel: added a few more items
[07:57:45] dcausse: thanks!
[08:22:54] Update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-06-14
[08:28:23] Oh. Just added notes.
[08:28:33] My bad.
[08:38:54] gehel: would you mind if I amended the weekly update wiki page?
[10:26:33] lunch
[12:45:20] o/
[12:58:15] o/
[13:36:40] pfischer: please do! And let me know when done so that I can report those changes in a few other places
[13:38:00] gehel: I already updated it on the wikitech page you linked this morning
[13:38:27] thanks!
[14:55:01] going offline
[14:58:45] .o/
[15:10:12] \o
[15:10:32] * ebernhardson wonders if a secret would be better than getting fancy with network routing
[15:10:58] but secrets are tedious in other ways :P
[16:55:09] * ebernhardson lols...just registered an account with something, provided a 32-character password. On login: The field Password: must be a string or array type with a maximum length of '26'.
[16:55:50] apparently when you submit the account it just shortens your password
[17:08:51] ebernhardson: so it worked with the substring? - was that secret remark about the private wiki extension that grants access to the SUP (T357353)?
[17:08:52] T357353: Application Security Review Request : NetworkSession MediaWiki extension - https://phabricator.wikimedia.org/T357353
[17:10:20] pfischer: the secret is for allowing federation between RDF query instances to skip throttling. The other option is to make sure the federation requests get routed internally instead of through the edge, and then only throttle requests with headers our edge sets
[17:10:43] pfischer: and curiously yes, i backspaced a few times and it accepted the password. craziness
[17:11:23] but looking at secrets more...there is too much to do. probably not worthwhile. Things like the secret would end up in the java command line, and perhaps end up in unexpected places. But re-working the rdf configuration is probably out of scope
[17:12:03] or we could explicitly source the secret out-of-band from normal config, pull from the environment directly, but seems odd
[17:12:55] dav.id has a plausible plan for the networking bits, looking at that more closely now
[18:25:47] Looking at a couple of CODFW elastic hosts that crashed on Tuesday https://wm-bot.wmcloud.org/browser/index.php?start=06%2F11%2F2024&end=06%2F11%2F2024&display=%23wikimedia-operations . Looks like backpressure from the CODFW logstash hosts being down may have contributed(?)
[18:26:28] Wondering if anyone is aware of logstash host problems having caused issues for our elastic hosts in the past?
[18:30:14] inflatador: hmm, i can't say i've seen that in the past
[18:30:42] what am i looking at in the wm-o link? it's 2k lines :P
[18:31:42] ebernhardson it's a whole day's worth of IRC logs, sorry. The logstash thing is a reach, but I do see we have a few CODFW hosts with logstash crashlooping
[18:33:07] hmm, i'd have to review the flow but i think the logs go from elastic->kafka->logstash, in that event logstash having problems shouldn't matter. But i don't have the best grasp on how logging has evolved into the current system, there are a few ways it could flow
[18:35:16] we have a local logstash instance running on all the elastic hosts, but I haven't looked at it much. Here's a cut down version of that IRC output https://phabricator.wikimedia.org/P64985
[18:36:01] oh right, hmm
[18:37:00] it looks like logstash just forwards gelf over tcp on localhost:12201 to json lines over udp on localhost:11514
[18:37:01] yeah, looks like it goes from logstash to rsyslog
[18:37:25] guessing rsyslog doesn't go directly to logstash hosts but will check
[18:37:44] based on the configs, it depends :P
[18:39:13] well, it looks like port 11514 specifically goes direct to kafka
[18:39:25] from 50-udp-json-logback-compat.conf
[18:39:31] looks like it's what you said, it goes via kafka or maybe some go directly to centrallog
[18:40:18] it looks like logstash can do some direct logging outside kafka, but i'm not sure which conditions trigger it
[18:40:22] s/logstash/rsyslog/
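
A minimal sketch of the hop just described, sending one JSON log line over UDP to the local listener on localhost:11514. The port comes from the discussion above; the field names are illustrative guesses, not the actual logback-compat schema that rsyslog/kafka expect.

```python
import json
import socket
from datetime import datetime, timezone

# Illustrative only: emit one JSON log line over UDP to the local rsyslog
# json-logback-compat listener mentioned above (localhost:11514).
# Field names are placeholders, not the real schema.
record = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "host": socket.gethostname(),
    "level": "INFO",
    "message": "test log line from an elastic host",
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto((json.dumps(record) + "\n").encode("utf-8"), ("127.0.0.1", 11514))
sock.close()
```
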
[18:41:09] Yeah, I guess it's a pretty big reach to say logstash hosts themselves were the problem. Will check for kafka backpressure alerts, but my theory is looking shaky
[18:41:49] it's still quite odd for machines to just fall out :S but i guess you get enough machines and it happens. Paired is suspicious though
[18:42:45] reminds me of how sql master instances used to be chosen when the old instance aged out, iirc it was whichever slave had been running the longest
[18:42:59] hardware had proven itself :P
[18:43:18] yeah, 3 alerts for logstash systemd crashlooping, plus 2 hosts that never alerted, but completely locked up
[18:44:50] which of our hosts failed out? Did they have anything pre-fail in syslog/etc?
[18:45:11] if it all happened in a second it might not, but maybe it was struggling for a few minutes and logged something?
[18:46:19] there are some kafka consumer lag alerts in #observability around the same time, still very tenuous though. I just started working on this, haven't looked at the hosts (elastic2088 and 2099) yet. Ticket is T367435
[18:46:20] T367435: Determine why elastic2088 and elastic2099 did not alert when unresponsive and fix - https://phabricator.wikimedia.org/T367435
[18:48:00] LOL, I just went to check on elastic2099 and it's down again! And we never got any alerts AFAIK
[18:48:37] ok, that makes sense then
[18:48:41] i noticed 2099 wasn't in the graphs :P
[18:49:10] and 2088 was actually out for days, it failed around 2024-06-06 11:00
[18:51:05] 2088 shut itself down
[18:51:24] I just rebooted 2099, let's see if it comes back up
[18:51:55] in elastic2088:/var/log/syslog.4.gz can clearly see the host going through shutdown
[18:52:09] and then the next log is the 11th
[18:53:51] i don't understand why it's shutting down though, i don't see anything
[18:57:52] I'm also not getting why we didn't get emails, let me check the puppet contact groups
[19:01:00] are there logs of the things cumin/spicerack does? Server shut down at 06:10, cumin connected a few times as root in the minute before shutdown
[19:01:10] or at least, /etc/ssh/userkeys/root.d/cumin was used
[19:01:56] from cumin2002.codfw.wmnet
[19:02:14] * ebernhardson unsurprisingly gets permission denied there :P
[19:05:05] interesting, will take a look
[19:05:49] it definitely could be a cookbook gone wrong, but we only issue reboots, not shutdowns
[19:06:31] curiously i'm not finding the actual shutdown command. I guess i don't know if it would be logged or what it would look like...but 20 years ago it certainly did :P
[19:08:35] but elastic2088:/var/log/auth.log.4.gz clearly shows cumin connecting at 06:06:50, 06:07:12, and 06:07:19. At 06:07:20 is the last log message i find on the host
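
A rough sketch of that auth.log check, assuming the rotated log path from the message above and that cumin logins show up as sshd "Accepted publickey" lines mentioning cumin. In practice zgrep does the same job; this just makes the two pieces of evidence (cumin logins, last entry the host managed to write) explicit.

```python
import gzip

# Sketch of the check described above: list cumin connections in a rotated
# auth.log and note the last entry the host managed to log before going dark.
# The path is from the discussion; matching on "cumin"/"Accepted" is an
# assumption about how the sshd publickey lines look.
LOG = "/var/log/auth.log.4.gz"

last_line = None
with gzip.open(LOG, "rt", errors="replace") as fh:
    for line in fh:
        if "cumin" in line and "Accepted" in line:
            print(line.rstrip())
        last_line = line

if last_line is not None:
    # syslog lines start with "Mon DD HH:MM:SS"; good enough for eyeballing.
    print("last entry on host:", " ".join(last_line.split()[:3]))
```
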
[19:08:50] Yep, I see it on the cumin side too `:2024-06-06 06:05:46,608`
[19:09:37] hmm
[19:09:41] looks like it was a cookbook rolling operation
[19:10:25] first mystery solved, next mystery started :P
[19:11:19] so that would've caused an alert suppression. I guess I'll look at the cookbook code next. It's my understanding the suppression should only last a day
[19:11:20] i guess we could poke 2099 and make sure it's the same
[19:11:34] 2099 hasn't come back up, I've tried rebooting from DRAC twice to no avail
[19:11:42] was about to punt that one to DC Ops
[19:12:03] so it seems like, instances that fail on reboot?
[19:12:24] i guess that's a long-known problem with servers, although with less spinny stuff i thought it was less common these days
[19:12:42] solar flares. Gotta be solar flares
[19:13:42] i guess i don't know what exactly the cookbook does, but i'm a little surprised it continued without seeing the host come up
[19:14:58] Agreed. Still sifting thru the logs
[19:37:56] yeah, suppressions should only last an hr, per https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/elasticsearch/rolling-operation.py#L183
[19:38:59] i suspect the problem with alerts is the alerts only fire if there is data, if nothing comes in then most alerts don't fire?
[19:39:24] but i guess there should at least be some generic "server is running" check somewhere
[19:44:25] weird part is the servers definitely show as down in icinga, and we even got an IRC recovery message for a ping alert
[19:45:46] i'm not completely certain, but it looks like it should have thrown an exception from wait_for_elasticsearch_up
[19:46:42] I'm running down the alert side, but if you wanna check the cookbook logs I put them in your homedir on elastic2088
[19:46:51] thanks, i'll look through it
[19:51:10] hmm, so it gets 240 attempts. It uses up all 240 attempts and then tries to start replication. that fails and we get [ERROR _menu.py:277 in _run] Exception raised while executing cookbook sre.elasticsearch.rolling-operation
[19:51:30] but based on the timing of log messages, it doesn't seem to shut down
[19:52:01] oh nevermind, i just had to scroll further. The cookbook failed at 07:30:40, then was started back up at 20:02:20
[19:52:34] so the cookbook did fail, but perhaps the output was cluttered with a bunch of things, and we didn't get alerts for missing hosts
[19:52:42] so it wasn't clear that a host was down
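
A schematic sketch of the retry-budget pattern described above: poll a health endpoint for a fixed number of attempts and raise once the budget is exhausted, so the run fails loudly instead of quietly moving on. This is not the actual spicerack/cookbook code; the URL, delay, and bare HTTP check are illustrative assumptions.

```python
import time
import urllib.request


class NodeNeverCameBack(Exception):
    """Raised when the health check never succeeds within the attempt budget."""


def wait_for_node_up(url: str, attempts: int = 240, delay: float = 10.0) -> None:
    # Keep polling a health endpoint; once all attempts are used up, raise so
    # the surrounding rolling operation aborts rather than continuing with a
    # host that never came back.
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # still down or unreachable; keep trying
        time.sleep(delay)
    raise NodeNeverCameBack(f"no response from {url} after {attempts} attempts")


# Hypothetical usage against one of the hosts discussed above:
# wait_for_node_up("http://elastic2088.codfw.wmnet:9200/_cluster/health")
```
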
[20:01:02] inflatador: re: contact groups, in icinga it should be contacts.cfg in the private puppet repo, augmented by modules/nagios_common/files/contactgroups.cfg iiuc
[20:01:52] ah, I figured it had to be in private puppet somewhere. Thanks
[20:02:48] i'm not sure if the full names, like "Traffic" or "Search Platform" have mappings though, i guess i've never seen those. It's always been like, team-discovery
[20:28:05] yeah, same here. I'm not seeing any explicit mappings, but I'm probably just missing something. Maybe it's just a routing thing for ping alerts
[20:34:22] I might just end up making blackbox ping checks for these...really soft thresholds though, > 30m downtime or so
[20:35:18] something feels off from reading the icinga logs, but it's probably me not understanding
[20:35:45] at 06:05:51 we get a bunch of `SERVICE DOWNTIME ALERT` logs, which i assume is each individual alert it is downtiming. host downtime isn't in that list.
[20:36:17] at 06:09:25 it issues a `HOST ALERT: elastic2088;DOWN;HARD...`, no indication that it is downtimed or suppressed. But the alert didn't come to us
[20:36:31] and then at 07:05:46 everything becomes undowntimed, but again it's only service downtimes and not the host
[20:39:45] makes it feel like the host alert did go somewhere, but maybe the wrong somewhere?
[20:39:56] i haven't figured out in puppet how those are defined
[20:44:28] yeah, me neither. I see alerts like that in my email for analytics hosts, but not for search hosts
[20:44:55] `PROBLEM Host aqs2003 - PING CRITICAL - Packet loss = 100%`
[20:48:52] inflatador: afaict, it comes from modules/puppetserver/files/naggen2.py. Might be worth poking at /etc/icinga/objects/puppet_hosts.cfg on the icinga host to see how it's configured
[20:49:12] err, the second file should be the output of naggen2
[20:52:57] meh, pcc is broken :P
[20:58:09] but naggen2 is looking for Nagios_host resources, those are probably defined by monitoring::host, would need pcc to know what's going on there
[21:05:03] inflatador: my theory is we aren't setting the `contactgroups` hiera variable. We are only defined in flink and airflow
[21:05:19] at least, `grep contactgroups hieradata | grep team-disco` doesn't have any elasticsearch instances
[21:19:46] * ebernhardson is pleased to not find any of our instances on https://os-deprecation.toolforge.org/