[07:43:48] wdqs1009 is no longer importing files but the updater has not restarted (the data_loaded flag is not there, so most probably something failed)
[08:47:40] sorry for the noise on Erik's gitlab MR, I was trying to add multiple reviewers but it's not possible apparently :)
[10:55:40] lunch
[10:55:56] lunch 2
[14:06:15] dcausse sorry, I saw it failed over the weekend and forgot to upload the logs. Will do that shortly. FWIW I think it's lexemes that failed
[14:07:16] o/
[14:07:31] inflatador: let's finish the reload manually then
[14:08:12] the logs should tell us from where to resume
[14:13:48] dcausse OK. Logs here: https://people.wikimedia.org/~bking/T323096/buffer.txt.gz
[14:16:51] hmmm, that doesn't look like it dumped the full buffer, let me try again
[14:17:39] inflatador: I meant the cookbook logs
[14:18:01] but the blazegraph logs you shared do not look promising
[14:19:15] understood, will upload the spicerack logs shortly
[14:23:01] dcausse OK, rest of logs up at https://people.wikimedia.org/~bking/T323096/ . I think buffer2 (a dump of my tmux session) has the most info
[14:23:05] unsure that's necessary
[14:23:08] oops
[14:23:51] I meant unsure we'll be able to recover from this checksum error
[14:24:27] Got it, that means the wikidata reload itself failed?
[14:24:53] blazegraph corrupted its own journal
[14:26:34] Not great. Do we just start over again?
[14:26:46] yes most probably...
[14:27:11] will try a restart, I don't remember that we ever recovered from this
[14:27:16] *but
[14:31:05] np, I'll let you take a look. Hit me up if/when you want to try another reload. wdqs2009 can't seem to do it without going into OOM state, not sure what that's about.
[14:33:42] inflatador: OOM?
[14:34:35] dcausse out of memory, the OS gets into an unresponsive state
[14:35:00] happened 3 times so far. (/me should really update the phab ticket with this info)
[14:35:18] inflatador: you have logs for these OOM errors?
[14:36:19] dcausse I can check again, but the first two times it happened, it didn't log anything, which is pretty typical for OOM as the system might kill the logging process
[14:36:56] when it gets like that, the server is pingable, but you can't log in, even from the management console. You can see the command prompt but it's totally unresponsive
[14:38:02] could it be connectivity errors?
[14:38:20] I'm surprised that it can be mem related
[14:38:22] I don't think so, because the server recovers after a hard reboot
[14:38:53] and while I can log in to the mgmt console and see the server's login prompt, it doesn't accept input
[14:39:19] I'm also surprised, we should probably try it on a different server
[14:40:17] sometimes I see alerts flapping on ssh and ping on the new wdqs hosts that do nothing; have you looked into these? perhaps this is a similar problem
[14:41:32] It's possible, although not clear why rebooting would fix that
[14:41:57] nor why ping works but not SSH
[14:42:05] let me check the console though
[14:45:18] nope, I can log in via the mgmt console now, so my best guess is OOM. Hard to prove though, I think it would be more productive to try on a different host in DFW and see what happens
[14:50:12] last log from wdqs2009 is from Jan 14 08:10:47 and then a reboot at Jan 19 21:36:11
[14:52:18] cpu utilization & queue length climb during that period https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wdqs2009&var-datasource=thanos&var-cluster=wdqs&from=1672983562761&to=1674354379805
[14:52:23] iowait
[14:52:47] might be nfs?
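A rough sketch of how one might check the OOM hypothesis on wdqs2009 and the iowait/NFS angle discussed above. This is generic troubleshooting, not part of the reload cookbook; the dates come from the log (year assumed), and reading the previous boot's journal requires persistent journald storage on the host:

    # Did the kernel OOM killer fire before the Jan 19 reboot? (-b -1 = previous boot)
    sudo journalctl -k -b -1 --since "2023-01-14" | grep -iE 'out of memory|oom-killer|invoked oom'
    # Fallback if only the current boot's kernel ring buffer is available
    sudo dmesg -T | grep -iE 'out of memory|oom'
    # Live view of block-device I/O pressure and per-mount NFS activity while a reload runs
    iostat -dxm 5
    nfsiostat 5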
[14:59:04] Possible, but it should be done with NFS once it's done munging (I think?). It always gets past that point
[15:02:49] ryankemper: I'm moving our 1:1 to tomorrow, super busy evening so far
[15:03:14] hm indeed, once munging is done, unless I'm missing something, nothing should be read from nfs
[15:03:19] And let's use that time for our ITC, unless you have something urgent
[15:07:45] inflatador: let's retry on wdqs1009; if we have a machine idle in eqiad we might increase our chances and run it there in parallel too
[15:08:04] dcausse ACK, will get a puppet patch up to open the NFS ports shortly
[15:08:44] for wdqs2009 we'll have to understand what's happening, I'm not sure how to interpret host metrics but something is definitely showing up
[15:09:53] e.g. misc: saturation, procs blocked increasing, as well as tcp/inuse rising to 3k
[15:18:33] Puppet patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/882664
[15:36:11] OK, think I got the ferm situation fixed
[15:37:28] thanks!
[15:51:15] OK, reload for 2010 is started, will start 1009 and 1010 reloads again shortly
[15:51:47] * dcausse crossing fingers
[15:52:06] inflatador: you should not use reuse-munge here
[15:52:52] we need fresh dumps this time, otherwise we won't have enough retention in the kafka topics
[15:53:05] dcausse ACK
[15:53:30] \o
[15:54:11] o/
[15:59:23] do I need to make a new dashboard if I want to see typical node exporter metrics (load, memory) for an arbitrary host? Wanted to look at clouddumps
[15:59:47] I might have flaky internet today, will see. ISP notified me they are doing maintenance today
[15:59:56] inflatador: I think "host overview" might have this?
[16:00:30] https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001&var-datasource=thanos&var-cluster=wmcs
[16:02:01] pfischer, ryankemper, dcausse: triage meeting: https://meet.google.com/eki-rafx-cxi
[17:39:30] hmm, we never had a license file in wikimedia/discovery/analytics
[18:52:55] lunch, back in ~45
[19:18:06] true to their word, the ISP cut off connectivity for their maintenance. On a 4G bridge for now
[19:21:21] back
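On the 15:52 point about not reusing the old munge output: the updater has to catch up from Kafka starting at the dump's timestamp, so the dump must be newer than the topics' retention window or the intervening events are already gone. A sketch of checking that window with the stock Kafka CLI; the broker and topic names below are illustrative placeholders, not the actual production names:

    # Show retention settings for a topic (placeholder broker/topic names)
    # retention.ms is in milliseconds, e.g. 604800000 = 7 days
    kafka-configs.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
      --describe --all --entity-type topics --entity-name eqiad.rdf-streaming-updater.mutation \
      | grep -E 'retention\.(ms|bytes)'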