[11:17:26] Lunch
[11:44:52] lunch 2
[14:05:00] o/
[14:53:56] looks like we're having trouble with the prometheus blazegraph exporter on the new wdqs hosts. Looks like it's not getting the metrics from the right place or something...still looking
[14:54:25] inflatador: scream if you need help, or just want another pair of eyes
[14:54:51] Thanks! I'll do my due diligence but probably will need help sooner or later ;)
[15:40:01] looks like Blazegraph isn't started correctly (see `curl -v localhost:9999/bigdata/`)
[15:40:41] gehel thanks, I was running the test queries and getting all failures...but I was also getting all failures from the public endpoint ;(
[15:41:04] also `tail /var/log/wdqs/wdqs-blazegraph.log` shows:
[15:41:15] 15:29:58.551 [main] WARN o.eclipse.jetty.webapp.WebAppContext - Failed startup of context o.e.j.w.WebAppContext@2e8c1c9b{Bigdata,/bigdata,file:///tmp/jetty-localhost-9999-blazegraph-service-0.3.118.war-_bigdata-any-3699081984997511495.dir/webapp/,UNAVAILABLE}{file:///srv/deployment/wdqs/wdqs-cache/revs/fb7d1611b4c03ef57f826cd484ad23facb22973a/blazegraph-service-0.3.118.war}
[15:41:15] java.lang.Error: Two allocators at same address
[15:41:29] sounds like a corrupted journal. David might know more
[15:41:42] ah, that probably means our reload failed
[15:41:48] yep
[15:42:02] not a puppet problem!
[15:42:04] I **think** the reload to wdqs1009 did work, but I'll verify that now
[15:42:25] what host?
[15:42:30] wdqs2009
[15:46:16] yes, seems like it won't be able to recover from this journal
[15:46:38] also it was copied on Dec 7, so unlikely we have enough retention in kafka to catch up
[15:46:56] wdqs1009, which was reloaded at the same time, seems to be healthy
[15:47:37] December 7th? Hmmm
[15:47:37] reloaded from dumps or using data-transfer?
[15:47:50] -rw-rw-r-- 1 blazegraph blazegraph 1.1T Dec 7 23:44 wikidata.jnl
[15:47:51] reloaded from dumps, but apparently that didn't work
[15:48:08] let me check 1009 again to see if it has a correct modified date
[15:48:31] inflatador: do you still the logs of the import?
[15:48:42] s/still/still have/
[15:49:07] Yeah, it's on cumin, checking now
[15:49:14] if the journal gets corrupted during the import, the import should fail
[15:50:06] Hi. I use POST to access https://query.wikidata.org/sparql with a bot. It seems that the first access works fine and then the second POST gets a 403. Is this some kind of limit I am hitting? (I would expect a 429 in that case.) I have set a unique user agent. Is there a whitelist I need to apply to? Thanks
[15:50:10] the data_loaded flag is not there, so it did not finish
[15:50:53] Kotz: you should get a 429 if you're putting too much pressure on the service.
[15:51:20] this is our first time loading with NFS and the errors might not be as helpful. Still checking
[15:51:22] If you ignore the 429 for too long, at some point we start to ban that user agent for 24h with a 403.
[15:51:44] Kotz: if you can give me the user agent, I can have a look at the logs to see if I see something else
[15:52:56] thanks @gehel, look for "KotzBot" in the user agent
[15:53:47] Kotz: and do you have a timeframe during which you made those requests?
[15:54:43] gehel last one was 12 minutes ago
[15:57:27] Kotz: I don't see anything...
[15:57:45] * gehel is only looking into logstash at the moment
[15:58:05] I'll start again now gehel
[15:59:33] Kotz: any chance you still have a dump of the request and you could give me the value of the "x-served-by" header?
[16:00:24] 2023-01-06 15:58:19 (my local time, which is now) - successful request.
[16:00:25] and 15:58:40 gets a 403
[16:00:30] gehel
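For reference, a minimal sketch of the kind of bot request being debugged here, assuming Python's `requests` library. The endpoint and the status-code semantics (429 for throttling, escalating to a 24h ban on the user agent served as 403) come straight from the conversation above; the User-Agent string, contact details, query, and backoff value are placeholders, not Kotz's actual setup.

```python
import time
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {
    # Placeholder UA: WDQS expects a descriptive, unique User-Agent with contact info.
    "User-Agent": "KotzBot/0.1 (https://example.org/kotzbot; kotz@example.org)",
    "Accept": "application/sparql-results+json",
}


def run_query(query: str) -> dict:
    """POST a SPARQL query, backing off on 429 instead of hammering the service."""
    resp = requests.post(ENDPOINT, data={"query": query}, headers=HEADERS, timeout=60)
    if resp.status_code == 429:
        # Too many requests: honour Retry-After if the service sends one (assumed to
        # be a seconds value here). Ignoring repeated 429s is what eventually gets
        # a user agent banned for 24h, returned as 403.
        time.sleep(int(resp.headers.get("Retry-After", "60")))
        resp = requests.post(ENDPOINT, data={"query": query}, headers=HEADERS, timeout=60)
    if resp.status_code == 403:
        # The failure being discussed: dump the response headers (including
        # "x-served-by", which gehel asked for) before giving up.
        print(resp.status_code, dict(resp.headers))
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    rows = run_query("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5")
    print(rows["results"]["bindings"])
```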
[16:00:51] here's the abridged version of the logs, 1009 and 2009 both logged to this file https://phabricator.wikimedia.org/P42929
[16:03:00] \o
[16:04:59] patch for the NFS data reload is here. We don't need to merge, but in case you'd like to match the log errors to the code: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/876217
[16:05:09] Kotz: this at least doesn't seem to be throttled or banned from the WDQS service itself. If you could open a Phab task and put as much info as possible, maybe we can dig a bit more into it.
[16:06:06] If you could put the example of the request you're doing, and ideally a full dump of the response you're getting, that would help a lot in tracking this down!
[16:06:23] ok will do. thanks
[16:06:34] o/
[16:08:52] inflatador: when did the reload start?
[16:09:04] 2023-01-03 ?
[16:09:44] dcausse Y, the earlier timestamp is DFW
[16:09:47] codfw that is
[16:09:59] the eqiad reload for 1009 started a few hours later, and appears to have worked
[16:11:15] I see a bunch of errors related to timestamp extraction
[16:12:39] the KeyError is because I forgot to change "url" to "read_path" in the lexeme dict, just fixed
[16:13:10] you can ignore the errors from dry tuns
[16:13:13] errr...dry runs
[16:13:27] dry run doesn't actually run the remote shell cmd, so it always errs
[16:14:24] but it was never started for 2009? can't seem to find it in the logs
[16:14:35] 1009 failed as well, no?
[16:14:43] 1009 did not fail, as far as I can tell
[16:14:51] except lexemes
[16:15:51] actually nm
[16:16:15] I didn't mess up the lexeme key
[16:16:16] hmm.. it did only munge the main dump file, it did not start the import
[16:16:56] also, if the import takes less than 8 days it's likely a failure
[16:17:33] We're using NFS now, so we assume it will be faster
[16:17:39] not sure how much faster yet though
[16:18:05] import won't be faster
[16:18:55] "download + munge from local disk" old style might be slower than "munge from nfs path"
[16:19:18] might be faster you mean?
[16:20:22] I mean if the nfs approach helps, the old method "download from a mirror and munging over a local file" might be slower
[16:20:42] actually I don't know if we munge from the nfs path or still copy locally
[16:20:57] we munge from the NFS path now
[16:21:31] ok
[16:21:33] I'm not sure if that would be faster or slower
[16:21:57] unsure, but that's only ~20h of an 8-10 day process
[16:22:22] the slow part is importing the munged chunks into blazegraph
[16:22:51] that step did not seem to start in the logs you pasted
[16:24:07] Interesting. Here are the changes I made to the cookbook: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/876217/1/cookbooks/sre/wdqs/data-reload.py
[16:24:27] gehel https://phabricator.wikimedia.org/T326427 Hope I did it correctly. Going offline now
[16:24:34] thanks in advance for your elp
[16:24:39] help
[16:26:27] `File "/home/bking/wmf/spicerack/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 181, in munge
[16:26:28]     .format(path=dump['read_path'],
[16:26:28] KeyError: 'read_path'`
[16:26:53] probably need to look at the 'munge' function, checking
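To make the pasted traceback a bit more concrete, here is a hypothetical, much simplified sketch of the failure mode described above: one dump descriptor was updated to the new NFS-style "read_path" key while another still carried the old "url" key. The dict layout, paths, and function body are invented for illustration; the real cookbook (sre/wdqs/data-reload.py) is structured differently.

```python
# Invented dump descriptors, only to illustrate the KeyError above: the main
# dump was switched to the NFS-style 'read_path' key, the lexeme dump wasn't.
DUMPS = [
    {"name": "wikidata", "read_path": "/mnt/nfs/dumps/wikidata-all.ttl.gz"},
    {"name": "lexemes", "url": "https://dumps.example.org/wikidata-lexemes.ttl.gz"},
]


def munge(dump: dict) -> str:
    # dump["read_path"] raises KeyError: 'read_path' for the second descriptor,
    # at the same .format(...) call site shown in the pasted traceback.
    return "munging {name} from {path}".format(name=dump["name"], path=dump["read_path"])


for dump in DUMPS:
    print(munge(dump))
```

The fix described in the chat was simply renaming the stale "url" key to "read_path"; a more defensive variant could use `dump.get("read_path")` and fail with an explicit error message instead of a bare KeyError.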
[16:27:15] Kotz: thanks! Hopefully we'll find something. Probably not before the weekend :/
[16:27:38] and it does look like there is some code checking local disk size, even though we don't use it anymore
[16:29:40] inflatador: are we sure we're going to use nfs in the long term?
[16:29:47] Kotz: if you could add what request you're doing, that would also help
[16:30:10] if yes, this "fetch_dumps" function no longer makes sense
[16:30:15] it should just munge
[16:31:39] dcausse unsure. SREs want us to use rsync long-term. But NFS is already set up, so I was hoping we could get the reload done before changing again. We'd be using rsync in a different way than anyone else in the puppet code, so I'm a little wary about that
[16:32:21] ok, adding a few comments to your patch
[16:32:52] but some of these might not apply with rsync, as we'll have to "fetch" the dumps
[16:35:19] thanks dcausse, taking a look
[16:35:42] Kotz: It looks like this was blocked by our nginx reverse proxy, and the only rule we have that would block it is for an empty user agent. Are you sure the user agent is sent with that second request?
[17:42:50] going offline
[17:44:19] I just updated the reload cookbook based on d-causse's suggestions and restarted the reload on wdqs1009 and wdqs2009
[17:55:35] lunch, back in ~1h
[18:27:12] ryankemper, inflatador: looks like we have an underrun in another project, we might get a few more WDQS servers
[18:53:15] back
[19:48:53] how weird, google is back to infinite scroll on search results
[19:54:16] gehel: excellent!
[20:24:54] meh, somehow journalctl on an-airflow1001 only goes back to 8-ish hours?
[20:25:01] -- Logs begin at Fri 2023-01-06 12:50:15 UTC, end at Fri 2023-01-06 20:24:33 UTC. --
[20:29:59] and the answer seems to be... journald has an option to try to preserve an amount of space for the rest of the system, based on a percentage. So when it sees a disk that's mostly full (like an-airflow usually is), it trims the logs
[21:16:53] quick break, back in ~30
[21:38:39] back
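On the journald observation near the end: systemd-journald caps the persistent journal relative to the size of the filesystem it lives on. Per journald.conf(5), SystemMaxUse= defaults to 10% and SystemKeepFree= to 15% of that filesystem, so on a mostly full disk old entries get vacuumed aggressively. A sketch of the relevant knobs, with example values picked purely for illustration and not an-airflow1001's actual configuration:

```ini
# /etc/systemd/journald.conf (illustrative values only)
[Journal]
Storage=persistent
# Upper bound on how much disk the journal may use (default: 10% of the fs, capped at 4G).
SystemMaxUse=2G
# Space journald tries to leave free for the rest of the system
# (default: 15% of the fs, capped at 4G); this is the knob that bites on a nearly full disk.
SystemKeepFree=5G
```

`journalctl --disk-usage` shows how much space the journal is currently using, which helps confirm whether this limit is what is trimming the history.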