[08:03:30] on wdqs2021 the systemd unit wdqs-updater.service has two definitions: /etc/systemd/system/wdqs-updater.service (wrong one) and /lib/systemd/system/wdqs-updater.service
[08:23:34] new graph split hosts seem a bit noisy (unstable?) with e.g. "SSH on wdqs1021 is CRITICAL" and "SystemdUnitFailed: systemd-timedated.service"
[08:36:56] strange... can't ssh to any of the graph split hosts in eqiad (wdqs102[1-4]) but they seem to respond to sparql queries query-main/query-scholarly
[08:43:32] ryankemper, inflatador: could you have a look at those servers when around? ^
[08:57:08] I can't connect either.
[08:57:18] Nothing entirely weird on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=wdqs-main&var-instance=All
[08:57:36] CPU usage is low, but load is high. Not sure where this is coming from.
[08:58:28] brouberol: Any chance you could have a quick look before Ryan / Brian are around? We just sent the communication about those new endpoints, it would not look too good if the first experience for our users is a crashed system.
[08:59:08] note that at the moment, the system seems to be running fine from a user point of view.
[09:01:26] could be categories?
[09:01:51] why is it logging to console? if I attach to the console via mgmt to log in from there it's a flood of logging from wdqs-categories
[09:02:03] not sure why it ends up there, making the login impossible
[09:02:30] weird
[09:02:33] that explains at least some of it
[09:02:55] We should really move categories to their own hosts
[09:03:03] ^
[09:03:18] maybe even ganeti or k8s. Low traffic, low IO and all that
[09:03:42] gehel: catching up
[09:04:14] brouberol: if you need more context, I can jump in a meet, or we can do it here, or slack
[09:04:48] a systemd timer should log to the journal, not the console?
[09:04:54] we're in a team meet atm. I can attempt later, but I'll only have a couple of minutes, after which I'll need to prep lunch for when my wife comes back
[09:05:07] brouberol: ack
[09:05:29] no real emergency here, just some confusion and a semi-high-stakes situation
[09:12:31] I see "StandardOutput=journal+console" for /lib/systemd/system/wdqs-categories.service
[09:13:33] well it's actually set like that for all wdqs related services
[09:13:47] unsure if this is good practice or expected
[09:14:17] probably not :/
[09:16:14] I'll have a look at wdqs1021 as well.
[09:16:37] prepping a quick patch to stop logging to the console at least
[09:16:58] dcausse: ack, thanks. Feel free to tag me.
[09:17:03] thx!
[09:22:56] is there a way to see where a systemd timer definition is located?
[09:23:34] They are normally in /lib/systemd/system
[09:24:02] But for these custom blazegraph services, I'll just check the puppet definition.
[09:25:49] it's simply using "systemd::timer::job" without any template
[09:26:56] What about this? https://github.com/wikimedia/operations-puppet/blob/production/modules/query_service/templates/initscripts/blazegraph.systemd.erb#L11
[09:27:39] dcausse: Sorry, I think I misunderstood your question.
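For the unit-file questions in this stretch of the log, systemctl itself can report which file backs a unit or timer and where its output goes. A minimal sketch, assuming the unit names are exactly the ones mentioned above (wdqs-updater.service, wdqs-categories.service, load-categories-daily.timer):

    # show the file systemd actually loaded for a unit or timer
    systemctl show -p FragmentPath wdqs-updater.service
    systemctl show -p FragmentPath load-categories-daily.timer

    # print the effective unit file, including any /etc overrides and drop-ins
    systemctl cat wdqs-updater.service

    # check where stdout/stderr are routed (journal, console, ...)
    systemctl show -p StandardOutput -p StandardError wdqs-categories.service

FragmentPath also settles the duplicate-definition case from 08:03:30, since it prints the path systemd treats as authoritative (units in /etc/systemd/system take precedence over /lib/systemd/system).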
[09:29:24] But the answer is the same, any timer that is created by `systemd::timer::job` creates a file called `/lib/systemd/system/$resource.timer` and a corresponding `/lib/systemd/system/$resource.service`
[09:29:39] btullis: I was not super clear, I meant this: https://github.com/wikimedia/operations-puppet/blob/production/modules/query_service/manifests/crontasks.pp#L140, and trying to see how this gets materialized to a system unit/timer definition file
[09:29:51] btullis: thanks
[09:30:28] looking at those on a random host, I don't see the same "StandardOutput=journal+console" setting, so I suppose we're fine
[09:30:49] it's only blazegraph itself and the updater
[09:34:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070545
[09:34:57] Yes, I see what you mean. I don't see that timer definition where I would expect it to be.
[09:35:34] found it with a brute 'find /lib/systemd -name *timer* | grep categ' in the end
[09:36:32] but they're also in /lib/systemd/system/load-categories-daily.timer
[09:37:04] I mean in /etc/systemd/system/multi-user.target.wants/load-categories-daily.timer but nvm this is a symlink
[09:40:39] That's merged and deploying now. I will check whether or not it restarts services automatically after a config file update.
[09:41:06] thanks!
[09:44:46] Yes, it triggered an automatic restart of both blazegraph and the updater service when I ran puppet on wdqs2025.
[09:45:23] Oh, maybe not.
[09:45:26] https://www.irccloud.com/pastebin/7DzEJaaB/
[09:47:58] yes I'd be surprised if it gets restarted automatically
[09:50:50] What actually /is/ wdqs-categories? Is there any chance that we are receiving unexpectedly high load for this service? Can we depool it independently of wdqs-main, to allow us to restart the required services on these hosts?
[09:52:27] btullis: I suspect that we don't get any traffic for categories to wdqs-main, I believe that it's the reload process (reload-categories timer) that is causing load issues
[09:52:53] we reload this categories database and it's not supposed to be served from here
[09:53:01] Oh, I see.
[09:53:09] we should definitely move this out
[09:53:38] they happen to be close to wdqs for historical reasons (simply because this is where blazegraph is deployed)
[09:54:07] working on a couple of patches to clean this up a bit
[10:02:17] lunch
[13:17:21] \o
[13:20:39] dcausse btullis what'd I miss w/categories? Anything else I can help with?
[13:20:49] o/
[13:21:35] I haven't touched anything since d.cause went to lunch. Still around if I can help.
[13:21:35] inflatador: did not follow up yet after the puppet patch cause I could not ssh there this morning
[13:21:48] taking a look now
[13:22:31] I think that the reload of categories may have finished.
[13:22:56] yes
[13:23:00] looks like the graph split hosts had a problem with categories logging that caused them to lock up? Is that right?
[13:23:04] The last entries on my scrolling console were from `zh.wiktionary.org` so I reckon it may have been working alphabetically.
[13:23:53] going to restart the systemd services to stop logging to the console
[13:24:04] Great.
[13:24:05] not sure if that's going to fix the issue or not
[13:24:21] We should do all of the wcqs,wdqs instances. Should we set off a cookbook?
[13:24:41] btullis: ideally yes
[13:24:56] it needs: sudo systemctl restart wdqs-blazegraph.service wdqs-categories.service wdqs-updater.service
[13:24:59] on every host
[13:25:09] run puppet first?
[13:25:28] yes probably and perhaps depool/repool as well?
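The restart discussed at 13:24:56-13:25:28 would roughly translate to the per-host sequence below; a minimal sketch, assuming the usual WMF depool/pool wrappers and the run-puppet-agent helper are available on these hosts (the actual work was done via a cookbook run):

    # apply the merged puppet change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070545) first
    sudo run-puppet-agent

    # take the host out of rotation, restart the affected units, then repool
    sudo depool
    sudo systemctl restart wdqs-blazegraph.service wdqs-categories.service wdqs-updater.service
    sudo pool

Running this by hand is mainly useful for a one-off host; the cookbook run mentioned at 13:28:17 covers the rest of the fleet.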
[13:28:17] dcausse btullis I'll do the cookbook run...but I feel like I'm probably missing some context still. Up in https://meet.google.com/aod-fbxz-joy if anyone has time to fill me in
[13:45:24] btullis https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph/Deepcat more context on the tool that uses wdqs-categories
[13:48:15] created T374009 for the wdqs host investigation, feel free to add any context there
[13:48:16] T374009: Investigate WDQS graph split hosts' unresponsiveness - https://phabricator.wikimedia.org/T374009
[14:31:48] dcausse: meeting with WMDE: https://meet.google.com/bfe-uzwh-ytj
[14:41:40] wdqs eqiad hosts are done restarting, moving on to wdqs codfw
[14:46:39] created T374016 to discuss what to do with `wdqs-categories`
[14:46:40] T374016: Consider separating wdqs-categories from the rest of the wdqs stack - https://phabricator.wikimedia.org/T374016
[14:57:19] \o
[14:57:28] o/
[14:57:32] catching up on wdqs stuff
[14:57:57] so the categories reload happened without the namespace created and failed all queries
[14:58:22] so I suspect that the massive error logging to the console is what caused the issue?
[15:00:41] Search Platform office hours starting in https://meet.google.com/vgj-bbeb-uyi?
[15:01:16] may I update your CR commit msgs to add bug T374009 ?
[15:01:16] T374009: Investigate WDQS graph split hosts' unresponsiveness - https://phabricator.wikimedia.org/T374009
[15:01:23] dcausse ^^
[15:02:20] dr0ptp4kt: i'm going to have to move our meeting this afternoon, turns out i have to take liam to ortho
[15:02:40] thx ebernhardson i'll update it
[15:51:12] volunteer time, will be back in ~2h
[17:09:29] dinner
[17:10:20] according to audio pronunciation on a very large search engine, it's "syn"tillating ryankemper, you had it!
[17:10:34] Internet keeps dying, will rejoin meet in a few mins
[17:48:00] related thought for later, users can have multiple deepcat keywords in a single query. Do we put some limit on repeated keywords?
[17:58:06] back
[19:47:16] OK, I've merged all wdqs cruft removal patches from this US morning, ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070545
[20:28:10] ryankemper my volunteering is running late, going to be late to pairing
[20:28:44] ryankemper updated invite
[21:09:57] ack
[22:16:39] OK, updated T374009 to the best of my abilities...if anyone who was more closely involved wants to add details feel free. Would appreciate some perspective on user impact
[22:16:40] T374009: Investigate EQIAD WDQS graph split hosts' unresponsiveness - https://phabricator.wikimedia.org/T374009
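The console-logging fix referenced at 19:47:16 was merged through puppet (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070545). Outside of puppet, the equivalent change would be a systemd drop-in rather than an edit to the packaged unit file; a minimal sketch, assuming a drop-in would be acceptable on these hosts (unit name and settings are illustrative, not the actual change):

    # hypothetical drop-in; the real change went through puppet
    sudo systemctl edit wdqs-categories.service
    # then add, in the editor that opens:
    #   [Service]
    #   StandardOutput=journal
    #   StandardError=journal
    sudo systemctl restart wdqs-categories.service

This keeps the flood of categories errors (the 09:01:51 symptom) out of the machine console while still landing everything in journald.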