[06:02:56] <elukey>	 hello folks, good morning
[06:03:32] <elukey>	 I see some icinga alerts for elastic eqiad/codfw nodes, that say
[06:03:33] <elukey>	 CRITICAL - $.search.remote.chi.seeds not found,$.search.remote.psi.seeds not found
[06:03:54] <elukey>	 is it WIP due to yesterday's outage or something different?
[06:15:24] <elukey>	 it doesn't seem related to a single cluster (I see on one host omega, on the other one psi)
[06:16:16] <elukey>	 in the logs there are a lot of 
[06:16:17] <elukey>	 [2021-09-13T20:31:14,030][WARN ][org.elasticsearch.deprecation.common.ParseField] Deprecated field [_retry_on_conflict] used, expe
[06:16:20] <elukey>	 cted [retry_on_conflict] instead
[06:20:23] <elukey>	 no sorry red herring
[06:31:49] <elukey>	 doesn't seem something really burning, but lemme know if there is anything to do
[06:38:36] <elukey>	 side note - I created a while ago https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts to assign a specific little "runbook" to every alert (rather than the generic admin page) as an effort to have more people being able to work on alerts without a ton of background
[06:38:59] <elukey>	 it seems working fine, it took a little bit of time but it may be a good idea for the Elastic search alerts as well
[06:40:06] <elukey>	 (In case you are interested I can offer my time to review the runbooks and give feedback, if I can follow along them anybody can afterwards :D)
[06:45:35] <dcausse>	 elukey: o/, looking
[06:47:58] <dcausse>	 about the runbooks it's a good idea, I saw that the new alerting system is somewhat forcing to have a link to a runbook which is good
[06:53:06] <elukey>	 exactly yes, a little context + some things to check speed up things
[09:00:05] <zpapierski>	 break&relocation
[10:15:23] <dcausse>	 lunch
[12:49:20] <gehel>	 zpapierski: want to talk SOLID? Ping me when around (meetings starting in ~40')
[12:49:43] <gehel>	 Or should we do a team wide discussion on SOLID? A new learning circle?
[12:50:50] <zpapierski>	 we can talk 1/1, I might be wrong about stuff :)
[12:51:05] * gehel is sure to be wrong about a lot of stuff!
[12:51:22] <gehel>	 meet.google.com/aay-fggr-uuz
[13:12:31] <dcausse>	 put up https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater (reviews welcome) mainly to serve as a context & runbooks for new alerts (c.f. https://gerrit.wikimedia.org/r/c/operations/alerts/+/720066) 
[13:18:29] <gehel>	 We have a runbook page for WDQS: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook
[13:18:44] <gehel>	 we should probably link the streaming updater runbooks from there
[13:18:59] <gehel>	 (I haven't read your page yet, take this with a grain of salt)
[14:34:58] <hashar>	 hello, I found some strange issue with search related to PageImages extension ` Error: Call to a member function getMimeType() on null`   apparently when looking up for thumbnails
[14:35:02] <hashar>	 https://phabricator.wikimedia.org/T290973
[14:35:30] <hashar>	 that started Sep 13th 2021 at 19:57 , and I could not find any matching event that could have lead to that
[14:36:00] <hashar>	 easy repro is: `time curl --verbose 'https://fr.wikipedia.org/w/rest.php/v1/search/title?q=circu&limit=10'` which ends after 30 seconds with no content
[14:36:16] <hashar>	 or enter `circu` in the search bar of https://fr.wikipedia.org/
[14:36:40] <hashar>	 I haven't made it a train blocker cause that does not seem related to any mw change
[15:00:15] <ebernhardson>	 \o
[15:01:49] <ebernhardson>	 hmm, something in completely wrong with chrome ... tab is skipping over password fields
[15:01:58] <dcausse>	 o/
[15:02:11] <ebernhardson>	 like, in what world would i want an http plain auth modal dialog, type in the username, and then tab to submit? seems so odd :P
[15:09:54] <ebernhardson>	 for page images...it's not clear what changed, that repo has been static for some times. I can band-aid in something to catch the exception i suppose
[15:11:05] <ebernhardson>	 and big email coming through...i guess the CEO search went quickly
[15:14:39] <elukey>	 dcausse: qq - anything that we can do to clear out the ES alerts? (if there is a task I can ack them)
[15:17:56] <dcausse>	 elukey: I have a patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/720908
[15:18:46] <dcausse>	 not sure why this changed during the outage restart yesterday, I suppose elastic did some "automatic migration" to new setting names
[15:18:53] <ebernhardson>	 dcausse: afaik some clusters had all 3 masters down at the same time, we lost the transient cluster settings
[15:19:12] <dcausse>	 ah.. these ones are persistent tho
[15:19:19] <ebernhardson>	 now, i suppose the real problem is transient settings shouldn't be important. There is literally a second option called persistent we should probably use :)
[15:19:28] <elukey>	 dcausse: thanks!
[15:19:36] <ebernhardson>	 dcausse: hmm, for all 9 clusters?
[15:19:54] <ebernhardson>	 i guess i should look, but i'm assuming some clusters lost state and others did not
[15:21:12] <dcausse>	 remote cluster config changed from "search" to "cluster" I haven't check all the clusters (the alert is only configured for the small cluster IIUC)
[15:21:43] <ebernhardson>	 huh
[15:22:20] <ebernhardson>	 somehow i completely glossed over that part in your commit message, even though it clearly says it :)
[15:22:52] <dcausse>	 sigh... chi still has "search" :/
[15:23:36] <dcausse>	 psi&omega have cluster
[15:23:46] <ebernhardson>	 dcausse: i suppose codfw:9243 is the only one that i know lost state, i had to restore the logging config there
[15:24:03] <ebernhardson>	 if thats also the only one with search at top level, something happened there? Dunno what exactly
[15:24:20] * ebernhardson feels like need to find where this swaps out in the elsatic codebase to understand whats going on
[15:24:33] <dcausse>	 in codfw it's the opposite :(
[15:24:33] <ebernhardson>	 but thats hours :P
[15:24:47] <ebernhardson>	 sigh
[15:25:00] <dcausse>	 might be somewhere when reloading from disk I suppose :/
[15:25:20] <dcausse>	 but that alert is going to be broken no matter what we put in there
[15:26:00] <ebernhardson>	 elastic docs are clearly `cluster.remote.*`, were we in some back-compat mode maybe?
[15:26:24] <dcausse>	 yes it was "search" initially
[15:27:24] <dcausse>	 we should make this setting part of elasticsearch.yml I think
[15:27:39] <ebernhardson>	 the seeds probably should, yea
[15:28:14] <dcausse>	 I remember something like: elastic won't boot if it can't setup the connection but I hope that's been fixed
[15:28:38] <ebernhardson>	 oh right...no way thats been fixed in our version :S
[15:39:14] <dcausse>	 jsonpath supports (node1|node2) I'll use that for now I don't want to messup with elastic today :)
[15:41:46] <ebernhardson>	 :) makes sense ... only other thing i can think of is to properly-reboot the cluster that didn't switch. But that seems annoying
[15:42:27] <ebernhardson>	 (and no clue if that works)
[15:43:18] <ebernhardson>	 i wonder if we need some better ideas about when to use persistent/transient, i don't really remember why i choose a particular side of that
[15:44:31] <dcausse>	 I think persistent would make sense for nodes ban with H/W issues (possibly long delay)
[15:44:48] <dcausse>	 but tbh I don't think we expect the 3 masters to be down at the same time
[15:45:25] <ebernhardson>	 yea, even with the name transient my assumption is clearly that it would be perm
[15:45:33] <ebernhardson>	 maybe we just use persistent for everything?
[15:46:06] <dcausse>	 yes it's just that I hope we don't abuse it over proper puppet config :)
[15:46:36] <ebernhardson>	 lol, indeed
[15:52:13] <Guest71>	 Sorry to interrupt, but, I think we should expect that 3 masters could go down(as it happened today) even though we do not want it to and have a mechanism handle such contingency. so, i think we should create ticket for the same.
[15:53:47] <Guest71>	 (because murphy's law)
[15:54:26] <ebernhardson>	 i guess my thought is, this is the first time in 6+ years i've seen this happen. How much effort do you spend to stop something once every 6 years?
[15:54:39] * ebernhardson checks the xkcd chart ;)
[15:57:51] <dcausse>	 hopefully what we lost in the transient settings is not that important but making them persistent will improve this a little bit,
[15:58:03] <Guest71>	 Oh, well, if its that long, then may I'd create backlog to lowest priority that notes this incident for historical reasons with steps to fix. Sorry, its my first day at WMF and going through at Search Platform's Task board.
[15:59:48] <ebernhardson>	 Guest71: dont worry! We indeed should think about it and make some changes, mostly i'm thinking the changes will be small operational things like stop pretending transient and persistent settings are the same :)
[16:00:27] <ebernhardson>	 and then i know there is some plan around making sure the orchestration delays until the servers are healthy before restarting the next set of servers (prob most important, imo)
[16:00:38] <ebernhardson>	 Guest71: also, welcome!
[16:00:38] <Guest71>	 (a ticket / task at backlog*) and wiki page for the weekly report / monthly incidents may be
[16:01:33] <Guest71>	 thanks, i think you mean a health check script - post deployment / reboot
[16:03:04] <ebernhardson>	 i'd have to check, but i think this reboot was run through spicerack which is (maybe, i've never used or written any of it, i dont have access) ops stuff to automate cluster restarts and such. It sounds like it was perhaps missing a health check in there somewhere indeed
[16:03:29] <dcausse>	 I think systemd itself simply restarted elastic
[16:03:36] <dcausse>	 while applying puppet
[16:03:40] <ebernhardson>	 dcausse: oh? That should be 100% disabled
[16:04:04] <dcausse>	 I don't think a systemctl restart was explicitely requested 
[16:04:04] <ebernhardson>	 dcausse: from report: 16:25 operator runs puppet on rest of fleet, 6 hosts at a time: sudo cumin -b 6 'P{elastic*}' 'sudo run-puppet-agent'
[16:04:19] <ebernhardson>	 so, it was puppet :s
[16:04:36] <dcausse>	 yes, puppet should not be able to restart elastic I think that's the issue
[16:04:38] <ebernhardson>	 i'm fairly certain at some point that was disabled, puppet should never restart elastic
[16:04:40] <ebernhardson>	 ever
[16:04:49] <ebernhardson>	 but obviously not disabled anymore :S 
[16:05:25] <dcausse>	 I notice it happened on the wdqs-updater and was a bit surprised so I wonder if some "defaults" changed somehow
[16:05:43] <dcausse>	 auto restart when the system unit changes
[16:05:52] <ebernhardson>	 hmm, maybe? I guess thats worth looking into as well
[16:05:57] <dcausse>	 yes
[16:06:35] <dcausse>	 Guest71: welcome btw!
[16:10:56] <Guest71>	 Yes, You are right dcausse - It wasn't systemd - https://github.com/elastic/elasticsearch/issues/25425 - they never added restart on failure.
[16:12:06] <Guest71>	 On the other hand, I think puppet or any CD service should restart the service deployed or being deployed in case of failure.
[16:13:10] <dcausse>	 Guest71: we don't use initscripts provided by elastic, the one we use is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/templates/initscripts/elasticsearch_6%40.systemd.erb
[16:14:29] <dcausse>	 I don't know why we don't have restart on failure tho, I guess it never occured to us yet (where elastic simply stops)
[16:14:45] <Guest71>	 Oh.. i see 
[16:14:47] <ebernhardson>	 i guess my worry with allowing puppet to restart things is there is no orchestration, puppet just runs and restarts things when it does
[16:15:12] <ebernhardson>	 even though ops ran an explicit thing to do 6 instances at a time, puppet could also be deciding to do other instances
[16:16:56] <ebernhardson>	 i'd prefer to entirely disable restarting in puppet and require an operator to use the same automation that does our typical restarts to ship new plugins or elastic versions, but that does leave the opportunity for mistakes where puppet deploys a config change and it's never picked up (until 3 months later during something unrelated and we get surprisde)
[16:16:59] <Guest71>	 ooh.. ok that was a bit shock "no orchestration" - sorry, i was going by the definition of pipeline and CI/CD.
[16:18:21] <ebernhardson>	 ahh, so CI/CD could orchestrate. But puppet just runs on a timer of 60 min +- some skew. Whenever a host runs puppet with a new elastic config it would restart itself
[16:19:09] <ebernhardson>	 Guest71: also, i almost feel awkard asking, but is there a name i can use other than guest? :)
[16:19:20] <Guest71>	 Sure, My name is Dinesh.
[16:20:10] <ebernhardson>	 i'm sure there will be lots of surprising things here...i suppose it's common to hear from new hires that they expected a much higher level of automation/CI/etc. than we actually have
[16:20:53] <Guest71>	 I signed up today - https://www.mediawiki.org/wiki/User:TheReadOnly
[16:22:22] <Guest71>	 Sorry, I am not really a new hire yet. I started working on the tasks here out of curiosity and needed some more satisfaction of accomplishing good things before I go to bed.
[16:22:40] <ebernhardson>	 Guest71: i just realized that, thats ok too though. We are an open project and anyone is welcome to be interested :)
[16:27:25] <Guest71>	 I'd be more than happy to setup automation / CI. Since, I've been an SRE for over 5 years, I came around this this event today. In my experience I'd not use an operator or prefer manual work by ops team or dev team as the same is prone to lot of errors and rather have automation scripts which is code reviewed and verified way to work.
[16:28:02] <Guest71>	 Thanks ebernhardson
[16:29:05] <ebernhardson>	 the difficulty tends to be around the underlying systems. So for example our CI runs in wmf cloud. wmf cloud can only talk to production over the public internet
[16:29:11] <ebernhardson>	 understandably, you can't deploy from the public internet :)
[16:29:54] <ebernhardson>	 that limitation has to be in place because anyone can apply for a project in wmf cloud and we provide compute free of charge
[16:31:04] <ebernhardson>	 the longer term solution has to rely on our release engineering department, but they are currently in the midst of a big project to deploy gitlab (not including CI and such, just code review and hosting)
[16:31:07] <Guest71>	 I understand that. Thanks. Since I haven't setup things using puppet or setup elasticsearch (strictly only use AWS things so far) - let me do a deep dive on how to setup and may be i could write the steps on wiki or author change.
[16:31:13] <Guest71>	 oh, i see the limitation part now. thanks for explaining
[16:31:35] <Guest71>	 ooh, migration to gitlab - that sounds cool.
[16:33:10] <ebernhardson>	 i'm a fan of gerrit that we use now, but it's quite hard to teach and most people need a year or two to really start to like it. Rather than constantly losing contributors who want the github style env we are moving over to that kind of thing (and to be fair our current code hosting is useless, only the code review works well)
[16:33:45] <ebernhardson>	 and "works well" is limited to people who have used gerrit for years likely :P I understand gerrit is quite difficult for many newcomers
[16:34:35] <Guest71>	 i understand that i can't deploy from public internet as I haven't setup my dev account yet and uploaded keys - and while on that note - can you please point me towards - should i use the toolsforge setup or cloud VPS setup at
[16:34:36] <Guest71>	 https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_account#Step_2:_Decide_which_service_you_need_a_Wikimedia_developer_account_for and FYI - i don't have any personal projects to host right now, but I'd like to contribute the current tasks on the board.
[16:37:27] <ebernhardson>	 Guest71: hmm, sorry to take a moment but i'm trying to ponder. We have some pre-defined paths to help new contributors into mediawiki development, but on the sre side we have much less. TBH the only non-wmf people i know of that sometimes contribute sre stuff used to be sre's here.
[16:37:34] <ebernhardson>	 but i'm sure there is somewhere to start, i just need to think
[16:40:15] <Guest71>	 I haven't used github much - gitlab looks wonderful  to me - all (at least most of things) at one place though - gerrit looks similar to review board which i am more used to review board - https://en.wikipedia.org/wiki/Review_Board - and gerrit is almost similar with a bit of UI changes.
[16:42:59] <dcausse>	 dinner
[16:43:01] <Guest71>	 wow, so it is really an unexplored path for a non-wmf person to contribute to SRE stuff - Nice,  i like it that way - it is more challenging - like navigating in a jungle. Then, I'll sign up to both the accounts. 
[16:43:04] <ebernhardson>	 probably the biggest different in gerrit, or at least in how gerrit is typically used, is that everything is one patch at a time and patches are expected to be refined and polished. As opposed to github which prefers a branch that you keep adding patches to 
[16:43:38] <ebernhardson>	 in github you would submit a pr, add a few more patches to the branch, and then maybe squash and merge. In gerrit you keep polishing the first patch until it's what you want
[16:44:02] <ebernhardson>	 in practice, it's about the same, you just commit and squash locally before sending to gerrit
[16:47:15] <ebernhardson>	 i put a question in our sre chat, see if someone there has an idea of where we could direct you :)
[16:47:25] <majavah>	 Guest71: both accounts on https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_account#Step_2:_Decide_which_service_you_need_a_Wikimedia_developer_account_for are the same, the workflow is just made a bit less confusing for people who are only intend to use Toolforge
[16:48:08] <majavah>	 also hi o/, I guess I'm one of the few non-previously-employed folks who are volunteering sre stuff, mostly on the cloud side though (plus a mediawiki committer)
[16:48:26] <ebernhardson>	 i totally forgot, of course wmf cloud has a couple non-wmf people helping out
[16:48:36] <ebernhardson>	 thanks :)
[16:50:07] <Guest71>	 i agree about github and I think that's how it works on gitea ( as i am used to a modified version of the same) as well  - https://en.wikipedia.org/wiki/Gitea - so gerrit would also force us to reduce the number of remote branches being pushed and having to track them across changes. it makes code review simple.
[16:51:39] <Guest71>	 Thanks majavah . I'd just follow the path once.
[16:51:54] <majavah>	 (I guess one reason there's not that many sre volunteers is that getting sre-level access to the production realm is kind-of impossible for a volunteer)
[16:52:59] <Guest71>	 Oh, hmm.. that's sadness - as SRE's or admins we need a bit of elevated privileges - i understand that
[16:57:45] <ebernhardson>	 Guest71: i suppose its worth mentioning as well, we have open SRE positions, they are all remote,  and they come with SRE access :)  https://boards.greenhouse.io/wikimedia/jobs/3338061?gh_src=e64170451us
[17:01:46] <ebernhardson>	 Guest71: as to somewhere to start contributing, i was referred to https://phabricator.wikimedia.org/T273673 which seems like a reasonable place to start, not too dangerous but still enough to start meeting some people and reading some code
[17:05:07] <Guest71>	 Thanks ebernhardson . This is really a nice oppurtunity, I'll definitely consider applying to after making up a few contributions. I have least experiences as SRE using open source tools like grafana, prometheus, jenkins - due to policies i have experience with aws tools only like cloudwatch, cloudformation etc.. except for coding on
[17:05:08] <Guest71>	 python/java/ruby - i'd have a lot to learn about using these tools although the concepts behind them remain same.
[17:11:20] <Guest71>	 I will work upon T273673 and have an update or even a patch by end of this week. It's almost my bed time now as I am from India. Good night, all. Thanks all for the great work.
[17:11:20] <stashbot>	 T273673: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673
[17:12:10] <Guest71>	 whot, this bot acts smart (y) . Nice piece of work.
[17:15:38] <ebernhardson>	 Guest71: g'night
[17:36:00] <ebernhardson>	 was reviewing LVS docs ...this isn't scary at all: get a list of all IP addresses served by this LVS server - you're going to check that they all exist after your change
[17:45:26] <volans>	 hello search, I've a quick one for you. The relforge setup is still active/passive? It happen that the 2 relforge hosts are part of a very small subset of hosts that have an issue with lldp and we have a test command to try to see if it fixes it. And was looking for a host that wouldn't cause any disruption
[17:47:22] <volans>	 all the others are active swift hosts, so I was checking if maybe I could get lucky with relforge...
[17:47:25] <volans>	 :)
[17:48:36] <volans>	 and to be clear, the command shouldn't cause any disruption in theory, but because is doing echo to /sys/kernel/debug/i40e/... you never know what can happen ;)
[17:55:43] <ryankemper>	 volans: usually I hear active/passive in reference to different datacenters
[17:55:52] <ryankemper>	 in this case do you mean is one of the hosts active and the other passive? because both relforge are in eqiad
[17:56:29] <volans>	 ryankemper: yes I meant if there is one host that is less critical than the other
[17:56:37] <volans>	 that could be a good candidate to try this
[17:56:44] <volans>	 T290984 for context
[17:56:45] <stashbot>	 T290984: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984
[17:57:23] <ryankemper>	 volans: I'm actually not sure, we might choose one of the hosts (if it were, i'd guess `relforge1003`) to preferentially route queries to, let me see if hiera has anything
[17:57:47] <ryankemper>	 but which-host-is-less-important aside, yes feel free to run the commands there, relforge is quite low impact
[17:57:51] * ryankemper checks hiera...
[17:57:52] <volans>	 I see that relforge1004 has load of 1 and relforge1003 of 0
[17:58:01] <ryankemper>	 ah so my guess was backwards :P
[17:58:22] <volans>	 so I assumed that, also because of says https://wikitech.wikimedia.org/wiki/Service_restarts#relforge
[17:58:28] <ebernhardson>	 relforge should be safe, those don't do anything automated
[17:58:38] <ebernhardson>	 those are strictly places for us to break, err test, things
[17:59:25] <ryankemper>	 per `hieradata/role/eqiad/elasticsearch/relforge.yaml`, `relforge1003` is the unicast host (the master)
[17:59:46] <ryankemper>	 at least IIRC that's what the unicast host means, 1 sec
[18:00:07] <ryankemper>	 yeah it is
[18:01:10] <ryankemper>	 very interesting that `relforge1004` has all the load, I would expect `relforge1003` to be both a master and a data node, which would usually mean doing more work overall altho in this case I'd wager they should be identical since I doubt the master-specific stuff takes up much load at all
[18:03:48] <ebernhardson>	 i'm surprised they have any load at all :)
[18:04:38] <ryankemper>	 haha yeah
[18:04:54] <ryankemper>	 I bet `relforge1004` just happened to get some heavier (in terms of queries, resources used etc) shards
[18:05:07] <ebernhardson>	 the thing is, there is no automated querying or indexing
[18:05:13] <ryankemper>	 ah
[18:05:39] <ebernhardson>	 so they really should be doing nothing unless someone is currently doing something, i guess trey could be
[18:05:41] <ryankemper>	 volans: either host is fine. technically `relforge1004` is the "safer" one given it's not the master, although with a 2 host cluster like relforge it really doesn't matter either way :P so fire away
[18:06:09] * volans in a quick emergency, brb in 5
[18:08:13] <volans>	 ryankemper: great, thanks a lot, we'll try on 1004
[18:08:23] <volans>	 cc topranks ^^
[18:08:47] <topranks>	 ok cool.
[18:15:06] <volans>	 FYI tested on 1004 and it's all good, it worked as expected, we'll go over the others tomorrow
[18:22:37] <ryankemper>	 great
[18:51:02] <hashar>	 ebernhardson: looks like that Guest71 has a lot many layers of smartness and experience :]
[18:51:57] <ebernhardson>	 hashar: indeed, seems promising :)
[19:43:28] <ryankemper>	 ebernhardson: spot checking `hieradata/role/common/wdqs/public.yaml` versus `hieradata/role/common/wcqs/public.yaml` for https://gerrit.wikimedia.org/r/c/operations/puppet/+/720078 and noticed that wcqs is gonna write to the same log dir as wdqs, and also wcqs' data dir is `/srv/wdqs-data`
[19:43:30] <ryankemper>	 see https://github.com/wikimedia/puppet/blob/545b517cea678d8fdeadeb6051e0f3757bd4ebff/hieradata/role/common/wcqs/public.yaml#L4-L5
[19:44:03] <ryankemper>	 for the scope of that patch it doesn't matter, but was just curious if those are the values we want
[19:44:04] <ebernhardson>	 ryankemper: oh good call, i'm still thinking elasticsearch where log files and data dirs come in sub directories under that path, and the server keeps it all separate
[19:44:37] <ebernhardson>	 ryankemper: probably we want them separate, even though they are on different servers
[19:44:40] <ebernhardson>	 i guess? 
[19:44:50] <ebernhardson>	 or we rename everything to be generic, query_service
[19:45:04] <ebernhardson>	 (but not on wdqs, because that's tedious...so maybe no renaming :P)
[19:45:06] <ryankemper>	 oops the gerrit patch I linked was teh wrong one, meant https://gerrit.wikimedia.org/r/c/operations/puppet/+/719643 (rest of the stuff still applies tho)
[19:45:22] * ebernhardson didn't actually open anything, but is guessing which patch anyways :)
[19:45:30] <ryankemper>	 good! that worked out then :P
[19:45:59] <ryankemper>	 ebernhardson: hmm yeah generic would be ideal but renaming wdqs is meh
[19:46:15] <ryankemper>	 so maybe just call it query_service on `wcqs`, and pretend we'll eventually circle back and rename it on wdqs but never do?
[19:46:21] <ebernhardson>	 i suppose, in terms of making a concious decision ...i don't want to put wcqs into a folder named wdqs. We have enough lies already :) 
[19:46:23] <ryankemper>	 that's usually the way to do it imo :)
[19:46:27] <ryankemper>	 agreed
[19:46:33] <ebernhardson>	 so the question is, wcqs or query_service wth the plan to "someday" change wdqs
[19:46:45] <addshore>	 do you do any transfering of JNL files around between query servies right now? for any reasons?
[19:46:58] <ebernhardson>	 yea, lets just call it wcqs
[19:47:51] <ryankemper>	 addshore: that's a good question...the datasets should be entirely different AFAIK so we should never be doing a transfer between different service types
[19:48:38] <addshore>	 I still wanted to, and never tried, SCPing a JNL file out of a wdqs test server to try loading it using the docker images on another server
[19:49:17] <ryankemper>	 addshore: wait I was thinking your question was related to the wdqs vs wcqs stuff above...are you just asking in general if we ever transfer between hosts? like between `wdqs1003` and `wdqs1004`?
[19:49:25] <ryankemper>	 or were you asking if we transfer from a wcqs host to a wdqs host and vice versa
[19:50:48] * ryankemper has never actually looked at the `wcqs.jnl` on wcqs-beta-01...it's 3.3T, had no idea it awas that big
[19:51:02] <ebernhardson>	 O.o
[19:51:16] <ryankemper>	 unless it's that free allocators bug hitting us
[19:52:01] <ebernhardson>	 i actually know nothing about blazegraph :) I have some memory that it has a log-structured storage much like elastic, cassandra, kafka, etc. But otherwise no clue
[19:52:07] <ebernhardson>	 is there some compaction type thing we have to run?
[19:52:22] * ebernhardson guesses wildly, probably not useful :P
[19:52:38] <dcausse>	 it's a known issue, we do full reloads for wcqs and blazegraph is unable to reclaim freed space
[19:53:10] <ebernhardson>	 ahh, ok i do remember something about that
[19:53:38] <ryankemper>	 dcausse: we do full reloads to regenerate the jnl file and restore its appropriate size right? i.e. you're not saying that even after a full reload the journal stays huge?
[19:54:23] <ryankemper>	 I guess when the issue crops up on wdqs we solve it by transferring from a healthy server so I've never actually seen if the full reload fixes it too, but I would imagine it does end up the correct size after a full reload
[19:54:45] <dcausse>	 ryankemper: yes it is, as we use a different namespace (e.g. wcqs20210901) but when deleting the old one blazegraph fails to free the unused space
[19:55:36] <ebernhardson>	 dcausse: as long as your around, i guess probably no clue, but any idea why blazegraph package_dir is /srv/deployment/wdqs/wdqs?  I'm use to /srv/deployment mimic'ing the gerrit layout
[19:55:49] * ebernhardson was changing it to /srv/deployment/wcqs/wcqs, and that seems kinda stupid :)
[19:55:55] <ottomata>	 ryankemper:  https://phabricator.wikimedia.org/T285355#7352970 :)
[19:56:29] <dcausse>	 ebernhardson: I think this is related to scap?
[19:57:32] <ebernhardson>	 dcausse: yea, it's the wikidata/query/deploy repo
[19:58:29] <ryankemper>	 ottomata: linkrecommendation has been switched from hardcoded thorium to the analytics-web cname, which is currently pointing to thorium
[19:58:52] <ryankemper>	 ottomata: so when ready to cut over, the procedure should be to switch the cname over to analytics-web and it won't require an actual configuration change of the link recommendation helmfile
[19:59:00] <dcausse>	 hm I have no clue why it was cloned under "wdqs"...
[19:59:18] <ebernhardson>	 maybe it's just some historical artifact and we are too lazy to change it :)
[19:59:19] <ottomata>	 ah!  great
[19:59:35] <ottomata>	 ryankemper:  that has been done in prod k8s clusters too?
[19:59:39] <ottomata>	 not just staging?
[19:59:53] <ebernhardson>	 i guess i feel awkward about copying things that look silly to wcqs, but then wcqs and wdqs will somehow end up different enough to make other things (automation, etc) more difficult...
[19:59:57] <ryankemper>	 ottomata: correct, I deployed it to staging and then eqiad/codfw a few days after that
[20:00:09] <dcausse>	 yes, should be /srv/deployment/wikidata/query/deploy I guess?
[20:00:13] <ryankemper>	 ottomata: and I verified that the pods automatically restarted, so what's running now has been using that cname for a couple weeks at least
[20:00:33] <ottomata>	 ok great
[20:00:34] <ottomata>	 awesome
[20:01:13] <ebernhardson>	 yea i suppose, make it slightly better and hope to some day™ switch wdqs :)
[20:03:22] <dcausse>	 yes these "cosmetic refactoring" (renames) never tend to happen :/
[20:04:37] <dcausse>	 I'm now curious why it's "wdqs", never realized since now it did not match the gerrit repo
[20:11:14] <dcausse>	 I guess for wcqs we could have another query_service deployment repo in /srv/deployment so that :package_dir feels a bit less weird but I have no clue if that'll open a can of worms
[20:40:31] <ryankemper>	 ebernhardson: dcausse: so as far as stuff like wcqs currently using `/srv/wdqs-data`, thoughts on `/srv/wcqs` versus `/srv/query_service`?
[20:41:04] <ryankemper>	 Even if we never circle back on wdqs `query_service` is still not a bad name for it for wcqs, but then again if we're not gonna have a new SPARQL service in the next few years then there's probably no reason not to go with `wcqs`
[20:41:14] * ryankemper is probably overcomplicating naming things as always
[20:41:20] <ebernhardson>	 names are important! 
[20:41:26] <ebernhardson>	 but also, i dunno :P
[20:41:49] <ryankemper>	 ok my gut is telling me `query_service` so I'm just gonna go with that
[20:41:54] <ebernhardson>	 sounds good
[20:41:56] <ryankemper>	 it's marginally more resilient to naming changes
[21:04:00] <ryankemper>	 ebernhardson: grabbing some water real quick but want to pair on getting as many of these wcqs puppet patches merged as possible? and figuring out a loose dependency tree for the ones that aren't immediately mergable
[21:04:30] <ebernhardson>	 sure
[21:09:31] <ryankemper>	 ebernhardson: i'm ready now, I just joined meet.google.com/zqq-oefa-hzv