[08:20:57] relocating [10:27:08] lunch [10:35:23] relocating [10:51:43] Lunch [14:59:26] oops, left my headphones on all night...somehow they still have 20% battery [15:00:58] SDineshKumar: awesome! I'll take a look over it today [15:01:17] Thanks. [15:31:14] dcausse: question, should we start loading data into WCQS? I need /readiness-probe to return success before we can set up LVS [15:31:36] dcausse: or is there some way to create an empty namespace that makes that happy? [15:31:51] ebernhardson: simply touch the data_load file flag I think [15:32:22] I think the namespace is already there [15:32:23] dcausse: it won't need a wcq namespace to query against? [15:32:25] ahh, ok [15:32:27] lemme try [15:33:46] dcausse: hmm, i'm not sure which thing data_load is. nginx has a data-reload, but that's inverted where it only complains if the file doesn't exist [15:33:56] err, it only complains if the file exists [15:34:26] is it a blazegraph thing? Querying 9999 directly gives 503 [15:34:38] * ebernhardson should learn more about blazegraph....it's slowly working :) [15:59:21] ebernhardson: sorry was distracted, what's the machine? [15:59:36] dcausse: i'm testing wcqs1001, but i think all of them [15:59:45] dcausse: but we can just make one work, the rest i can figure out :) [16:00:09] for WCQS you can also use the data reload script [16:00:12] i'm kinda confused because i can't convince blazegraph to say anything in its error logs, but it 503's all requests [16:00:56] actually, it would make a lot of sense, since once we're ready with the streaming updater for WCQS, we'd have the data already bootstrapped based on the latest dump, no matter when we go ahead with that [16:01:01] i suppose i mean, i can't even get the workbench [16:01:08] or maybe we disabled that? [16:01:15] I don't think so? [16:01:16] no [16:01:54] zpapierski: there's https://gerrit.wikimedia.org/r/c/operations/puppet/+/721280 which has topic names and the deployment-charts [16:02:02] probably some dashboards too [16:02:35] hm not the deployment-charts actually [16:02:50] the job config in the deploy repo [16:03:02] zpapierski: how long does commons take? If we can get them responding i can probably run that later today [16:03:03] will look at it after I'm done with the cookbook [16:03:19] ebernhardson: you mean data reload? about 2 days [16:03:24] (or more now) [16:03:30] ebernhardson: not sure if someone mentioned this but `data_loaded` is just a file whose presence we use as a flag [16:03:33] hmm, yea i guess we won't deploy anything Friday, so today would be a good day to start [16:03:39] it's not a blazegraph thing, it's our own puppet config [16:03:57] wcqs_data_reload is a script that touches this flag [16:04:03] ryankemper: hmm, i only found the inverse, a data-reload flag that causes all requests to 503, but that's just nginx [16:04:14] right now i'm talking to blazegraph on :9999 [16:04:29] yay :) [16:05:18] huh, but it's 503 [16:05:21] so not yay [16:05:56] hmm, logs are empty?
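A minimal sketch of the flag-file check being discussed, assuming the flag lives under /srv/query_service and that the probe is reachable on the host itself; only the data_loaded name, the /readiness-probe path, and the :9999 blazegraph port come from the conversation, the rest is illustrative:

```bash
# Sketch only: the flag directory is an assumption; the conversation just says
# "data_loaded is a file whose presence we use as a flag" without naming the path.
FLAG=/srv/query_service/data_loaded
[ -f "$FLAG" ] || sudo touch "$FLAG"

# Blazegraph itself answers on :9999; the readiness probe is served in front of it,
# so check both to see which layer is returning the 503.
curl -si http://localhost:9999/ | head -n1
curl -si http://localhost/readiness-probe | head -n1
```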
[16:06:05] zpapierski: not empty, but they just have the bootup which looks happy [16:06:09] zpapierski: in /var/log/query_service/* now [16:06:10] (and logstash) [16:06:13] aah [16:06:50] i suppose for reference, wcqs and wdqs now have /etc/query_service and /var/log/query_service, those are both symlinks to wherever the actual /etc/wdqs and /var/log/wdqs or whatever dirs are [16:06:59] noted [16:07:07] yeah, looks happy [16:08:42] ebernhardson: starting blazegraph should have created an empty journal but I don't see it [16:09:16] oh sorry in /srv/query_service/ hm.. [16:09:31] oh, did i maybe change one path but not another one? [16:09:53] * ebernhardson goes to find a puppet catalog [16:10:18] https://www.irccloud.com/pastebin/YZXqLAvp/ [16:10:20] I see /srv/wdqs-data/ but no clue what that is [16:10:32] log config is somewhat lacking, but journalctl has it all [16:11:10] weirdly, it didn't match the name of the service (maybe I did something wrong), but it matched PID: [16:11:18] sudo journalctl _PID=1363 [16:11:20] Can someone rerun that script which removes deleted Wikidata items from the Wikidata Query Service? [16:11:34] right now things are at least 10 days stale and it's making planning maintenance tasks difficult [16:12:28] an example is https://www.wikidata.org/wiki/User:MisterSynergy/sysop/meta_items_without_sitelink, where the first item in that list was deleted on the 6th but is still returned by the source query even today [16:12:35] mahir256: sure, running this now [16:12:45] has lydia asked you about this within the last 10 days? [16:13:04] thanks dcausse by the way [16:13:50] zpapierski: hmm, also odd :s [16:14:34] it has -DblazegraphDefaultNamespace=wcq so the journal should be "wcq"... no clue what's going on [16:17:13] going offline [16:17:21] dcausse: g'night [16:17:32] ebernhardson: I might be mistaken, I see now that log is an old one [16:17:48] nothing newer than Sep 15 22:30 (I assume UTC) [16:18:05] hmm, maybe restart blazegraph and hope turning it off and back on again is the trick? :) [16:18:14] the read-only error still came from yesterday, which is when the server started [16:18:28] 22:30 about matches when puppet was finally happy enough to complete [16:18:33] need to go run to pick up a prescription real quick but I can take a look in ~30 mins [16:18:58] also matches the actual logs (whose time/date format is missing the date) [16:19:56] read-only doesn't make sense though, unless maybe it's some systemd layered filesystem weirdness? Because /var/log is on the root fs and dmesg would say something if that ever went read-only [16:20:21] i thought systemd only did that in /tmp though [16:20:22] yeah [16:20:27] and gc logs are there [16:20:50] otoh - the main blazegraph log was last written on Sep 15 [16:21:51] i'm gonna restart the service...couldn't hurt [16:21:59] sure [16:22:23] ahh, but GC logs are written not by blazegraph itself, but by the jvm [16:22:33] it probably makes no difference, though [16:23:01] still says read-only filesystem :) so at least it's reproducible [16:23:15] oh, actually. It's a stupid error message [16:23:21] java.nio.file.FileSystemException: /var/log/wdqs/query_event.log: Read-only file system [16:23:39] hmm, no that directory does exist... [16:23:43] wait, what's that :)?
[16:23:50] aaaa [16:23:53] I know what it is [16:23:58] we need to drop it [16:24:04] zpapierski: that's the new complaint in `journalctl -u wcqs-blazegraph` [16:24:06] forgot all about it [16:24:27] it's basically a thing that writes all queries to a file [16:24:40] we could analyze those differently on beta [16:24:51] but we have different means of getting that result on production [16:24:58] it should be adjusted to /var/log/query_service; the directory does exist and is owned by blazegraph, so the error is odd, but "ok" [16:25:07] i mean /var/log/wdqs currently exists on the server [16:25:21] (but is empty) [16:26:03] file-event-sender [16:26:07] that's the property [16:26:25] no, we don't need that on prod, we just need to set this property to false [16:27:45] it's set in wcqs.pp [16:27:49] https://www.irccloud.com/pastebin/tC4PHQsC/ [16:28:01] I think we can basically drop that [16:28:33] ok, should be easy enough [16:28:36] I'd drop it completely, I don't think we will care about statistics on wcqs-beta [16:28:42] after having a production service [16:29:02] I don't remember what that if-header was... [16:32:08] but since only wcqs was using it, we can drop it - it will be consistent with wdqs state [16:32:55] need to step AFK, but will be close by [16:33:21] zpapierski: i think https://gerrit.wikimedia.org/r/c/operations/puppet/+/721575 will do it [16:33:35] going to manually edit the unit and try before jumping through merging hoops [16:34:14] i guess it's in defaults, only referenced by systemd [16:35:31] looks happy :) At least no complaining in journalctl, and the workbench html is returned [16:36:36] the readiness probe returns an http 200 with xml saying false. So maybe? [16:36:45] * ebernhardson assumes whatever calls readiness-probe ignores the content [16:54:28] Cool :) [16:54:46] reload script then? [16:55:16] It takes some time to actually start feeding blazegraph, though; download and munging come first [16:55:30] Unless you prefer to copy data from wcqs-beta? [16:55:55] Journal (which is probably huge by now) or munged data [17:03:45] A weird thing just happened. I reindexed all the Czech (cs) wikis, but cswiki and arbcom_cswiki didn't take the new unpacked config. All other cs wikis used the new config. There were no errors or warnings, and it seems to be the same on all three clusters (eqiad, codfw, cloudelastic). I'm not even sure where to start looking. ebernhardson or ryankemper—any ideas? [17:05:42] Trey314159: hmm, which machine did you reindex from (has logs)? [17:06:41] mwmaint1002 /home/tjones/reindex/cirrus_log/cswiki.* [17:09:14] for comparison, cswikisource reindexed correctly, and I don't see anything different in the logs. [17:10:11] i suppose as a first step i was mostly going to double-check naming. The logs should say the name of the index (which has a timestamp in it), and then verify they were promoted [17:11:18] Trey314159: oh, i bet i have an idea [17:11:23] I checked for duplicate indexes, but not that specific indexes were promoted... [17:11:31] Trey314159: cs.wikipedia is 1.37.0-wmf.21, cs.wikisource is 1.37.0-wmf.23 [17:11:37] is your patch in .22 or .23? [17:11:52] Yeah. I thought I checked that! [17:12:11] Did something get rolled back recently? [17:12:19] Ugh! That's it.
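One quick way to cross-check which train a wiki is actually on before blaming a reindex (a sketch; versions.toolforge.org is the usual view, the siteinfo query below is just the generic MediaWiki API equivalent and the jq filter is illustrative):

```bash
# meta=siteinfo reports the running MediaWiki version for each wiki,
# e.g. "MediaWiki 1.37.0-wmf.21" vs "MediaWiki 1.37.0-wmf.23".
for wiki in cs.wikipedia.org cs.wikisource.org; do
  printf '%s: ' "$wiki"
  curl -s "https://${wiki}/w/api.php?action=query&meta=siteinfo&format=json" \
    | jq -r '.query.general.generator'
done
```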
[17:12:21] back [17:12:48] Trey314159: i hadn't remembered anything, but checking my emails about the train, they sent an email yesterday that wmf.23 was rolled back to group 0 [17:13:00] i'm not sure what happened to .22 [17:13:24] no train that week, rel eng was having a special week [17:13:26] I looked at https://versions.toolforge.org/ and I swear I thought they all said .23... but that was yesterday. Ugh. Okay, mystery solved. Thanks [17:14:31] Lesson learned: the deployment train, unlike time, does not move in only one direction. [17:14:37] :) [17:15:10] ryankemper: cool, we figured out the issue was some paths. z said we can just remove the piece entirely, was only used for -beta and not needed anymore [17:15:22] ryankemper: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721575 [17:16:11] with that i *think* blazegraph will be happy enough to continue setting up the rest of routing. [17:16:25] * ebernhardson says that a lot, often not true :P [17:22:39] Trey314159: common misconception, time *can* move backwards! just not in real life, only during the DST transition of a poorly coded program :P [17:24:01] ryankemper: true enough. I also enjoy when it stretches out at 11:59:60 when we need a leap second. [17:24:20] :') [17:25:57] SDineshKumar: i glanced over your patch, but ladsgroup already gave it a review and he knows more than i do about those pieces :) [17:27:55] also i dunno how familiar you are with our systems, but -1 basically just means there are things to be improved before merging [17:28:14] (-2 would be please abandon, or at least talk about whether this is a reasonable idea :) [17:34:03] Thanks for taking a look. I appreciate your time. I understand about the -1 and -2 part. I thought jenkins would automatically run a test build when requested for a CR. To my shock, that didn't happen. I will work with Ladsgroup and get it done. I am sure I will find him/her/them in one of the wikimedia IRCs. [17:34:08] ebernhardson [17:37:34] ebernhardson: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/721575, was the previous behavior just leading to that read-only filesystem log line you mentioned above? or something else too [17:37:45] ebernhardson: (wondering what I should be looking at to verify it had the intended effect) [17:39:28] Yeah no errors in `journalctl -u wcqs-blazegraph` so looks good [17:40:24] Hi all, I have a couple of questions regarding CirrusSearch. The easier of the two is this: One of our wikis, which recently had CirrusSearch deployed using AWS Elasticsearch (now called AWS OpenSearch), has a $wgMetaNamespace like Foo_Bar_Baz. It turns out that the editors had a redirect so they could search for stuff under FBB: instead of the [17:40:25] full namespace name. Is there any way to get CirrusSearch to recognize this, like a "namespace alias"? [17:44:42] The second question is more complicated. The six wikis are on MW 1.35 and I have a full dev environment that matches my live environment, just smaller resources for load and cost. I am working on a MW 1.36 upgrade, so I have to test it first by duplicating most of the dev environment (EC2 servers, Aurora MySQL clusters, Elasticache cluster, EFS [17:44:43] filesystem, and now AWS Elasticsearch) so I don't interfere with the 1.35 dev env. I'd rather not duplicate the Elasticsearch cluster, though, so I was thinking of just pointing the 1.36 web servers at it and reindexing, then testing 1.36. Then I may need to revert the ES cluster back to the 1.35 servers for the time being.
Could I just reindex [17:44:43] from the 1.35 servers again or would that not work since it's been indexed against a newer MW version? [17:46:38] ryankemper: yes, although i still don't understand what was read-only about it [17:47:26] justinl: hmm, thinking [17:51:36] The further problem with the 1.36 upgrade is that I then need to actually do the live env. First I build a new live env, get everything working there, and then cut over during a maintenance downtime to the new web servers, run update.php, etc., and reindex. My guess is that I'd just not point the new 1.36 live wikis at ES at all, defaulting to [17:51:36] the built-in search engine until it's time for the cutover, since I'd already proven it works in dev. Not ideal, but I can't think of a better way without a full copy of the ES cluster, including a full initial index build. [17:52:28] I'm not married to this procedure, I'm open to any suggestions. I just have the basic requirement that I always have a functioning dev and live 1.35 environment until I'm ready to do the 1.36 cutover. [17:52:51] Again, also trying to balance the cost, time, and complexity of the build and test procedures. [17:52:51] justinl: for the first, i know there is a $wgNamespaceAliases config in mediawiki. Cirrus uses mediawiki's namespace prefix parser, i expect that should respect wgNamespaceAliases but not 100% sure [17:53:03] I'll look at that. [17:53:58] justinl: for a test installation, if you aren't doing any cross-wiki searching or whatnot the easiest might be to set $wgCirrusIndexBaseName on the test instances. By default this is the wikiid, but you could set it to something else for testing [17:54:09] justinl: it can't have underscores or spaces though, just a-z [17:54:25] $wgCirrusSearchIndexBaseName actually [17:54:53] that would allow you to point them all at the same clusters, but i suppose there could be some worry that a misconfiguration would allow testing to overwrite prod [17:55:17] To give you a concrete example, the wiki in question has a $wgMetaNamespace of Guild_Wars_2_Wiki. Users used to be able to search for e.g. GW2W: and it would give GW2W namespace recommendations in the dropdown list. Now it doesn't. [17:55:35] So there are underscores but only in the regular namespace, not the alias. [17:56:05] They seem to be doing it with redirects, but I'm not 100% sure. It's not a setting in LocalSettings.php but somewhere in the wiki content, with which I'm not familiar. [17:57:02] ebernhardson: just realized we should change https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/query_service/wcqs.pp#34 in case we enable query logging in the future...presumably `/var/log/query_service...`? [17:57:36] ebernhardson About to jump into a meeting but I'll be checking back after that. Appreciate the advice! [17:58:43] or more accurately it should be "$log_dir/query_event.log" [17:58:57] ryankemper: hmm, yea that sounds appropriate [18:04:13] justinl: for the specific question of whether an index populated by 1.36 could be read by 1.35, overall it should be fine. There might be some differences in how the content is prepared to ship over to search, but they should be small [18:04:31] the schema should still be the same [18:06:52] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721594 could use a rubber stamp [18:07:55] ryankemper: in retrospect ... i wonder if I looked over this much too quickly and should have deleted more?
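A small follow-up sketch for verifying the cleanup discussed above: confirm the file-event-sender option is really gone from the unit systemd is running and that nothing still writes the old query_event.log. "file-event-sender" is the property name quoted earlier; check the exact spelling in the unit rather than trusting this grep, and treat the paths as approximate:

```bash
# Unit and file names are the ones mentioned in the discussion above.
systemctl cat wcqs-blazegraph | grep -i 'file-event-sender' \
  || echo "no file-event-sender opts in the unit"
ls -l /var/log/wdqs/query_event.log 2>/dev/null || echo "no stale query_event.log"
```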
[18:08:18] ryankemper: the logrotate::rule is going to rotate the file we no longer create; if it's removed, $query_event_log becomes unused [18:09:13] ebernhardson: the logrotate rule has `notifempty`, although unsure what it might do if there's no file whatsoever [18:09:45] ryankemper: right, but i mean there is no longer any code that can possibly add logs to the file, it's vestigial [18:10:53] hmm, actually vestigial is the wrong word :) But if we are no longer writing to the file then there are no logs to rotate moving forward [18:10:55] ebernhardson: oh like, even if we still had the extra_jvm_opts for that enabled? or do you just mean now that we've disabled it [18:11:20] ryankemper: right, that path was only referenced in the jvm opts we just removed afaict [18:11:25] There's an argument for keeping the logrotate stuff in place so that we can flip the query logging back on when we want to [18:11:31] although ideally it'd be a boolean parameter [18:11:42] and not implicit in us deleting / adding back in the opts when we want to disable/enable respectively [18:11:47] isn't it already logging to eventgate though? This was for on-disk logs because beta doesn't have eventgate [18:12:00] sec lemme double check [18:12:18] ebernhardson: ah, I was about to follow up by asking what it does in general, because query logging sounded important, so it makes sense if it was an on-disk hack just for the beta cluster [18:13:44] this part is sending the same logs to eventgate and is part of the deploy opts: -Dwdqs.event-sender-filter.event-gate-endpoint=https://eventgate-analytics.discovery.wmnet:4592/v1/events?hasty=true -Dwdqs.event-sender-filter.event-gate-sparql-query-stream=wcqs-external.sparql-query [18:13:55] so ya, looks like that was strictly for beta because it didn't have the supporting services of prod [18:14:33] * ebernhardson is going to be forever annoyed at how many wcqs things say wdqs :P [18:16:25] okay yeah I'll rip the logrotate out [18:17:21] although oddly, checking a random wdqs server i don't see event-gate mentioned anywhere. Wonder why [18:17:30] (in the jvm args for blazegraph, specifically) [18:22:00] huh, i was just lucky. 1003-08 have it, 1011-12 have it. But 1009 and 1010 are missing it [18:22:04] i suppose because they are test servers [18:22:05] ebernhardson: okay https://gerrit.wikimedia.org/r/c/operations/puppet/+/721594 is now the patch to rip out the logrotate rule, pcc's running now on the new patchset [18:22:14] yeah makes sense [18:22:29] ebernhardson: we don't have event gate set up for wcqs yet right? i.e. that's something we'll have to circle back to add when it's ready? [18:23:05] Since `profile::query_service::blazegraph` has `"-Dwdqs.event-sender-filter.event-gate-endpoint=${event_service_endpoint}"` but not the corresponding `wcqs.pp` version [18:23:41] ryankemper: hmm, the jvm args on wcqs1001 have it [18:24:04] ryankemper: i think it's controlled via the `profile::query_service::sparql_query_stream` hiera key, which i set to wcqs-external.sparql-query [18:24:15] we might have to tell eventgate what schema to use on that, not sure.
Will check [18:24:24] (or it might trust the incoming events declaring their schema) [18:24:31] another naming gripe that we prob don't have an elegant way to fix but should consider doing at some point...what we call `profile::query_service::blazegraph` is really what we'd call `profile::query_service::wdqs` [18:25:27] ryankemper: hmm, profile::query_service::wcqs mostly just invokes profile::query_service::blazegraph, suggests it's generic? [18:25:50] ebernhardson: doh, totally missed that [18:25:54] okay that explains why wcqs does have those opts then [18:26:04] yay that makes life easier, no renaming needed :) [18:26:26] i am waffling so much on renaming lately...i really want to but then the number of steps to perform surgery on the live instances is...tedious [18:26:52] we should wherever we can though, i think :) [18:27:33] yeah definitely anywhere that doesn't require surgery, or at least surgery more complicated than one or two commands [18:27:56] we've been doing a decent job so far of renaming the wcqs stuff to use `query_service` etc so just gotta keep that up I suppose [18:28:23] yup. [18:28:41] ebernhardson: okay pcc completed on https://gerrit.wikimedia.org/r/c/operations/puppet/+/721594 and looks as expected: https://puppet-compiler.wmflabs.org/compiler1001/940/ [18:28:58] I'll want to delete `/etc/logrotate.d/query_event_sender_log` when we merge that [18:30:27] for other patches currently pending, https://gerrit.wikimedia.org/r/c/operations/puppet/+/717630 adds the apache config to the microsite so it serves the wcqs ui, and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/721089 declares it as a wikimedia_cluster but i don't know what exactly that does, just that puppet was complaining we didn't [18:31:12] ebernhardson, ryankemper: I'm here for my usual pairing session with Ryan, but if you're already pairing on merging those puppet patches, maybe better to continue [18:31:38] gehel: any chance you know what declaring a wikimedia_cluster does? :) [18:31:47] looks like something with maybe cumin or prometheus? [18:31:53] ebernhardson: I'm going to need some more context [18:32:17] gehel: in hieradata/common.yaml there is a wikimedia_clusters key that lists ~100 unique clusters in our datacenters [18:32:28] lemme find which thing was complaining we didn't declare it [18:36:05] hmm, the error was certainly the `fail("Cluster ${cluster} not defined in wikimedia_clusters")` in modules/profile/manifests/base.pp, but having trouble finding where puppet wrote that [18:36:36] ebernhardson Great, thank you! I'll still have to figure out if my current thoughts on how to handle the live wiki 1.36 testing prior to the cutover will be sufficient. Hopefully the dev 1.36 testing will be enough. [18:37:12] ebernhardson: I don't remember anything about that. But I might have created a new cluster only once since I've been here :/ [18:37:14] justinl: for the namespace thing, if wgNamespaceAliases doesn't work i'm not sure, i don't think we have any other handling that would help there. [18:38:18] Ok, thanks. I'll give it a try and report back. May not be until early next week, though. I really appreciate your thoughts and advice.
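For reference, a sketch of what the suggested settings could look like in LocalSettings.php; the GW2W alias targets and the test index name are illustrative assumptions (NS_PROJECT/NS_PROJECT_TALK are only correct if Guild_Wars_2_Wiki really is the project namespace, and the base name must be letters only per the note above):

```php
<?php
// Let "GW2W:" resolve like the full "Guild Wars 2 Wiki:" prefix in search and links.
// Assumes the target is the wiki's project namespace; adjust if it's a custom one.
$wgNamespaceAliases['GW2W'] = NS_PROJECT;
$wgNamespaceAliases['GW2W_talk'] = NS_PROJECT_TALK;

// For throwaway 1.36 test wikis that share the 1.35 Elasticsearch cluster:
// give them distinct index base names so they can't clobber the live indices.
// (Example value only; defaults to the wiki ID when unset.)
$wgCirrusSearchIndexBaseName = 'gwtestwiki';
```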
[18:38:27] gehel: oh i remember now, i added the `cluster: wcqs` to hiera, and then it failed, so we reverted [18:39:13] probably something to ask in #wikimedia-sre [18:39:19] alrighty [18:39:47] justinl: np, glad to help [18:40:04] ebernhardson: yeah let's ask in #wikimedia-sre, at a minimum I can see it gets used in `modules/profile/manifests/prometheus/ops.pp` to generate a list of hosts running memcached, apache, etc [18:41:11] There's also https://phabricator.wikimedia.org/T234232 but while I gather that sanity checks were added cuz it breaks puppet if `wikimedia_cluster` is set to something invalid, nothing in the ticket mentions what `wikimedia_cluster` actually is :P [18:41:33] dropped a question in [18:42:15] yup, saw it [18:44:36] wonder what else we have to do ... i realized we need dsh groups so i'll put together a patch for that today [18:44:46] (otherwise scap won't sync to the new instances) [18:48:06] oh actually, scap lets you put the hostnames in the deploy repo now [18:48:12] * ebernhardson wonders if that's good or bad :P [18:50:16] :P [18:50:37] ebernhardson: so on the wikimedia_cluster stuff, we should wait to hear back but at a minimum the stuff pybal and prometheus are using it for wouldn't be impacted by us adding a new entry [18:50:51] since they're looking for specific subsets of the clusters [18:50:57] for the existing deploy it seems wdqs-canary is defined in-repo, but wdqs is defined externally. Should probably follow suit [18:51:09] sounds reasonable [18:51:11] ryankemper: makes sense [18:51:48] ebernhardson: back to the query event stuff, +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/721594? [18:52:35] ryankemper: lgtm [19:18:02] ebernhardson Turns out $wgNamespaceAliases works perfectly, thanks. Interesting that since the namespace is defined with the NS_PROJECT constant, there's already a "Project" alias. [19:24:49] justinl: awesome [22:00:32] ryankemper: turns out i broke puppet on wcqs-beta-01, too many pieces to remember... this should fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721623 [22:00:54] taking a look [22:01:25] beta says: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find resource 'Package[wdqs/wdqs]' in parameter 'before' (file: /etc/puppet/modules/query_service/manifests/common.pp, line: 89) on node wcqs-beta-01.wikidata-query.eqiad.wmflabs [22:02:00] it's basically back closer to where we started, but with Package this time :) [22:08:55] ebernhardson: I just run the puppet agent on `wcqs-beta-01` like normal right? [22:09:03] If so it seems to have the same error after merging [22:09:05] https://www.irccloud.com/pastebin/rwsktANR/ [22:09:14] ryankemper: i think we have to pull on wdqspuppet [22:09:20] or wait for the hourly puppet to pull [22:09:37] ah right, I thought I remembered there being its own puppet master or w/e [22:12:01] ebernhardson: great, error's fixed [22:12:19] excellent.
Wonder how many more times i'll break it ;) [22:12:39] I'm hoping between both of us we hit at least double digits [22:12:49] Otherwise we're clearly doing something wrong :P [22:13:09] lol [22:33:55] ebernhardson: arguably getting slightly sidetracked but I'm taking a quick look at SDineshKumar's patch and was wondering if this `$log` actually gets used anywhere: https://github.com/wikimedia/puppet/blob/1410712a0e0bba57268c145373b0fdabf5183e96/modules/elasticsearch/manifests/log/hot_threads.pp#L29 [22:33:59] it seems not but I may be missing something obvious [22:34:41] it looks like the actual hot_threads_logger.py you wrote knows where to write from a config file it reads: https://github.com/wikimedia/puppet/blob/1410712a0e0bba57268c145373b0fdabf5183e96/modules/elasticsearch/files/elasticsearch_hot_threads_logger.py#L28-L45 so i think that $log might be unused [22:41:26] Thanks ryankemper and @dzahn [22:43:20] SDineshKumar: great job on the patch! For puppet changes specifically, we have a tool called `pcc` that lets us see the effect of any proposed patch on the puppet catalog: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility) [22:43:21] I tried the `$script` as well, thinking systemd could run it in a subshell... after @dzahn's update the CI passed; i feel dumb for testing various possibilities for over 1.5 hours [22:43:56] SDineshKumar: ha, yeah I've had some facepalm moments in my time...stuff like I was editing the wrong directory entirely [22:44:06] or more rarely, weird stuff with smart quotes or some utf bug screwing up code that looks otherwise fine [22:44:52] Yes, Amir mentioned it. and he said it requires even more black magic to run PCC than just puppet and pointed me to https://bash.toolforge.org/quip/AVfTAUmefIH_7EDsriqu [22:45:52] SDineshKumar: hahaha I'm definitely saving that forever as well :P [22:46:36] Yeah I'm not sure if community members get integration.wikimedia.org CI keys or not, but fortunately we have this neat feature now where you can just put `Hosts: elastic2001.codfw.wmnet, elastic1003.eqiad.wmnet` in the commit message and comment on gerrit `check experimental` [22:46:39] as we are moving from gerrit to gitlab towards CD .. is there a similar path where we are moving from puppet to ansible etc.? [22:47:02] I'm working on the pcc part right now so no need for you to do anything, just explaining the process for context's sake [22:47:42] SDineshKumar: we're definitely moving to gitlab but we're going to be on puppet for the foreseeable future. Frankly all configuration management software is frustrating in its own way, so it's not so much that puppet is particularly bad so much as config management in general is a hard problem [22:48:06] I would bet good money we'll still be using puppet 5 years from now, but if we ever switch I dread the work involved in porting over to a new model :P [22:49:11] i see.. thanks, i will have that for my next CRs. i wish i didn't have to depend on someone else to add jenkins-bot for a build. I am used to sending multiple build requests previously. [22:50:07] I understand. migration in any part of the s/w world is a pain. I've been there.
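Spelling out the wcqs-beta loop mentioned a little earlier, as a rough sketch (the instance name comes from the conversation; the puppetmaster checkout path is the usual Cloud VPS location and is an assumption here):

```bash
# 1. On the project's standalone puppetmaster, pull the merged change
#    (path below is an assumption, not taken from the conversation):
#      cd /var/lib/git/operations/puppet && sudo git pull
# 2. Then on wcqs-beta-01.wikidata-query.eqiad.wmflabs, re-run the agent
#    instead of waiting for the hourly run:
sudo puppet agent --test
```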
[22:51:09] SDineshKumar: interesting, I always assumed it built by default, but looking back at the history of the patch I see it didn't seem to build until ebernhardson commented on the patch [22:51:38] I suppose it's vaguely possible it does build by default and it just got stuck in a queue forever and died, but that seems highly unlikely...if so, that is annoying! [22:51:47] sadly, i've never touched puppet/chef or other open source tools in my last 5 years due to office policies, which is why i'm a bit skeptical of this config mgmt stuff. [22:52:03] when you've got just a handful of hosts to manage, it's more trouble than it's worth generally [22:52:14] but as soon as you have a real fleet, or just many different services, it becomes essential [22:52:27] otherwise you basically end up with a feature-incomplete reimplementation of config management in bash...speaking from experience... [22:53:19] i agree. i've used good tools to manage about 1K-3K hosts and they were all pretty legacy. [22:53:19] SDineshKumar: anyway to continue my pcc explanation, we always want to run `pcc` on a representative sample of hosts. so if there's 10 hosts that all have the same role, we just run it on one of them, but if there's 8 hosts total split across 4 roles, then you want one host of each role, etc [22:53:40] You don't have access to cumin but in general I do something like this to get a list of affected hosts: [22:53:42] https://www.irccloud.com/pastebin/L5pCcTle/ [22:54:10] In this case the puppet file we changed primarily was `Class: elasticsearch::log::hot_threads` so that's why you see that in the cumin syntax [22:54:36] Anyway so seeing that the list of hosts is `cloudelastic[1001-1006].wikimedia.org,elastic[2025-2060].codfw.wmnet,elastic[1032-1038,1040-1067].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet`, I'll have PCC run on one cloudelastic host, one elastic codfw, one elastic eqiad, and one relforge [22:56:05] i see.. [22:57:16] So arbitrarily taking the first host in each of those ranges, that gives me `cloudelastic1001.wikimedia.org,elastic2025.codfw.wmnet,elastic1032.eqiad.wmnet,relforge1003.eqiad.wmnet` [22:59:06] Which we can then stick in the commit message following this syntax: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Gerrit_integration [22:59:07] not sure if you already know, but you can pass `C:elasticsearch::log::hot_threads` as the "host" to PCC and it automatically identifies hosts with different roles and runs it against them [22:59:39] legoktm: I didn't know that, neat! [22:59:45] interesting. [22:59:50] https://puppet-compiler.wmflabs.org/compiler1002/31109/ is the run I got from C:mediawiki::packages [23:00:00] I have a lot to learn and experiment with puppet. [23:00:15] me too ;) [23:00:16] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Host_variable_override has more shortcuts [23:04:39] "Hosts that compile with differences" [23:06:40] maybe that's the reason it was failing, since I was testing on a different config: I built it in a container (a debian 11 docker image running different versions of puppet and ruby) [23:12:24] this is a more reliable way of testing than https://wikitech.wikimedia.org/wiki/Puppet_coding#Testing_a_patch [23:17:32] i meant using PCC to test against multiple hosts, rather than testing on a local machine / in a container.
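Putting those two hints together, the commit-message footer for the hot_threads patch could look like the block below, followed by a `check experimental` comment on the change in gerrit; the host list is the representative set picked above, and the `C:` form is the class shortcut legoktm mentioned (syntax per the wikitech pages linked above):

```
Hosts: cloudelastic1001.wikimedia.org,elastic2025.codfw.wmnet,elastic1032.eqiad.wmnet,relforge1003.eqiad.wmnet
```

or simply `Hosts: C:elasticsearch::log::hot_threads` and let PCC pick representatives itself.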
[23:19:20] SDineshKumar: right, generally the local testing will help catch syntax errors and the like, whereas what pcc does that's special is it actually looks at what the effect will be based on how our various attributes are set in production (hiera variables, etc) [23:20:20] Note that PCC still isn't an "integration test", i.e. it tells us how puppet's model of what it's managing (the catalog) changes - i.e. what puppet thinks will change - but there are some types of things it won't catch [23:31:24] ebernhardson: stumbled across another small thing: https://github.com/wikimedia/puppet/blob/84f967be9b417aa2c94f3008052519413f902c39/modules/elasticsearch/manifests/instance.pp#L297-L299 [23:31:48] this part caught my eye: > If this was merged to master please fix. [23:33:01] Sounds like we can simplify it to just always be `"elasticsearch-${title}-gc-log-cleanup"` now...I don't fully get how the whole comparing pre / post multi-instance refactor thing actually works, but from what I gather `elasticsearch-${title}-gc-log-cleanup` is probably the one we want [23:33:46] Oh also you can ignore my earlier question about $log up above since that line will get deleted soon anyway, so the gc-log-cleanup is the only actual question I have right now [23:36:30] Also I just rubber ducked myself because based on https://puppet-compiler.wmflabs.org/compiler1002/31119/elastic1064.eqiad.wmnet/index.html the one with `${title}` is the one we want, so when I circle back to remove the old cron I will get rid of the `default` vs `'default'` stuff and just set it to `elasticsearch-${title}-gc-log-cleanup` [23:50:39] dcausse: (for tomorrow) it looks like our change to the elasticsearch unit file has some kind of syntax error, just noticed this in the logs: `Sep 16 23:39:12 elastic1064 systemd[1]: [/lib/systemd/system/elasticsearch_6@.service:21] Unknown lvalue 'ExecPreStart' in section 'Service'` [23:52:17] https://www.irccloud.com/pastebin/NB8Ekvou/ [23:53:12] dcausse: Ah, it's supposed to be `ExecStartPre` not `ExecPreStart`: https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStartPre= [23:55:33] (my fault for not catching it at time of deploy) [23:56:45] Interestingly the commit message for the original patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/720667/, seems to imply that a systemctl restart leads to /var/run/elasticsearch getting deleted but not created, yet we had (unintended) restarts earlier this week and presumably didn't see that error? Anyway, getting a patch up now
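A cheap way to catch this kind of typo before anything restarts (a sketch; systemd only warns about unknown directives like the misspelled `ExecPreStart` when it loads the unit, but the verifier surfaces the same warning up front, and for a template unit it wants an instance name, so `<instance>` below is a placeholder):

```bash
# Reports problems such as: Unknown lvalue 'ExecPreStart' in section 'Service'
systemd-analyze verify 'elasticsearch_6@<instance>.service'
```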