[01:09:13] bd808: thanks for the heads up, went to add a comment just now but saw it got fixed in https://phabricator.wikimedia.org/T300214#7655132 so prob not worth mentioning it at this point
[01:13:58] ryankemper: I made T300225 for it. It is related, but different than the other issue
[01:13:59] T300225: "publish-to-doc" jobs failing due to missing rsync on integration-agent-docker nodes - https://phabricator.wikimedia.org/T300225
[01:15:21] ah understood
[01:15:24] bd808: thanks for filing that
[08:03:31] ryankemper, inflatador: I've added you to https://gerrit.wikimedia.org/r/admin/groups/896bb182e55868ea25eec329ed1142e9f756f254,members
[08:41:20] ryankemper, inflatador: for when you're around: we have icinga alerts for elastic1068 and elastic1077
[08:42:02] We also have alerts about puppet disabled on a few WCQS servers. This seems related to testing authentication, but needs to be re-enabled as soon as possible
[08:43:05] I'm re-enabling puppet on those WCQS hosts right away
[08:56:59] cc ebernhardson ^
[11:03:35] lunch + errand
[11:04:50] lunch too
[14:10:11] heya dcausse !
[14:10:22] ah we need to rebuild the eventgate-wikimedia image with https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/737429 and deploy to eventgate-main
[14:10:28] eventgate-main has no dynamic schemas
[14:10:35] so it doesn't have fetch-failure 1.1.0
[14:10:43] getting email alerts about that
[14:11:07] i can do it for you...or if you like, I can teach you how!
[14:13:02] ottomata: oh sorry, I can do that in 20 mins?
[14:13:50] greetings
[14:19:26] yup
[14:26:55] ottomata: doing now
[14:27:35] ottomata: do you have a doc somewhere? :)
[14:29:56] yes
[14:30:08] https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#eventgate-wikimedia_schema_repository_change
[14:30:45] you can go ahead and just bump both schema repo versions to latest sha
[14:30:57] ok
[14:31:15] and actually, i have to do an eventgate-main deploy today with luca anyway, so if you are game, get it to the point of merging the helmfile values.yaml patch
[14:31:18] and i'll do the deployment for ya
[14:31:25] (unless you really want to! :) )
[14:36:09] I can :)
[14:36:26] ottomata: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/757665
[14:38:08] dcausse: nice thank you
[14:38:40] merged. wait for gerrit to tell you the new image version, and you can update values.yaml
[14:38:42] ottomata: so I just wait for the pipeline to push a patch to deployment-charts now?
[14:38:44] yup
[14:38:48] :)
[14:39:30] cool then I just apply that on all envs (staging/eqiad/codfw)
[14:39:50] ottomata: do you look at some graph while applying the chart?
[14:40:35] sure you could watch this one
[14:40:36] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1
[14:40:50] but also, the readinessProbe actually posts a test event through the service
[14:40:58] so ideally the pod will fail deployment if it can't do that
[14:41:35] if you do the deployment (please go ahead) you'll see there is also a CA cert change that elukey has been working on
[14:41:41] we've already deployed this for 2 other eventgate clusters
[14:41:46] it should be fine for eventgate-main too
[14:41:48] but we'll watch it
[14:41:51] ok
[14:43:14] oh Pipeline bot does not create a patch for me? :/
[14:44:38] 2022-01-27-143826-production
[14:44:40] no it doesn't
[14:44:58] can it?
[14:45:08] there are 4 different values.yaml files you might want to edit
[14:45:16] i think it doesn't have a way of knowing which one to do
[14:45:27] i guess it could do all of them by default, then you could edit the patch if you didn't want to do that
[14:46:15] it does for us but we have only one service
[14:46:48] ottomata: btw do I need to do all of them or just -main?
[14:48:57] given what you said earlier it does not seem needed since the schemas are pulled from the network but then all 4 are not on the same image
[14:51:04] ottomata: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/757667/
[14:51:38] just main is fine dcausse
[14:51:45] cool
[14:52:13] awesome, merged.
[14:52:20] ok shipping
[14:52:24] shall I deploy (or did you want to? I am happy to if you prefer)
[14:52:57] ottomata: either way, happy to do it if you feel it's something we should do ourselves in such circumstances
[14:53:16] why not go for it, i'm here
[14:53:26] ok doing it then
[14:53:28] a change like this should be safe to do without me too
[14:53:41] k i'll log that we are proceeding, since this also will change some CA certs
[14:53:53] ok thanks!
[14:54:06] hello folks
[14:54:20] elukey: o/
[14:54:25] I was reading the list of talks for a conf, and I saw https://www.applyconf.com/agenda/data-engineering-isnt-like-software-engineering/
[14:54:32] ok proceed dcausse
[14:54:33] for a moment I was confused
[14:55:00] (the speaker name seemed familiar :D)
[14:55:16] :)
[14:55:42] yes there is another one out there, and doing very similar stuff on top of that :)
[15:01:44] ottomata: checking the logs on staging I see nothing particularly wrong, should I go ahead or do you want to double check smth there?
[15:01:57] proceed!
[15:05:34] At an appt but looks like there’s a decent backlog of ppl ahead of me so might miss retro depending on how fast things go
[15:05:36] Hey, not sure if anyone saw T297454
[15:05:37] T297454: WCQS gives "502 Bad Gateway Error" - https://phabricator.wikimedia.org/T297454
[15:12:45] cbogen_: looking
[15:15:20] blazegraph is not even running on wcqs-beta-01
[15:22:19] going to start wcqs-blazegraph instead of sdoc-blazegraph (based on the journal it seems bigger)
[15:23:08] glad that we're soon moving out of wcqs-beta-01
[15:26:36] dcausse: thank you!
[15:37:53] getting 99% on the WDQS update lag SLO is harder than I thought :)
[15:38:06] we're just above at 99.1
[15:44:35] rebooting for OS update, back soon
[16:00:50] \o
[16:01:03] o/
[16:01:42] mpham, ryankemper, ebernhardson: retro time: https://meet.google.com/ssh-zegc-cyw
[16:01:56] be there in a minute
[16:04:39] sec missing headphones...
[17:15:59] back
[17:22:30] looking into `elastic1068`'s alerts, it's in a weird state
[17:24:00] `/var/log/elasticsearch` isn't present which is causing failures for `elasticsearch-production-search-eqiad-gc-log-cleanup.service`; and `/var/run/elasticsearch/` isn't present which is causing failures for `elasticsearch-disable-readahead.service`, etc
[17:24:13] sigh, that reminds me of a ticket from before. sec
[17:24:19] yes me too :/
[17:24:31] thought we pushed a fix for that tho
[17:24:32] I remember that ticket too, although I think it might not be that exact issue
[17:24:47] this is one of the new eqiad refresh hosts
[17:24:55] hmm
[17:25:15] on tuesday we tried to bring the fleet into service but the puppet run failed because of that dependency issue with `elasticsearch-oss` etc
[17:25:17] can we pull the same data moritz did before that said what deleted it? Or did he turn on extra logging to make that happen
[17:25:24] audit logs of some sort
[17:25:35] we reverted that patch but now that I think about it going from the `elasticsearch::cirrus` role back to `insetup` wouldn't undo the stuff it did
[17:25:43] so I think the problem is this is basically a half installed elasticsearch host
[17:25:59] is it bullseye?
[17:26:01] here's the log of the puppet run btw: https://phabricator.wikimedia.org/P19226
[17:26:10] dcausse: nope it's stretch
[17:26:19] ok
[17:26:57] `E: Unable to locate package elasticsearch-oss` seems like why it didn't have /var/run/elasticsearch
[17:27:00] ryankemper no rush, but fwd me the alert when you have a min? Don't see it in my email
[17:27:29] inflatador: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=elastic1068
[17:27:31] why is it installing logstash-oss ?
[17:28:23] dcausse: part of the `gelf_relay`, basically part of a slightly hacky shim observability had to put in place
[17:28:30] are we trying to push a jar to elastic folder? /usr/share/elasticsearch/lib/logstash-gelf.jar
[17:28:39] with it complaining about no elasticsearch-oss package, i suspect this is the previous problem of the apt component not being added in time. Can we maybe switch elasticsearch-oss package to use apt::package_from_component?
[17:28:57] * ebernhardson would have to look closer what guarantees that provides
[17:29:21] indeed that's the first error
[17:30:57] hmm, the problem with package_from_component is currently half of that is done in the profile and half in the module
[17:38:45] I'm suspicious of the lack of a `/etc/apt/sources.d/wikimedia.list` on the host
[17:41:27] er more specifically `/etc/apt/sources.list.d/wikimedia-elastic.list`
[17:41:52] That's the part on the working nodes that should make `elasticsearch-oss` visible:
[17:41:54] https://www.irccloud.com/pastebin/sPvwOSFB/
[17:42:26] https://www.irccloud.com/pastebin/lW2GhYgu/
[17:43:15] which is why it's so bizarre that the output of the puppet run on `elastic1068` has the following:
[17:43:19] https://www.irccloud.com/pastebin/Zu2GvgLk/
[17:43:30] ryankemper: i suspect an appropriate fix means taking apt::repository out of profile::elasticsearch and using apt::package_from_component in elasticsearch::packages
[17:43:42] but i worry about how many elastic installs there are :S
[17:44:34] ryankemper: essentially, it seems there is no dependency link between installing the elasticsearch-oss package and adding the elastic68 component. IIRC the two concepts were bundled into the single apt action specifically to deal with that
[17:45:01] I see
[17:45:39] can you elaborate on the bundling (if you remember)? is that what `apt::repository` is supposed to be doing
[17:46:07] ryankemper: by bundling i mean it's a single action that both adds the repository and installs the package, and it sets appropriate dependencies between them so it all happens in the right order
[17:47:09] ah I was confused on which part you were saying was supposed to be doing the bundling, but upon rereading I gather you're saying that `apt::package_from_component` is what will do that
[17:47:57] yup, that's it
[17:48:42] okay, lemme try getting a patch up with that approach
[17:49:43] ebernhardson: do you think we need to switch the approach just for the `wikimedia-elastic` block or for the `wikimedia-curator` as well?
[17:49:48] we closed the Gerrit Deploy Windows thing but if you want to work on it together let me know
[17:51:09] inflatador: sorry wdym by closed?
[17:51:18] ryankemper: I suspect best practices today would be anything coming from a component should come through package_from_component, in part from a DRY perspective but also to avoid curator having the same problem in future
[17:51:36] * ebernhardson needs to check if logstash-oss came from a separate component as well
[17:51:55] makes sense to me, I was thinking we might get bitten on the curator side of things too
[17:54:57] ryankemper yeah, the Google Meet room specifically. dcausse and I were talking deb pkgs, specifically https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/755750 . I can merge the patch but don't know next steps
[17:55:57] inflatador: ah gotcha. okay let's go over the plugin upload process when we pair with gehel later today
[17:55:58] looks like logstash-oss and elasticsearch-oss are supposed to come from the same component, not sure how that works with package_from_component :S Might have to check with other SREs
[17:57:52] ebernhardson: the other confusion I have is that `profile::elasticsearch` has some logic around deciding which component to use: https://github.com/wikimedia/puppet/blob/1badaf7efa82ab8baa3748fff3971535e9c78ab5/modules/profile/manifests/elasticsearch.pp#L32 / https://github.com/wikimedia/puppet/blob/1badaf7efa82ab8baa3748fff3971535e9c78ab5/modules/profile/manifests/elasticsearch.pp#L96-L119
[17:58:24] and it's not clear how that logic can be transported over to `elasticsearch::packages`
[17:59:22] ryankemper: it's just checking the version number to decide the component name? We have the version number as 5/6/7 in the top level elasticsearch class, that might have to take the full version number and pass it along to elasticsearch::packages
[18:00:34] we added the elastic68 components today and moritz pulled elastic 6.8, it should be well isolated but just in case it's related to the failure you see
[18:03:21] dinner
[18:04:31] unrelated to the failure, this was already failing two days ago and also this should be depending on `elastic65` anyway i believe
[18:07:29] huh, i had expected puppet would have a function to parse a version string, return the major component of 6.5 or whatever, but not seeing one. I guess that's why we have $config_version and $version as two variables :)
[18:10:05] that would make sense (in a sad way) :P
[18:16:52] so yeah that's something I'm a bit confused on, in profile-land we can just read the hiera variable but we can't do that in class-land
[18:17:53] ebernhardson: in keeping the probably-obvious questions going, I gather that there's a way to get the full version string in the top-level class? and so you're trying to figure out how to just write the logic to figure out the major component so we won't need hiera to tell us?
[18:20:53] ryankemper: in profile::elasticsearch you have $config_version and $version variables. Currently we pass class { 'elasticsearch': version => $config_version, ...}. I suppose we would need to pass both and use in appropriate places
[18:21:38] ryankemper: the 6.5 one would pass on again from elasticsearch to class { '::elasticsearch::packages': ... }
[18:23:02] oh this is my need to go learn proper puppet fundamentals showing, I wasn't realizing the elasticsearch profile is what spawns the actual class
[18:23:27] ebernhardson: any reason we can't just have `elasticsearch.pp` directly pass on `$apt_component`? And otherwise keep stuff the same
[18:24:24] ryankemper: you could certainly pass apt component instead, it's never particularly clear how concrete of a value should be passed in and is partially a matter of style. However you like :)
[18:24:30] also fwiw I haven't looked into what the `apt::package_from_component` resource actually needs to be handed to it, but I was assuming we'd have to specify the component just like we do in `apt::repository`
[18:25:30] ryankemper: should need component name, and a list of packages to install from there
[18:42:29] ebernhardson: ugly first patch -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/757700
[18:44:00] ebernhardson: oh hey look at this old context I just stumbled across: https://gerrit.wikimedia.org/r/c/operations/puppet/+/565617 lol
[18:45:11] looking
[18:45:36] speaking of which it looks like in that old patch moritzm just had the `apt::package_from_component` in the same `modules/profile/manifests/elasticsearch.pp`, is there a specific reason we wanted to move it to `packages.pp` in my patch?
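A minimal sketch of the version-to-component mapping discussed above (18:07 to 18:25), written in Python rather than Puppet purely for illustration. The `elastic<major><minor>` naming is an assumption inferred from the `elastic65` and `elastic68` components mentioned in the channel; the function names are hypothetical and do not exist in the puppet repo.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Split a full version string such as '6.5.4' into (major, minor)."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def apt_component(version: str, prefix: str = "elastic") -> str:
    """Derive a component name from a full version string.

    Assumption: components follow '<prefix><major><minor>', as the names
    elastic65 and elastic68 in the chat suggest.
    """
    major, minor = major_minor(version)
    return f"{prefix}{major}{minor}"


print(apt_component("6.5.4"))  # elastic65
print(major_minor("6.5.4")[0])  # 6
```

Deriving both values from one string would remove the need to keep two separate variables in sync, at the cost of hiding the component name from hiera.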
[18:46:05] almost a year to the day :)
[18:46:07] lunch, back in ~45m
[18:50:53] * ebernhardson wonders if there is a reasonable tool to look at a puppet catalog as a dependency graph
[18:56:16] yeah man a graphviz type tool would be so useful
[19:01:00] added a couple comments, I'm feeling like I'm wrong about something regarding how puppet does ordering but I can't put my finger on what :S
[19:01:22] I do see that puppet can generate .dot files that then go into graphviz, not sure if it will be what i want but will see if I can get an instance to spit one out and look at it :)
[19:07:27] oh, well yea of course. Puppet needs graphviz installed to generate them and we don't generally have that on the servers. I suppose ideally pcc would generate but that's not for today :)
[19:07:30] hm
[19:31:26] ebernhardson: those .dot graphs tend to be so packed that they are mostly unreadable. But sometimes it still beats the alternatives
[19:32:52] inflatador: pairing: meet.google.com/ckm-dmmh-opt
[19:34:35] omw
[19:40:37] from puppet docs: "Unlike with resources, Puppet does not automatically contain classes when they are declared inside another class (by using the include function or resource-like declaration)."
[19:40:53] and there, i suspect, is why the ordering isn't happening like i expected
[19:42:29] by that definition, Class[apt::repository] was before Class[Elasticsearch], but Class[Elasticsearch::Packages] isn't contained by Class[Elasticsearch], so there is no dependency ordering
[19:50:38] * ebernhardson wonders if the right solution is to make things that should always be contained resources instead of classes, or make sure we `before => Class[]` when creating the other class, or something else...
[19:51:08] ebernhardson: we're talking about this issue in our weekly meet if you want to join: http://meet.google.com/ckm-dmmh-opt
[20:28:46] (╯°□°)╯︵ ┻━┻
[21:55:51] well how fun, i successfully banned myself from wcqs. Works! But somehow I haven't managed to un-ban myself
[21:58:58] Brilliant
[22:05:11] Fixed with liberal application of cache and cookie purging. but i guess now i gotta try again :)
[22:05:25] (just browser cache)
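
On the Puppet containment point raised at 19:40 to 19:50: the sketch below is a toy Python model, not Puppet code, of why an ordering on an outer class does not reach a class that is merely declared inside it. The class `profile::apt_setup` and the `File[...]` resource are hypothetical placeholders; only the repository, package, and `elasticsearch` class names come from the chat. It assumes an ordering between two classes expands to "every contained resource of the first precedes every contained resource of the second".

```python
from itertools import product


def resources_of(cls, contains):
    """Return every resource transitively contained by a class."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for member in contains.get(current, ()):
            if member in contains:  # a contained class: descend into it
                stack.append(member)
            else:  # a plain resource
                found.add(member)
    return found


def ordering_edges(before, after, contains):
    """Expand 'Class[before] -> Class[after]' into resource-level edges."""
    return set(product(resources_of(before, contains),
                       resources_of(after, contains)))


REPO = "Apt::Repository[wikimedia-elastic]"
PKG = "Package[elasticsearch-oss]"

# Declared but not contained: elasticsearch::packages floats free.
declared_only = {
    "profile::apt_setup": [REPO],  # hypothetical class holding the repo
    "elasticsearch": ["File[/etc/elasticsearch]"],  # hypothetical resource
    "elasticsearch::packages": [PKG],
}

# Same catalog, but elasticsearch now contains elasticsearch::packages.
contained = dict(
    declared_only,
    elasticsearch=["File[/etc/elasticsearch]", "elasticsearch::packages"],
)

without = ordering_edges("profile::apt_setup", "elasticsearch", declared_only)
with_contain = ordering_edges("profile::apt_setup", "elasticsearch", contained)

print((REPO, PKG) in without)       # False: the package may run before the repo
print((REPO, PKG) in with_contain)  # True: the repo is ordered before the package
```

In real Puppet the second case corresponds to declaring the inner class with the `contain` function; the alternatives floated in the chat (an explicit `before => Class[...]`, or a defined type instead of a class) would add the missing edge in a different way.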