[10:16:59] godog: I upgraded routinator in codfw, but one of the eqiad graphs looks weird since then: https://grafana-rw.wikimedia.org/d/UwUa77GZk/rpki?orgId=1&from=now-1h&to=now (see the top right one)
[10:19:18] XioNoX: indeed, looks like both recover after a little bit?
[10:19:35] I'm assuming while data is syncing and/or not already synced?
[10:20:43] for the codfw instance yeah, but it shouldn't impact the eqiad one
[10:24:33] XioNoX: agreed, the prometheus targets seem correct on the prometheus side
[10:24:43] thanks!
[10:24:59] I'll let it sit for a bit, routinator looks healthy
[10:25:32] sure np
[10:27:05] XioNoX: looks like graph 'weirdness' isn't a new problem, if you zoom out say 24h
[10:27:49] indeed!
[10:34:06] Setting "null value" to connected in the visualisation makes it look normal fwiw.
[10:57:36] topranks: good point, yeah probably the graph could be bars or sth similar
[11:05:50] good point indeed, null values as 0 works too, but that will probably just hide the real issue
[11:06:16] I'll try to catch it from /metrics when it happens to see what values are exposed
[11:12:16] <_joe_> jbond: say I need to download the .pem of our puppet CA, do you know of a place where we expose it?
[11:13:00] _joe_: it's in the puppet repo, not sure if we serve it anywhere else
[11:13:02] <_joe_> this is for building a docker image, in CI. I need to add the puppet ca to the cert bundle at build time, and I think it's preferable not to commit it to the git repo
[11:13:35] i can add it to pki.discovery.wmnet/bundles
[11:13:47] <_joe_> yeah I was thinking of that
[11:13:47] I'm curious why the puppet one, which service is this?
[11:13:57] <_joe_> any service right now?
[11:14:06] <_joe_> they're all still using the puppet CA
[11:14:12] not *all* :)
[11:14:17] but almost
[11:14:51] _joe_: does it need to be externally available as well? if so config-master may be a better option
[11:15:02] the ones with cergen are also signed by puppet CA?
[11:15:11] <_joe_> volans: yes
[11:15:16] :/
[11:15:30] <_joe_> volans: why do you think I've been asking for a PKI for 4 years?
[11:15:35] eheheh
[11:15:39] getting there..
[11:15:59] <_joe_> jbond: so, I'm a bit conflicted in that respect, but let's say for now internal-only is ok
[11:16:20] jobo: ack, I'll add it to the pki bundles for now and we can change later if needed
[11:16:25] <_joe_> to clarify, it's not aesthetically pleasing to add this bundle to an image that could be used externally as well
[11:16:44] <_joe_> but we're not exposing any private data
[11:16:58] <_joe_> so 🤷
[11:17:04] ahh yes, doesn't make sense for this to be in an image which is used externally, but thought you may want to support building images from your laptop
[11:17:46] <_joe_> jbond: no my point is, I want to add the pki bundle only to the final image that we only use in production, and is under /restricted/ on the registry
[11:17:53] <_joe_> but that gets built by CI
[11:18:01] ahh ok got it
[11:18:06] <_joe_> I *think* it's built in production, but I'm not 100% sure
[11:18:18] <_joe_> anyways, if you add it to the bundle, that's great
[11:18:25] yes will do
[11:19:08] <_joe_> I mean, ideally we'd just create a debian package with all those bundles
[11:19:22] <_joe_> and the debian package could be installed everywhere and we could check versions too
[11:19:37] <_joe_> 🤔
[11:19:41] _joe_: seems reasonable, i can create a task for that
[11:19:54] <_joe_> jbond: yeah I can work on the package, btw
[11:20:27] <_joe_> anyways, it's lunch time for me :)
[11:21:15] ack, should be done when you're back :)
[11:33:34] _joe_: curl http://pki.discovery.wmnet/bundles/Puppet_Internal_CA.pem.pem should work now
[11:34:34] <_joe_> jbond: thanks, that should be enough for now
[11:34:50] cool
[14:07:47] <_joe_> jbond: so, where can I find the urls of all those bundles?
I'm preparing a debian package
[14:11:04] _joe_: i think you should just need http://pki.discovery.wmnet/bundles/Wikimedia_Internal_Root_CA.pem and http://pki.discovery.wmnet/bundles/Puppet_Internal_CA.pem.pem. the intermediates should be sent by the server, so they shouldn't need to be installed in ca-certificates. i think vgutierrez mentioned installing the intermediates had caused issues before
[14:11:25] <_joe_> ack
[14:11:32] yup
[14:12:03] https://phabricator.wikimedia.org/T271063
[14:12:11] that's a fine example
[14:12:12] thx
[15:18:10] kormat: pc1009 will be the next one to get hammered, yes?
[15:18:13] is it ready for purging?
[15:18:35] Krinkle: yep re: next one. let me put in a downtime, then you can purge
[15:18:43] ack :)
[15:19:46] Krinkle: alright, do your worst
[15:23:55] kormat: running, tee'd as before.
[15:24:04] 👍
[18:39:31] bstorm: I read your wiki diff and it looks great to me. thank you! :)
[18:39:48] Cool :)
[19:10:58] hello, got redirected here from #wikimedia-tech, will try my luck on this channel as well: probably a long shot, but I'm looking for an engineer from wikimedia who was at PromCon 2019 in Munich (at the Google HQ), we briefly chatted about Prometheus and Elasticsearch, but I completely forgot the name... so if you've been at that conference - pm me please, I'll introduce myself
[19:12:36] godog: ^ ?
[19:15:40] nixfloyd: hi! Nowadays SRE has been split into multiple subteams. One of them is called "Observability" and since they are using Prometheus and the ELK stack, I would think it is likely https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering#Observability
[19:16:25] so I can continue the channel forwarding game and recommend #wikimedia-observability
[19:17:20] I pinged godog since he seems most likely to have been in Munich
[19:17:58] might be off for the evening at this point
[19:19:26] nixfloyd: try the talk page of https://www.mediawiki.org/wiki/User:Filippo_Giunchedi maybe.
I think it's likely him
[19:23:27] yeah he was there.
[19:23:45] but he's on eu time is all
[19:29:50] so I'm running the wdqs `data-transfer` cookbook to transfer a file from `wdqs1009`->`wdqs1010`. Normally, the cookbook (with puppet disabled) will do a `systemctl start wdqs-updater.service` at the end of the cookbook run
[19:29:52] In this case I *don't* want the updater to start because I need this file to remain unmutated, and there's no convenient flag in our cookbook to prevent that
[19:30:05] My current approach is to mutate the systemd unit file and change the `ExecStart` to a no-op, which - given puppet will be disabled so this unit file shouldn't get overwritten - should prevent the actual updater process from starting. Does that sound right?
[19:30:16] I can't think of any issues with that approach but just looking for a quick sanity check :)
[19:31:31] I don't know about all the rest, but don't forget to do "systemctl daemon-reload" after manually editing a systemd unit file
[19:31:33] or it won't pick up the change
[19:31:33] wanna add the flag to the cookbook? if it's a use case that might happen, it shouldn't take more than a few minutes
[19:32:06] volans: I considered it, but this is a very one-off case so it's hard to envision it cropping up again
[19:32:42] the tl;dr is our test host `wdqs1009` is a snowflake running the latest streaming updater, so its journal file is different from the others, and I'm transferring to our other test host `wdqs1010` while I re-image `wdqs1009`
[19:32:49] ok, then I might have another suggestion, very much against all the rules (mine included ;) )
[19:33:04] let's hear it :P
[19:33:22] you're the only one running the wdqs data-transfer cookbooks AFAICT
[19:33:55] ah, so just mutate the cookbook itself on `cumin1001`?
[19:34:08] something like that, but I was then thinking... how long will it run?
[19:34:27] ~1 hour or so
[19:34:32] because if it's days it would be a problem, because it will stop puppet updating the repo because of local changes
[19:34:35] ahhh ok then
[19:34:38] go ahead
[19:34:46] if you want I can double check the diff
[19:35:14] ryankemper: in /srv/deployment/spicerack/cookbooks/sre/wdqs/
[19:36:23] just make sure to do a git checkout data-transfer.py afterwards
[19:37:23] volans: and just to be clear, puppet won't overwrite it on each run? or is it that it will, but as long as I run the cookbook right away then it'll be in RAM and so won't matter after that point
[19:37:26] changing line 29 should be enough from a first look, you just have to stop it yourself because it will not stop it
[19:37:36] both :)
[19:37:49] puppet will try to do a git pull that will fail with local modifications
[19:38:01] and also if you're already running it, it will not be affected
[19:40:20] volans: check the diff now, I edited a different line so that we won't impact the stopping of services, only the start
[19:40:41] but it won't start any of them, is that ok for you?
[19:41:20] diff looks good for that
[19:41:24] Yeah that's fine, neither blazegraph nor updater need to be running
[19:41:27] ack
[19:41:29] +1
[19:41:38] I technically want them running on 1009 but there's no time criticality so I figure I'll just do that part manually
[19:42:28] ack
[19:42:38] wait a sec
[19:42:51] will the wait_for_updater() wait/fail?
[19:43:18] I guess you run it without the lvs option, so there will be no 'pool' at the end, correct?
[19:43:28] ryankemper: ^^
[19:43:44] volans: no pool at end, correct
[19:43:54] I would expect wait_for_updater to hang forever
[19:44:01] I didn't look at the logic for it, that's just going off the name
[19:45:17] @retry(tries=1000, delay=timedelta(minutes=10), backoff_mode='constant')
[19:45:45] so it will try for quite a bit, but being the last command you can also just ctrl+c once stuck there
[19:45:49] Close enough to forever :P
[19:46:10] Yeah exactly, that's the plan
[19:47:05] I already kicked off the cookbook and restored the state of the git repo btw, so everything should be as normal as far as the repo's concerned
[19:47:40] great, thanks
[19:53:42] mutante, apergos: thanks!
[19:57:02] you're quite welcome
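For context on the `@retry(tries=1000, delay=timedelta(minutes=10), backoff_mode='constant')` line quoted above: a minimal sketch of what constant-backoff retry semantics look like, and why those parameters amount to "close enough to forever". This is an illustrative stand-in, not the actual spicerack implementation; the `sleep` parameter and all names here are assumptions for demonstration only.

```python
# Illustrative sketch of a constant-backoff retry decorator, loosely modelled
# on the spicerack @retry call quoted in the log above. NOT the real
# spicerack code; the `sleep` hook exists only to make the sketch testable.
import functools
import time
from datetime import timedelta


def retry(tries=3, delay=timedelta(seconds=3), backoff_mode='constant', sleep=time.sleep):
    """Retry the wrapped callable up to `tries` times, sleeping `delay` in between."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: propagate the last failure
                    # 'constant' backoff: wait the same delay before every retry
                    sleep(delay.total_seconds())
        return wrapper
    return decorator


# With tries=1000 and a constant 10-minute delay, the worst case is roughly
# 999 sleeps of 10 minutes each before giving up -- about a week of waiting,
# hence ctrl+c once wait_for_updater() is visibly stuck is the pragmatic exit.
worst_case = timedelta(minutes=10) * 999
print(worst_case.days)  # → 6 (full days, plus change)
```

The decorator raises only after the final attempt fails, which matches the observed behaviour that the cookbook's last step would sit there retrying rather than erroring out promptly.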