[09:20:05] elukey: thanks for investigating - that's pretty weird
[09:21:42] I'd say, as this is the second time now that calling update-ca-certificates has surprising side-effects, we should probably stop using it for creating the wmf bundle
[09:23:07] <_joe_> jayme: well in this case, it's just a clash between something you install from the package and what is installed via puppet though
[09:23:18] <_joe_> as in conflicting on the same cert name AIUI
[09:23:51] _joe_: but none of my changes added a certificate to the package. So that must have been conflicting all the time
[09:24:56] my guess is that calling update-ca-certificates in the hook script changes how the certificates in /etc/ssl get linked
[09:25:27] where did the correct /etc/ssl/certs/Puppet_Internal_CA.pem originally come from? if puppet, why did it not fix it to be correct?
[09:28:16] jayme: I added some ideas to the task; I think that the other times (when the new package was rolled out) the /usr/local/share/ca-certificates took precedence over the certs added by wmf-certificates
[09:28:54] this time we used /dev/null as local certs dir (to avoid the big bundle) and in cloud the wmf-certificates' puppet ca cert was linked under /etc/ssl/.. instead
[09:29:18] elukey: yes, that's what I meant
[09:29:20] <_joe_> sigh irccloud
[09:29:54] <_joe_> maybe I should start offering irc bouncers to everyone for a small fee
[09:30:30] jayme: yeah but we shouldn't really deploy wmf-certificates to cloud, it is super easy to forget the use case and cause issues.. But the simple concat of certs in a single file is also fine, we could also ship a .p12 and .jks too :)
[09:31:02] elukey: eheh, nice move :p
[09:31:46] but... I think we're really better off not using update-ca-certificates with all it additionally does
[09:31:58] <_joe_> I disagree.
[09:32:13] can you elaborate? :)
[09:32:16] <_joe_> that's how you're supposed to add system-wide certs to a debian system
[09:32:31] <_joe_> so I'm still unsure about what went on here
[09:32:47] <_joe_> but at least to add certs to the main cert bundle, I would object to using something else
[09:32:54] ah, yes sure! I'm not saying we should no longer use it
[09:33:00] in general
[09:33:16] <_joe_> tbh, the "wmf-certificates" package was never designed to be used in wikimedia cloud
[09:33:16] but for the creation of the wmf-only certificate bundle
[09:33:26] yes --^
[09:33:33] <_joe_> in fact, it's completely useless there
[09:33:50] <_joe_> and borderline harmful
[09:34:27] <_joe_> maybe I should've called it wmf-ONLY-INSTALL-IN-PRODUCTION-OR-PRODUCTION-DOCKER-IMAGES--certificates
[09:34:32] so basically replace the code I added as a hook with something (cat) that does not call update-ca-certificates again. We would still use the update-ca-certificates flow to add the certs and have them in the global bundle. But not for ours
[09:34:57] <_joe_> that is ok, I don't have a preference for that
[09:35:23] <_joe_> I thought your comment was more broad
[09:36:25] understood :)
[09:39:05] Krinkle: (back from VAC) yes, re: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite the impact is data loss before oct 11 for affected metrics, I've clarified the summary
[09:39:41] <_joe_> majavah: apple search works in staging!
[09:39:50] woo!!!
[09:41:05] godog: hi! we've provisioned new cumin servers in wmcs (cloud-cumin-03 and 04), but they don't seem to be able to reach pontoon hosts that cloud-cumin-01 can. the relevant hiera is present in ops/puppet as far as I can see
[09:43:44] majavah: hello, which hosts?
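The cat-based replacement hook jayme proposes at 09:34:32 (keep the normal update-ca-certificates flow for the global bundle, but build the WMF-only bundle with a plain concatenation) might look roughly like the sketch below. All paths and names here are illustrative assumptions, not the real wmf-certificates layout:

```shell
#!/bin/sh
# Sketch: build a WMF-only certificate bundle by concatenating the
# package's certs with cat, without re-invoking update-ca-certificates
# (which would also re-link everything under /etc/ssl/certs).
set -eu

# build_bundle <src_dir> <bundle_file>: concatenate every .crt file in
# src_dir into a single PEM bundle. Both arguments are hypothetical.
build_bundle() {
    src_dir=$1
    bundle=$2
    mkdir -p "$(dirname "$bundle")"
    cat "$src_dir"/*.crt > "$bundle"
    chmod 0644 "$bundle"
}

# Demo on a scratch directory so the sketch is safe to run anywhere.
demo=$(mktemp -d)
mkdir -p "$demo/certs"
printf '%s\n' '-----FAKE CERT A-----' > "$demo/certs/a.crt"
printf '%s\n' '-----FAKE CERT B-----' > "$demo/certs/b.crt"
build_bundle "$demo/certs" "$demo/bundle/wmf-ca-bundle.pem"
```

System-wide certs would still be added the standard Debian way (dropped where update-ca-certificates picks them up), which is the compromise the discussion lands on; only the WMF-specific bundle skips the hook.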
I suspect they might not have run puppet and therefore didn't pick up the new cumin_masters
[09:48:56] godog: as far as I can see, at least the swift, analytics and monitoring projects
[09:49:28] <_joe_> majavah: also works in eqiad/codfw; I'll add lvs and wire it to the public endpoint later in the day
[09:50:04] (full list of projects with broken instances is {'cloudinfra-nfs': 1, 'analytics': 3, 'bastion': 1, 'openstack': 1, 'commonsarchive': 1, 'deployment-prep': 1, 'gitlab-test': 2, 'testlabs': 1, 'swift': 6, 'mediawiki-vagrant': 1, 'ores-staging': 1, 'osmit': 1, 'puppet-dev': 2, 'pki': 1, 'monitoring': 20, 'auditlogging': 2, 'toolsbeta': 1, 'wikidata-query': 1}, but I don't think the rest use pontoon)
[09:51:55] majavah: thank you, I'll take a look at 'monitoring' now
[10:02:21] majavah: so yeah, pretty sure the problem is puppet runs are failing for unrelated reasons, I'll take a deeper look since it needs fixing anyway
[11:12:00] <_joe_> can someone help me understand how to add a new VIP to netbox? the instructions in https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox are not very clear to me
[11:12:31] _joe_: what's unclear? so we can improve them
[11:12:42] <_joe_> nevermind, I went past the point I could not understand and the answer was right there
[11:13:06] <_joe_> volans: so the instructions 1-2 are misleading in case you're setting up a VIP
[11:13:09] :) feel free to shuffle the info if it makes sense to have that earlier
[11:13:26] <_joe_> yeah I'll try to reword once I've done the whole procedure
[11:22:08] majavah: for the 'monitoring' project the situation should be better now, can you confirm the same?
[11:24:04] _joe_: So, setting up a service in Beta Cluster; IIUC, I need to make a helm chart and a box with `role::beta::docker_services`, and the name of the chart is the input to the `profile::docker::runner::service_defs` config in hiera? Or am I mistaken?
[11:24:28] btw, I was reading https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/ and it looks interesting. I am TILing a bit today. nautobot is a fork of netbox supporting mysql, and dolt is something... new I guess. A commit graph database with MySQL protocol support.
[11:24:46] James_F: you don't need a helm chart. You only need a docker image.
[11:25:02] akosiaris: Yes, I am currently looking at the benefits of nautobot vs netbox.
[11:25:26] akosiaris: Oh, it just pulls from docker-registry itself? That's much easier. :-)
[11:25:28] that is for Beta specifically. But you WILL need the helm chart (hopefully using the same image) for the move to production
[11:25:29] And definitely that feature / dolt backend looks really nice.
[11:25:33] Yeah.
[11:26:02] <_joe_> yeah the idea is you can test your image faster this way in beta :}
[11:27:09] topranks: it does indeed. Heck, even the idea of not having to maintain another postgres db just for netbox is a selling point in my book, never mind the awesome feature of being able to roll back your changes in netbox
[11:28:22] they even have a hub where they upload their databases. https://www.dolthub.com/discover
[11:28:36] clearly very dockerhub/github inspired
[11:28:48] heh, hadn't seen that. nice!
[11:29:08] I guess the disadvantage is we need to maintain it separately, much like postgres, even if it is more MySQL-like
[11:31:12] maybe we can convince marostegui to manage it for us :)
[11:31:28] dolt? yeah. And that's just scratching the tip of the iceberg. It's totally new software, it's probably a very long way from being "production" ready.
[11:31:30] but akosiaris was always the postgresql go-to person
[11:32:04] marostegui: https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics
[11:32:31] regardless, nautobot supports mysql, so using the existing mysql clusters would be a benefit
[11:32:40] yup
[11:32:41] but it means using nautobot :)
[11:33:34] yup. Which, btw, has a pretty decent migration guide: https://nautobot.readthedocs.io/en/latest/installation/migrating-from-netbox/
[11:34:35] the question is of course whether the cost of the migration will be paid back by not having to maintain a postgres db
[11:35:00] and I am totally unclear on that.
[11:37:35] yeah, and there are more implications than that
[11:38:33] eg. nautobot has a lot of features for network automation that we could maybe leverage. At the cost of locking us into their environment
[11:39:19] on the other hand, having "invented here" tools locks us in as well in different ways, and takes resources to develop
[11:39:51] +1
[11:43:11] <_joe_> XioNoX: so you mean as a replacement for homer?
[11:43:27] or complementing it, yeah
[11:43:43] <_joe_> I would think 4 times before ditching that for something developed externally, probably not with large web operations in mind
[11:44:06] <_joe_> unless we are ready to sink some dev time on the upstream thing too
[11:44:43] yeah of course, we're thinking about all of that right now, nautobot vs. netbox, and the future of network automation
[11:45:15] one issue is that the upstream thing could very well become an open-core kind of thing
[11:45:35] Nautobot certainly wants to be that all-encompassing tool which will configure your devices etc. (i.e. replace homer, for instance.)
[11:45:43] it's open source, but developed mostly by one company
[11:46:11] with no certainty that they will not go into an open-core model at some point
[11:46:37] <_joe_> again, I'm all for not succumbing to the NIH syndrome
[11:46:52] "NIH syndrome" ?
[11:46:54] <_joe_> but also a product of a single vendor without a dev community around it
[11:46:59] <_joe_> Not Invented Here
[11:47:05] gotcha
[11:47:12] yeah agreed.
[11:47:19] <_joe_> that's not much better tbh :)
[11:48:00] <_joe_> then I never used homer, maybe it's bad; but in some cases a bad tool tailored to your needs is better than an above-average one intended for general use
[11:48:28] <_joe_> I mean, given the people involved in writing homer I'd expect it to be good but annoying
[11:48:46] Homer's fine I think, and volans may correct me, but I don't think it's in need of a lot of ongoing maintenance.
[11:48:57] _joe_: please type the square root of the number of targeted devices
[11:49:07] <_joe_> does it require you to make some octet calculations before committing a change? or maybe integrate a simple function
[11:49:13] <_joe_> XioNoX: ahahahah yeah exactly
[11:49:17] <_joe_> volans: <4
[11:49:19] But say if we moved from Juniper, there would probably be a lot of work to change most likely.
[11:49:23] <_joe_> err, off by one
[11:49:27] lol
[11:49:38] <_joe_> topranks: ack definitely
[11:49:41] * topranks finds himself unable to pass the increasingly difficult captchas.
[11:49:49] there are a lot of improvements that we could/should do with Homer
[11:53:16] what kind of things are you thinking of?
[11:58:05] topranks: I'll try to finish my first draft today, but outages delayed me
[11:58:19] but it's based on that list of pain-points I shared some time ago
[11:58:27] (and wishlist)
[11:59:03] I must dig it up, yep... probably makes more sense to me now I know my way around homer.
[12:35:58] godog: monitoring now has 15 instances reachable, still 7 that are not (thanos, puppetdb, log, conf, kafka, cumin, icinga)
[12:48:37] majavah: ack thanks, I'll take a look at those too
[13:33:14] fyi, we can soon repool codfw
[13:34:30] \o/
[13:35:14] going to monitor codfw row B for a bit first
[13:35:21] lvs2007 is back in service
[14:13:09] majavah: ok, I think I fixed all of the above
[14:13:58] godog: pontoon-conf-01 is still broken, otherwise everything in monitoring works
[14:21:31] majavah: ack, fixing
[14:26:36] majavah: ok, pontoon-conf-01 fixed too, thank you for the heads up and the patience
[14:27:23] indeed: 100.0% (22/22) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[14:27:35] (that's for the monitoring project)
[14:29:29] majavah: sweet, what's the cumin command you used there? I'd like to check the 'swift' project too
[14:29:46] so we don't have to do the ping/pong here, that is
[14:29:46] taavi@cloud-cumin-03:~$ sudo cumin "O{project:monitoring}" "true"
[14:30:32] thanks
[15:41:41] jayme: elukey: should I be worried that I'm seeing "/etc/ca-certificates/update.d/jks-keystore: 56: [: amd64: unexpected operator" during wmf-certificates upgrade on deployment-deploy01
[15:42:46] * jayme is glad that this is from elukey's hook
[15:43:20] do I have a hook?? :D
[15:44:04] oh, you don't... I thought jks-keystore was a product of yours... but it's not
[15:44:11] sorry to have summoned you :)
[15:44:20] always blaming me :D
[15:45:06] majavah: I'll try to reproduce, give me a minute
[15:46:28] this is very weird, the error happens in a ca-certificates hook?
[15:46:34] majavah: I don't see that on an-test-client1001 for example
[15:46:43] better - ca-certificates-java
[15:47:20] jayme: https://phabricator.wikimedia.org/P17796
[15:47:44] thanks
[15:47:47] weird
[15:48:05] I didn't see it on deployment-mediawiki11, so maybe a stretch-specific issue
[15:48:40] so on an-test-client we have ii ca-certificates-java 20190405, on deployment-deploy01 ii ca-certificates-java 20170929~deb9u1
[15:48:52] yeah, was about to ask what's in line 56 - as for me it's a "dpkg -L" query (on buster)
[15:48:55] so the hook is different
[15:49:50] on deploy01 it does if [ "$arch" == "armhf" ]; then
[15:52:47] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922720
[15:53:10] majavah: I see from cache policy that a new version of ca-certificates-java is available on deploy01, shall we try it?
[15:53:19] sure!
[15:53:55] I'll let you upgrade, lemme know if it improves the situation
[15:54:56] looks better, I don't see any warnings like last time
[15:55:13] did something change with github today? i got a bunch of random notifications about old changes, about 8 random ones from the last 2-3 months. not a big deal but weird
[15:56:21] majavah: great :) Do you see the puppet_ca issue fixed as well?
[15:56:45] /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
[15:56:50] \o/
[15:57:24] let's hope that was the last weird thing this introduced :-|
[15:57:33] root@deployment-deploy01:~# openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -text -noout | grep CN
[15:57:33] Issuer: CN = Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs
[15:57:45] goood
[15:57:59] great, thanks for checking!
[16:08:49] topranks: the dolt blog post on nautobot makes it sound like fixing support for MySQL was a relatively easy first step for them. Perhaps easy enough to upstream to netbox (for them/upstream to do, or for us to do). Then migrating within netbox might be an option. Less drastic of a change.
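The `[: amd64: unexpected operator` error diagnosed above is the bashism from Debian bug #922720: the old ca-certificates-java hook used bash's `==` inside `[ ]`, which dash (Debian's `/bin/sh`) rejects as a syntax error and treats as false. A minimal illustration of the failure mode and the POSIX-compliant form (this snippet is a sketch, not the actual hook):

```shell
#!/bin/sh
# POSIX test(1) only defines "=" for string comparison. Bash's [ also
# accepts "==", but dash's does not: under dash,
#   [ "$arch" == "armhf" ]
# prints "[: amd64: unexpected operator" and evaluates as false -
# exactly the warning seen during the wmf-certificates upgrade.

arch=amd64

# POSIX-compliant comparison, behaves identically in dash and bash:
if [ "$arch" = "armhf" ]; then
    echo "armhf"
else
    echo "not armhf: $arch"
fi
```

This matches the observation in the log that the newer hook on buster (ca-certificates-java 20190405) no longer contains the `==` test at that line.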
[16:12:55] Yeah, that could be an option. Django supports both, so in theory I guess it’s not impossible.
[16:13:33] elukey: jayme: deployment-prep seems to be working again so I'll close the tasks, thanks!
[16:13:34] Looking at the netbox git, it’s not been asked, so they may not want to dedicate time to it. But they might be happy to include it if we worked on it
[16:13:43] majavah: thanks!
[16:18:09] majavah: thanks!
[16:18:59] majavah: fixed new cloud-cumin host access for the 'swift' project too
[16:19:12] great, thanks
[16:20:43] sorry for causing trouble :/
[18:33:12] * jbond only just noticed RIPE83 started today
[18:48:19] * topranks had also missed that point and thanks j.bond for the heads up :)