[09:20:05] <jayme>	 elukey: thanks for investigating - that's pretty weird
[09:21:42] <jayme>	 I'd say, as this is the second time now that calling update-ca-certificates has surprising side-effects, we should probably stop using it for creating the wmf bundle
[09:23:07] <_joe_>	 jayme: well in this case, it's just a clash between something you install from the package and what is installed via puppet though
[09:23:18] <_joe_>	 as in conflicting on the same cert name AIUI
[09:23:51] <jayme>	 _joe_: but none of my changes added a certificate to the package. So that must have been conflicting all the time
[09:24:56] <jayme>	 my guess is that calling update-ca-certificates in the hook script changes how the certificates in /etc/ssl get linked
[09:25:27] <majavah>	 where did the correct /etc/ssl/certs/Puppet_Internal_CA.pem originally come from? if puppet, why did it not fix it to be correct?
[09:28:16] <elukey>	 jayme: I added to the the task some ideas, I think that the other times (when the new package was rolled out) the /usr/local/share/ca-certificates took precedence over the certs added by wmf-certificates
[09:28:54] <elukey>	 this time we used /dev/null as local certs dir (to avoid the big bundle) and in cloud the wmf-certificates' puppet ca cert was linked under /etc/ssl/.. instead
[09:29:18] <jayme>	 elukey: yes, that's what I meant
[09:29:20] <_joe_>	 sigh irccloud
[09:29:54] <_joe_>	 maybe I should start offering irc bouncers to everyonefor a small fee
[09:30:30] <elukey>	 jayme: yeah but we shouldn't really deploy wmf-certificates to cloud, it is super easy to forget the use case and cause issues.. But the simple concat of certs in a single file is also fine, we could also ship a .p12 and .jks too :)
[09:31:02] <jayme>	 elukey: eheh, nice move :p
[09:31:46] <jayme>	 but...I think we're really better of not using update-ca-certificates with all it additionally does
[09:31:58] <_joe_>	 I disagree.
[09:32:13] <jayme>	 can you elaborate? :)
[09:32:16] <_joe_>	 that's how you're supposed to add system-wide certs to a debian system
[09:32:31] <_joe_>	 so I'm still unsure about what went on here
[09:32:47] <_joe_>	 but at least to add certs to the main cert bundle, I would object using something else
[09:32:54] <jayme>	 ah, yes sure! I'm not saying we should no longer use it
[09:33:00] <jayme>	 in general
[09:33:16] <_joe_>	 tbh, the "wmf-certificates" package was never designed to be used in wikimedia cloud
[09:33:16] <jayme>	 but for the creation of the wmf only certificate bundle
[09:33:26] <elukey>	 yes --^
[09:33:33] <_joe_>	 in fact, it's completely useless there
[09:33:50] <_joe_>	 and borderline harmful
[09:34:27] <_joe_>	 maybe I should've called it wmf-ONLY-INSTALL-IN-PRODUCTION-OR-PRODUCTION-DOCKER-IMAGES--certificates
[09:34:32] <jayme>	 so basically replace the code I added as a hook with something (cat) that does not call update-ca-certificates again. We would still use the update-ca-certificates flow to add the certs and have them in the global bundle. But not for ours
[09:34:57] <_joe_>	 that is ok, I don't have a preference for that
[09:35:23] <_joe_>	 I thought your comment was more broad
[09:36:25] <jayme>	 understood :)
[09:39:05] <godog>	 Krinkle: (back from VAC) yes re: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-29_graphite the impact is data loss before oct 11 for affected metrics, I've clarified the summary
[09:39:41] <_joe_>	 majavah: apple search works in staging!
[09:39:50] <majavah>	 woo!!!
[09:41:05] <majavah>	 godog: hi! we've provisioned new cumin servers in wmcs (cloud-cumin-03 and 04), but they don't seem to be able to reach pontoon hosts that cloud-cumin-01 can. the relevant hiera is present in ops/puppet as far as I can see
[09:43:44] <godog>	 majavah: hello, which hosts ? I suspect they might have not run puppet therefore didn't pick up the new cumin_masters
[09:48:56] <majavah>	 godog: as far as I can see, at least the swift, analytics and monitoring projects
[09:49:28] <_joe_>	 majavah: also works in eqiad/codfw; I'll add lvs and wire it to the public endpoint later in the day
[09:50:04] <majavah>	 (full list of projects with broken instances is {'cloudinfra-nfs': 1, 'analytics': 3, 'bastion': 1, 'openstack': 1, 'commonsarchive': 1, 'deployment-prep': 1, 'gitlab-test': 2, 'testlabs': 1, 'swift': 6, 'mediawiki-vagrant': 1, 'ores-staging': 1, 'osmit': 1, 'puppet-dev': 2, 'pki': 1, 'monitoring': 20, 'auditlogging': 2, 'toolsbeta': 1,
[09:50:04] <majavah>	 'wikidata-query': 1}, but I don't think the rest use pontoon)
[09:51:55] <godog>	 majavah: thank you, I'll take a look at 'monitoring' now
[10:02:21] <godog>	 majavah: so yeah pretty sure the problem is puppet runs are failing for unrelated reasons, I'll take a deeper look since it needs fixing anyways
[11:12:00] <_joe_>	 can someone help me understand how to add a new VIP to netbox? the instructions in https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox are not very clear to me
[11:12:31] <volans>	 _joe_: what's unclear? so we can improve them
[11:12:42] <_joe_>	 nevermind, I went past the point I could not understand and the answer was right there
[11:13:06] <_joe_>	 volans: so the instructions 1-2 are misleading in case you're setting up a VIP
[11:13:09] <volans>	 :) feel free to shuffle the info if it makes sense to have that earlier
[11:13:26] <_joe_>	 yeah I'll try to reword once I've done the whole procedure
[11:22:08] <godog>	 majavah: for 'monitoring' project the situation should be better now, can you confirm the same?
[11:24:04] <James_F>	 _joe_: So, setting up a service in Beta Cluster; IIUC, I need to make a helm chart and a box with `role::beta::docker_services`, and the name of the chart is the input to the `profile::docker::runner::service_defs` config in hiera? Or am I mistaken?
[11:24:28] <akosiaris>	 btw, I was reading https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/ and it looks interesting. I am TILing a bit today. nautobot is a fork of netbox supporting mysql and dolt is something...new I guess. A commit graph database with MySQL protocol support.
[11:24:46] <akosiaris>	 James_F: you don't need a helm chart. You only need a docker image.
[11:25:02] <topranks>	 akosiaris:  Yes I am currently looking at the benefits of nautobot vs netbox.
[11:25:26] <James_F>	 akosiaris: Oh, it just pulls from docker-registry itself? That's much easier. :-) 
[11:25:28] <akosiaris>	 that is for Beta specifically. But you WILL need the helm chart (hopefully using the same image) for the move to production
[11:25:29] <topranks>	 And definitely that feature / dolt backend looks really nice.
[11:25:33] <James_F>	 Yeah.
[11:26:02] <_joe_>	 yeah the idea is you can test your image faster this way in beta :}
[11:27:09] <akosiaris>	 topranks: it does indeed. Heck even the idea of not having to maintain another postgres db just for netbox is a selling point in my book, never even mind the awesome feature of being able to rollback your changes in netbox
[11:28:22] <akosiaris>	 they even have a hub where they upload their databases. https://www.dolthub.com/discover
[11:28:24] <akosiaris>	 c
[11:28:36] <akosiaris>	 clearly very dockerhub, github inspired
[11:28:48] <topranks>	 heh hadn't seen that.  nice!
[11:29:08] <topranks>	 I guess the disadvantage is we need to maintain it separate, much like postgress, even if it is more MySQL-like
[11:31:12] <XioNoX>	 maybe we can convince marostegui to manage it for us :)
[11:31:28] <akosiaris>	 dolt? yeah. And that's just scratching the tip of the iceberg. It's a totally new software, it's probably a very long way from being "production" ready.
[11:31:30] <marostegui>	 but akosiaris was always the postresql go-to person
[11:32:04] <akosiaris>	 marostegui: https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics
[11:32:31] <XioNoX>	 regardless, nautobot supports mysql, so using the existing mysql clusters would be a benefit
[11:32:40] <akosiaris>	 yup
[11:32:41] <XioNoX>	 but it means using nautobot :)
[11:33:34] <akosiaris>	 yup. Which btw what's pretty decent migration guide. https://nautobot.readthedocs.io/en/latest/installation/migrating-from-netbox/
[11:34:35] <akosiaris>	 the question is of course whether the cost of the migration will be payed back by not having to maintain a postgres db
[11:35:00] <akosiaris>	 and I am totally unclear on that. 
[11:37:35] <XioNoX>	 yeah, and there are more implication than that
[11:38:33] <XioNoX>	 eg. nautobot has a lot of features for network automation, that we could maybe leverage. At the cost of locking us in their environment
[11:39:19] <XioNoX>	 on the other hand having "invented here" tools locks us in as well in a different ways and takes ressources to develop
[11:39:51] <akosiaris>	 +1 
[11:43:11] <_joe_>	 XioNoX: so you mean as a replacement for homer?
[11:43:27] <XioNoX>	 or completing, yeah
[11:43:43] <_joe_>	 I would think 4 times before ditching that for something developed externally probably not with large web operations in mind
[11:44:06] <_joe_>	 unless we are ready to sink some dev time on the upstream thing too
[11:44:43] <XioNoX>	 yeah of course, we're thinking about all of that right now, nautobot vs. netbox, and future of network automation
[11:45:15] <XioNoX>	 one issue is that the upstream thing could very well become open-core kind of thing
[11:45:35] <topranks>	 Nautobot certainly wants to be that all-encompasing tool, which will configure your devices etc. (i.e. replace homer for intsance.)
[11:45:43] <XioNoX>	 it's open source, but developed mostly by one company
[11:46:11] <volans>	 with no certainty that they will not go into an open core model at some point
[11:46:37] <_joe_>	 again, I'm all for not succumbing to the NIH syndrome
[11:46:52] <topranks>	 "NIH syndrome" ?
[11:46:54] <_joe_>	 but also a product of a single vendor without a dev community around it
[11:46:59] <_joe_>	 Not Invented Here
[11:47:05] <topranks>	 gotcha
[11:47:12] <topranks>	 yeah agreed.
[11:47:19] <_joe_>	 that's not much better tbh :)
[11:48:00] <_joe_>	 then I never used homer, maybe it's bad; but in some cases a bad tool tailored to your needs is better than an above-average one intended for general use
[11:48:28] <_joe_>	 I mean given the people involved in writing homer I'd expect it to be good but annoying
[11:48:46] <topranks>	 Homer's fine I think, and volans may correct me but I don't think in need of a lot of ongoing maintenance.
[11:48:57] <XioNoX>	 _joe_: please type the square root of the number of targetted devices
[11:49:07] <_joe_>	 does it require you to make some octet calculations before committing a change? or maybe integrate a simple function
[11:49:13] <_joe_>	 XioNoX: ahahahah yeah exactly
[11:49:17] <_joe_>	 volans: <4
[11:49:19] <topranks>	 But say if we moved from Juniper there would probably be a lot of work to change most likely.
[11:49:23] <_joe_>	 err off by one
[11:49:27] <volans>	 lol
[11:49:38] <_joe_>	 topranks: ack definitely
[11:49:41] * topranks finds himself unable to pass the increasingly difficult captchas.
[11:49:49] <XioNoX>	 there are a lot of improvments that we could/should do with Homer
[11:53:16] <topranks>	 what kind of things you thinking of?
[11:58:05] <XioNoX>	 topranks: I'll try to finish my first draft today, but outages delayed me
[11:58:19] <XioNoX>	 but it's based on that list of pain-points I shared some time ago
[11:58:27] <XioNoX>	 (and wishlist)
[11:59:03] <topranks>	 I must dig it up yep... probably make more sense to me now I know my way around homer.
[12:35:58] <majavah>	 godog: monitoring now has 15 instances reachable, still 7 that are not (thanos, puppetdb, log, conf, kafka, cumin, icinga)
[12:48:37] <godog>	 majavah: ack thanks, I'll take a look at those too
[13:33:14] <XioNoX>	 fyo, we can soon repool codfw
[13:33:17] <XioNoX>	 fyi*
[13:34:30] <volans>	 \o/
[13:35:14] <XioNoX>	 going to monitor codfw row B for a bit first
[13:35:21] <XioNoX>	 lvs2007 is back in service
[14:13:09] <godog>	 majavah: ok I think I fixed all of the above
[14:13:58] <majavah>	 godog: pontoon-conf-01 is still broken, otherwise everything in monitoring works
[14:21:31] <godog>	 majavah: ack, fixing
[14:26:36] <godog>	 majavah: ok pontoon-conf-01 fixed too, thank you for the heads up and the patience
[14:27:23] <majavah>	 indeed: 100.0% (22/22) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[14:27:35] <majavah>	 (that's for the monitoring project)
[14:29:29] <godog>	 majavah: sweet, what's the cumin command you used there? I'd like to check 'swift' project too
[14:29:46] <godog>	 so we don't have to do the ping/pong here that is
[14:29:46] <majavah>	 taavi@cloud-cumin-03:~$ sudo cumin "O{project:monitoring}" "true"
[14:30:32] <godog>	 thanks
[15:41:41] <majavah>	 jayme: elukey: should I be worried that I'm seeing |/etc/ca-certificates/update.d/jks-keystore: 56: [: amd64: unexpected operator" during wmf-certificates upgrade on deployment-deploy01
[15:42:46] * jayme is glad that this is from elukey's hook
[15:43:20] <elukey>	 do I have a hook?? :D
[15:44:04] <jayme>	 oh, you don't .. .I thought jks-keystore was a product of your's...but it's not
[15:44:11] <jayme>	 sorry to have summoned you :)
[15:44:20] <elukey>	 always blaming me :D
[15:45:06] <jayme>	 majavah: I'll try to reproduce, give me a minute
[15:46:28] <elukey>	 this is very weird, the error happens in a ca-certificates' hook?
[15:46:34] <jayme>	 majavah: I don't see that on an-test-client1001 for example
[15:46:43] <elukey>	 better - ca-certificates-java
[15:47:20] <majavah>	 jayme: https://phabricator.wikimedia.org/P17796
[15:47:44] <jayme>	 thanks
[15:47:47] <jayme>	 weird
[15:48:05] <majavah>	 I didn't see it on deployment-mediawiki11, so maybe a stretch-specific issue
[15:48:40] <elukey>	 so on an-test-client we have ii  ca-certificates-java 20190405, on deployment-deploy01 ii  ca-certificates-java         20170929~deb9u1
[15:48:52] <jayme>	 yeah, was about to ask what's in line 56 - as for me it's a "dpkg -L" query (on buster)
[15:48:55] <elukey>	 so the hook is different
[15:49:50] <elukey>	 on deploy01 it does  if [ "$arch" == "armhf" ]; then
[15:52:47] <elukey>	 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922720
[15:53:10] <elukey>	 majavah: I see from cache policy that a new version of ca-certificates-java is available on deploy01, shall we try it?
[15:53:19] <majavah>	 sure!
[15:53:55] <elukey>	 I'll let you upgrade, lemme know if it improves the situation
[15:54:56] <majavah>	 looks better, I don't see any warnings like last time
[15:55:13] <jbond>	 did something change with github today.  i got a bunch of random notifications about old changes, about 8 randome from the last 2-3 months.  not a big deal but wierd
[15:56:21] <jayme>	 majavah: great :) Do you see the puppet_ca issue fixed as well?
[15:56:45] <elukey>	 /etc/ssl/certs/Puppet_Internal_CA.pem -> /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
[15:56:50] <elukey>	 \o/
[15:57:24] <jayme>	 let's hope that was the last weird thing this introduced :-|
[15:57:33] <majavah>	 root@deployment-deploy01:~# openssl x509 -in /etc/ssl/certs/Puppet_Internal_CA.pem -text -noout | grep CN
[15:57:33] <majavah>	         Issuer: CN = Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs
[15:57:45] <elukey>	 goood
[15:57:59] <jayme>	 great, thanks for checking!
[16:08:49] <Krinkle>	 topranks: the dolt blog post on nautobot makes it sound like fixing support for MySQL was a relatively easy first step for them. Perhaps easy enough to upstream to netbox (for them/upstream to do, or for us to do). Then migrating within netbox might be an option. Less drastic of a change.
[16:12:55] <topranks>	 Yeah that could be an option.  Django supports both so in theory I guess it’s not impossible. 
[16:13:33] <majavah>	 elukey: jayme: deployment-prep seems to be working again so I'll close the tasks, thanks!
[16:13:34] <topranks>	 Looking at the netbox git it’s not been asked, so they may not want to dedicate time to it.  But might be happy to include if we worked on it
[16:13:43] <elukey>	 majavah: thanks!
[16:18:09] <jayme>	 majavah: thanks!
[16:18:59] <godog>	 majavah: fixed new cloud-cumin host access for 'swift' project too
[16:19:12] <majavah>	 great, thanks
[16:20:43] <jayme>	 sorry for causing trouble :/
[18:33:12] * jbond only just noticed RIPE83 started today 
[18:48:19] * topranks had also missed that point and thanks j.bond for the heads up :)