[06:50:40] <moritzm>	 I'll reimage bast1003 in a few minutes
[09:56:59] <jelto>	 GitLab needs a short maintenance at 11:00 UTC
[10:09:24] <kamila_>	 whoever's making the ospf-related change in homer in eqiad, feel free to commit my leftover wikikube-ctrl1001 change whenever, sorry for the mess 
[10:10:15] <kamila_>	 (or, alternatively, may I commit the ospf-related change? :D)
[10:18:20] <topranks>	 kamila_: em let me have a look 
[10:18:39] <topranks>	 apologies, you folk should not be hitting ospf related changes not sure what's happening there 
[10:35:10] <kamila_>	 Thanks topranks 
[10:39:37] <topranks>	 kamila_: I've pushed your changes now, and BGP to wikikube-ctrl1001 is working :) 
[10:40:04] <topranks>	 the issue was a new circuit we are in the process of delivering, it had been set to "active" in Netbox rather than "planned", which made the router try to set up OSPF for it 
[10:40:08] <topranks>	 set back to planned for now 
[10:44:38] <kamila_>	 Thanks! 
[11:05:45] <jelto>	 GitLab maintenance finished
[11:36:09] <pmiazga>	 Hello everyone. I have a request to review and merge a very simple puppet change that affects only beta cluster: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039822 - I already tested it by cherrypicking it on puppetserver1 and running puppet-agent on apache host.
[11:36:52] <claime>	 pmiazga: looking
[11:38:06] <claime>	 I'll wait for PCC to finish and merge afterwards
[11:42:30] <pmiazga>	 thank you
[11:59:54] <jayme>	 dcausse: I guess it's fine to just deploy things that use flink-app? (re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1039727) Or do we need to take any actions before?
[12:01:38] <jayme>	 I'm aiming to deploy the change as is (basically a noop) and then enable rate limiting to mw-api-int in a second step (for cirrus-streaming-updater)
[12:03:24] <claime>	 pmiazga: PCC is taking this long because there's no Host: line in the commit messge, so it's testing for one of every type of hosts in prod
[12:03:32] <claime>	 Hosts*
[12:04:02] <claime>	 That means it's got almost 600 nodes to test
[12:05:11] <claime>	 I think I'll just merge your change, it's not going to affect prod anyways
[12:21:30] <dcausse>	 jayme: yes, should be fine, I can do quick deploy on staging after
[12:22:22] <dcausse>	 jayme: for the cirrus-streaming-updater we might some changes to the code IIRC (will check with Peter)
[12:22:26] <jayme>	 dcausse: Thanks. I'd like to do at least one (cirrus maybe) to verify the diff if you don't mind
[12:22:34] <dcausse>	 jayme: sure
[12:22:52] <jayme>	 dcausse: changes to the code for the ratelimit stuff you mean?
[12:22:59] <jayme>	 or just for a chart update?
[12:23:30] <dcausse>	 jayme: yes if rate limiting is enabled we might not yet handle that properly in the code
[12:23:55] <jayme>	 dcausse: ah, it's not. The current change (bump in chart version) is just dependency updates really
[12:24:08] <dcausse>	 ok then no worrie
[12:24:11] <dcausse>	 s
[12:24:29] <jayme>	 dcausse: but there is already a flurry of changes I've not introduces (config changes to flink AIUI)
[12:24:50] <jayme>	 resource changes, new image ... :)
[12:25:07] <dcausse>	 oh ok :)
[12:26:10] <jayme>	 but the not flink related things (securityContext, envoy config) look fine to me
[12:27:12] <jayme>	 dcausse: I've also checked the diff for rdf in staging. That is what I'd expect from the chart update
[12:27:51] <jayme>	 I did not apply the change, so you can take a look and deploy all the flink-app things if their changes do look like that
[12:28:22] <dcausse>	 will check
[12:28:35] <claime>	 pmiazga: merged, btw, sorry I forgot to ping you
[12:28:42] <jayme>	 dcausse: thanks! <3
[12:32:09] <dcausse>	 jayme: the new "filter_chains" thing is related to ratelimiting?
[12:33:22] <jayme>	 dcausse: no, they are duplications of existing listeners but on IPv6 interface
[12:33:41] <dcausse>	 oh ok
[12:33:57] <jayme>	 that's a generic improvement we made in the mesh config. The ratelimiting is not enabled for the deployments yet, so those changes do not appear
[12:35:04] <dcausse>	 jayme: reading the diff it seems fine to me
[12:35:50] <jayme>	 nice. Are you going to do all the deploys or should I do it? Can obviously also wait until monday if you feel it's risky
[12:37:14] <dcausse>	 jayme: should not be risky, I can take care of the deploys if you want
[12:37:28] <jayme>	 dcausse: that would be super helpful, thanks!
[12:37:34] <dcausse>	 ok deploying
[12:37:49] <jayme>	 ping me if something goes south :)
[12:38:15] <jayme>	 regarding the ratelimit rest I left a comment in phab, we can wait for Peter on that one I suppose. Is he on vacation?
[12:39:17] <dcausse>	 no he's around but working on the wdqs graph split at the moment
[12:39:39] <jayme>	 okay. Then I'll guess he'll reach out when he has capacity
[12:39:55] <dcausse>	 I've seen patches related to ratelimiting that I need to take a look
[12:39:57] <dcausse>	 sure
[12:40:55] <jayme>	 cool,cool. I've pinged you in the task as well. I think the code changes where attached there roo
[12:40:58] <jayme>	 *too
[12:42:02] <dcausse>	 thanks!
[12:53:03] <pmiazga>	 thank you claime for review and merge. I didn't know about the `Host:` part, next time I'll try to fill it in when doing puppet changes.
[12:54:17] <claime>	 pmiazga: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Gerrit_integration for doc :)
[13:08:15] <dcausse>	 redeployed all search team flink jobs using the new flink-app chart (except cirrus-streaming-staging@staging which needs a small tweak), pinged the owners of the mw-page-content-change-enrich job to do it
[13:09:58] <ottomata>	 TY, asked in CR and slack, but this is probably the better place.  jayme what is the urgency on deploy?  need to happen asap or can we just wait until the next time we deploy?
[14:22:42] <elukey>	 Emperor: o/ I noticed some alerts related to thanos-be (https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=thanos), not sure if already known/wip, dropping a line for awareness
[14:23:24] <Emperor>	 elukey: yeah, see T351927
[14:23:24] <stashbot>	 T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[14:25:09] <elukey>	 okok!
[14:31:07] <jayme>	 ottomata: asap would be preferred. As said it's basically a noop application wise, but the changes are a blocker for the next k8s upgrade (migrating away from Pod Security Policies)
[14:31:37] <jayme>	 dcausse: thank you!
[14:33:14] <Emperor>	 Is there a way to find out the rack of another host in puppet? I can see profile::netbox::host::location for the current host, but if I want to know the location of a different host? It doesn't seem to be in facts...
[14:56:49] <elukey>	 What is the use case?
[14:57:16] <elukey>	 because so far IIRC the only use case in puppet was to know the host's metadata etc..
[15:06:09] <jhathaway>	 Emperor: I think profile::netbox::data would provide the data you need
[15:06:42] <jhathaway>	 you can view its contents on puppetserver1001:/srv/git/netbox-hiera/common.yaml
[15:07:31] <Emperor>	 elukey: ceph can be rack-aware when distributing replicas, so I want my host spec file I feed to cephadm to know the hosts' locations
[15:08:21] <sukhe>	 https://wikitech.wikimedia.org/wiki/Netbox#Puppet
[15:08:53] <jhathaway>	 sukhe: oooh, thanks I didn't know those docs existed
[15:09:18] <sukhe>	 jhathaway: basically what you said but examples
[15:09:23] <sukhe>	 with examples
[15:09:41] <Emperor>	 thanks, yes, that's helpful.
[15:10:12] <sukhe>	 Emperor: depending on the requirement, you can add a function for this as well so that you can just pass a host and it returns the rack
[15:12:34] <Emperor>	 I'll need to turn hostname into management hostname by the looks of things
[15:12:55] <elukey>	 TIL about netbox/puppet thanks!
[15:14:58] <jhathaway>	 Emperor: that is a bit awkward
[15:20:08] <jhathaway>	 Emperor: here is the script that generates the data, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/puppet/sync-netbox-hiera.py
[15:20:28] <jhathaway>	 I assume it would be fairly easy to add hosts in addition to mgmt hosts
[15:20:42] <jhathaway>	 feel free to file a request for us to add
[15:21:13] <sukhe>	 I know it's not nice but would it be incorrect to assume that the mgmt rack and the hots rack are the same? the mgmt rack info is already there
[15:22:05] <Emperor>	 it _should_ just be a case of splitting the hostname on \., inserting mgmt into the obvious place and rejoining...
[15:22:09] <jhathaway>	 I think that assumption is fine in practice
[15:22:21] <XioNoX>	 let's do it the clean way directly :)
[15:22:33] <jhathaway>	 but it seems nice to duplicate the data for hosts, if it is *just* adding another graphql query
[15:23:17] <XioNoX>	 jhathaway: is the rack a facter fact too?
[15:24:51] <jhathaway>	 I don't think it is, https://puppetboard.wikimedia.org/node/mw1365.eqiad.wmnet
[15:25:28] <jhathaway>	 you could read the rack location from hiera, write to disk on node, read as a fact
[15:26:03] <Emperor>	 XioNoX: no, I tried looking there first.
[15:26:12] <XioNoX>	 seems like a long way around :)
[15:26:45] <XioNoX>	 Emperor: also keep in mind how "real time" the data needs to be updated
[15:27:06] <jhathaway>	 I bit awkward for sure, but having the information as a structured fact also seems nice
[15:27:10] <XioNoX>	 like this requires the sync-hiera cookbook to be run (part of the dns one and re-image, etc)
[15:28:19] <Emperor>	 If we were moving a node between racks, it'd be a total pain anyway, this is really so when nodes are added to the ceph cluster they are put into the right place in the CRUSH hierarchy (and so ceph knows about their location for placing replicas)
[15:29:00] <XioNoX>	 cool
[15:30:44] <arturo>	 Emperor: this rings a bell, we also have crush maps like that. I think it was created by dcaro, so maybe sync with him next week
[15:45:51] <ottomata>	 jayme: would it be okay if we tried to do it early next week?  I'd like to get some newer folks on DE experience with deploying k8s, (and deploying flink in k8s too)
[15:46:24] <ottomata>	 jayme: ah you a answered on CR. great thank you.
[15:47:40] <jayme>	 yeah, that works. Thanks
[19:28:19] <mutante>	 if we need new LVS service IPs (like in eqiad 10.2.2.X / something.svc.eqiad.wmnet). How do get one?  just make an oldschool DNS change in the repo and take one that is free?  then sync netbox cookbook?  ask via ticket to get one assigned?
[19:28:40] <mutante>	 edit netbox first?
[19:31:36] <sukhe>	 mutante: so basically, you can assign the IPs on Netbox and put the DNS names and then run the netbox DNS cookbook
[19:32:00] <sukhe>	 then you will need to add the same IPs to the DNS repo (and the PTRs) 
[19:32:25] <sukhe>	 once the service definition then exists in hierdata/service/common.yaml, then the last step will be to add the the DNS discovery records
[19:32:35] <cdanis>	 related, https://phabricator.wikimedia.org/T270071
[19:33:08] <sukhe>	 yes, thanks, this is the reason why it needs to exist in both
[19:33:22] <sukhe>	 mutante: also definitely feel free to assign to Traffic and I am happy to take care of this
[19:33:26] <cdanis>	 and btw mutante documented at https://wikitech.wikimedia.org/wiki/DNS/Netbox
[19:33:28] <mutante>	 thanks for linking that ticket! glad I asked.. this is kind of why :)
[19:33:44] <sukhe>	 https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox
[19:33:45] <mutante>	 alright, thanks both
[19:33:52] <cdanis>	 I actually didn't remember it was on the wiki page
[19:34:01] <cdanis>	 but I searched wikitech for [netbox ipam svc] haha
[19:34:43] <sukhe>	 I think the most amazing thing is we have documentation at all :)
[19:35:09] <mutante>	 :)
[19:38:48] <sukhe>	 side note is that at least the initial documentation is not very clear about this. I promised Empero.r I will update it and I haven't but it's on my list and I will on Monday. so feedback welcome if something is missing.
[19:39:30] <mutante>	 will do:)