[05:54:57] <_joe_> legoktm: the diffs only run for the charts IIRC
[07:01:18] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10Reedy) >>! In T271736#7335188, @tstarling wrote: > Reading https://github.com/ruflin/Elastica/issues/1913 , it looks like the way out of that infinite regression is to just use --ignore-platform-req=php, o...
[08:33:35] SyntaxHighlight (pygments) is using Shellbox on group0 wikis now: https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&refresh=30s
[08:34:26] the load is going to be a bit higher than normal while all the new cache keys get populated, but so far looks fine
[08:35:03] nice!
[08:36:45] <_joe_> legoktm: great
[09:55:45] jelto: o/ around for a helm3 question?
[09:57:09] I am trying to deploy revscoring-editquality to ml-serve-eqiad for the first time with helm3, but I get
[09:57:12] User "revscoring-editquality" cannot list resource "secrets" in API group "" in the namespace "revscoring-editquality"
[09:57:32] I have set the helm3 flag as indicated previously of course
[09:58:15] I was checking clusterrole and related bindings but I'm a little confused about the -deploy user
[09:58:47] elukey: I can take a look. From which host are you trying to deploy?
[09:59:08] when I follow https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service do I need to add the new user with the -deploy suffix?
[09:59:23] this may be what I did wrong
[09:59:41] deploy1002, ml-services dir
[09:59:56] (I already synced admin-ng previously for the new namespace etc..)
[10:00:26] but while writing I realized that the kubeconfig points to the 'revscoring-editquality' user, not the -deploy one
[10:00:51] since previously tiller was in charge, so maybe the wikitech page doesn't work for helm3
[10:00:56] or something else :)
[10:03:53] yes I think you are using the "wrong" kubeconfig file. It references /etc/kubernetes/revscoring-editquality-ml-serve-eqiad.config but this is the user "revscoring-editquality". This user only has view permissions. So you need an additional -deploy user in ml-serve-eqiad and change the --kubeconfig to that kubeconfig file
[10:05:05] jelto: ack I suspected something like that - so in theory I just need to add another -deploy user + tokens to all configs, and update the helmfile config?
[10:09:27] elukey: Yes that should help. I'm still not 100% sure where the ml-serve-eqiad users and kubeconfig files come from. I would assume somewhere in private puppet. Do you know where to add the user? Otherwise I can look around in the puppet code and private puppet repo
[10:10:17] jelto: yes yes I am going to add the new user and report back if it works :) thanks!
[10:11:41] Ah I guess the deploy user should be added in hieradata/role/common/ml_k8s/master.yaml ;)
[10:12:33] yesyes otherwise the kube-api will get mad :D
[10:12:52] if it works I can add some details to the wikitech page
[10:15:26] jelto: do we need both users to be created though? (in this case, revscoring-editquality and revscoring-editquality-deploy)
[10:16:04] (probably yes since before we had tiller + regular user, but I want to be sure)
[10:16:31] elukey: yes the idea is that access with kubectl is "read-only". So if you are running kube_env it uses the non-deploy user, and the -deploy user is only used for helm3 deployments
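A minimal sketch of the permission check being discussed above: `kubectl auth can-i` shows what a given kubeconfig's user may do in the namespace. The read-only path is the one quoted in the log; the -deploy path is an assumption based on the naming convention mentioned here.

```bash
# Sketch: compare what the read-only user and the -deploy user are allowed to do.
# The -deploy kubeconfig path below is an assumption based on the naming above.
RO_CFG=/etc/kubernetes/revscoring-editquality-ml-serve-eqiad.config
DEPLOY_CFG=/etc/kubernetes/revscoring-editquality-deploy-ml-serve-eqiad.config
NS=revscoring-editquality

# Expected to print "no": the plain user only has view-style permissions.
kubectl --kubeconfig="$RO_CFG" auth can-i list secrets -n "$NS"

# Expected to print "yes" once the -deploy user and its token are in place.
kubectl --kubeconfig="$DEPLOY_CFG" auth can-i list secrets -n "$NS"
```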
[10:17:24] jelto: perfect, it makes sense
[10:29:48] <_joe_> we might need to work a bit on permissions for the different kubeconfigs :)
[10:31:08] I am checking the diff for deploy1002: https://puppet-compiler.wmflabs.org/compiler1001/31232/deploy1002.eqiad.wmnet/index.html
[10:31:43] the kube_env service list gets the additional -deploy entry as well, is that what we want?
[10:32:05] _joe_: +1 yes I was thinking the same
[10:32:23] <_joe_> elukey: no we do not
[10:33:04] one quick workaround could be to filter out -deploy users from that list
[10:33:42] we currently do __kube_env_services="<%= @all_service_names.uniq.join(" ") %>"
[10:34:31] the name is a little misleading now though
[10:35:27] <_joe_> elukey: let's try to do it correctly
[10:36:20] yes I was reasoning out loud, we deep merge services and tokens, so the -deploy user gets added in afaics
[10:39:42] I can report this back in the task, and then we can proceed from there
[10:40:28] <_joe_> let me try to make a patch
[10:43:26] <_joe_> elukey: uhm wait, why are you adding the "user" to the list of services?
[10:43:38] <_joe_> ok I see, that's quite bad tbh
[10:43:48] <_joe_> yeah we need a refactoring, onto it
[10:43:59] okok thanks!
[10:44:10] <_joe_> elukey: you can merge your patch anyways for now, but it will leave shit behind
[10:46:25] ahahahah nono I want to do things right, not hurry
[10:47:21] *no rush
[10:47:58] I think that I am the first one using helm3 for services so it makes sense that some issues come up :)
[10:52:26] <_joe_> the problem is basically how we do the deep merge
[10:52:57] <_joe_> my idea is to change the data structure for services to accept a list of usernames, and then merge in the tokens for those users
[10:53:08] <_joe_> in a hash that we can use everywhere
[10:53:29] <_joe_> so we don't need an additional "fake" service, just the additional token
[10:55:22] <_joe_> it's puppet, so it's a pain ofc
[13:17:53] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) To keep archives happy between Phabricator/IRC - I tried to deploy the new ml `revscoring-editquality` service and got: ` "revscoring-editquality" cannot list resource "secrets"...
[13:18:12] updated the helm3 task with what we discussed earlier on --^
[13:31:09] folks, if nobody opposes I'd go forward with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/720997, which adds the possibility to add labels to namespace declarations + the NamespaceDefaultLabelName feature gate (worked with Janis on this earlier on)
[13:31:31] it is a no-op for now, there is a follow-up patch for kfserving that will use it
[14:03:36] FYI I'm going to shepherd Petr's eventgate chart changes through...fingers crossed!
[14:05:02] ottomata: o/ can you give me 5 mins to sanity check one thing first?
[14:05:12] elukey: sure, i've already started merging but haven't done any deployments
[14:05:19] ack thanks :)
[14:05:33] i'll finish merging but will wait
[14:09:56] ottomata: green light!
[14:13:42] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Add label kubernetes.io/metadata.name to all namespaces - https://phabricator.wikimedia.org/T290476 (10elukey) @JMeybohm the code is deployed, feel free to take over and turn on NamespaceDefaultLabelName :)
[14:14:17] k ty!
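A small sketch (not from the log) of how one might verify the namespace labelling once NamespaceDefaultLabelName is turned on; the namespace name used in the selector is just an illustrative example.

```bash
# Sketch: with NamespaceDefaultLabelName enabled, the API server sets an
# immutable kubernetes.io/metadata.name label on every namespace.
# Print that label as a column for all namespaces to spot any missing it.
kubectl get namespaces -L kubernetes.io/metadata.name

# Or select a namespace by the label; the namespace name here is just an example.
kubectl get namespace -l kubernetes.io/metadata.name=shellbox-syntaxhighlight
```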
[14:15:25] hmm, interesting, because the resource names are changing, helm can't deploy because of port conflicts
[14:16:18] _joe_: advice? do I have to delete the existing deployment in order to deploy newly named e.g. Service resources that use the same port?
[14:16:54] e.g. in eventgate-logging-external, the Service is currently named eventgate-logging-external-production-tls-service and uses port 4392
[14:17:11] but after Petr's change, this Service has a new name "eventgate-production-tls-service" and uses port 4392
[14:17:12] so
[14:17:19] Invalid value: 4392: provided port is already allocated
[14:48:33] Pchelolo: FYI ^ looks maybe harder to just use the common templates than we thought
[14:51:07] <_joe_> ah damn
[14:51:14] oh...
[14:51:27] <_joe_> so yes, the way to do it would be to 1) depool eventgate from traffic in one dc
[14:51:33] <_joe_> 2) delete the old service
[14:51:35] <_joe_> 3) deploy
[14:51:46] <_joe_> or, you know
[14:51:52] <_joe_> kubectl edit the service itself
[14:52:11] <_joe_> sorry I'm between meetings so I don't have much time, maybe akosiaris ?
[14:52:51] I am in between 2 different meetings too
[14:57:09] for kubectl edit etc.., is there a risk of getting duplicated replica sets?
[14:57:22] (it happened to me in the past, this is why I am asking, curious)
[14:58:13] editing the service sounds risky, lots has changed
[14:58:29] right ok, so DC failover to handle the traffic and delete and redeploy
[14:58:32] ok
[14:59:12] ok i'm going to wait until we have some discussion about https://phabricator.wikimedia.org/T282148#7373078 and label names
[14:59:23] IMO the names are confusing, but it might just be a lack of docs for them
[14:59:47] (i think the eventgate names and labels are better :p (but would rather conform to common_templates))
[16:37:11] 10serviceops, 10Analytics, 10Platform Engineering, 10Wikibase change dispatching scripts to jobs: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (10Michael) Adding #platform_engineering #serviceops and #analytics as this is related to all three teams. I'm aware that...
[18:52:07] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) We tried to deploy this today, but ran into an issue: Since the k8s resources have been renamed, k8s thinks t...
[18:53:42] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) > To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repo...
[18:55:21] 10serviceops, 10Analytics, 10Platform Engineering, 10Wikibase change dispatching scripts to jobs: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (10Ottomata) Data Eng (analytics) is in the process of [[ https://phabricator.wikimedia.org/T282033 | solving on a simila...
[20:05:23] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) Oof right. I've already merged the eventgate chart change, and I think to rollback we'd have to revert and th...
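A rough sketch of the depool/delete/redeploy sequence proposed above, assuming the namespace, old Service name, and port quoted in the log; the depool step and the helmfile directory are left as assumptions rather than exact commands.

```bash
# Sketch of the plan from the log: with the DC depooled, remove the Service
# that still holds the old name and port, then redeploy under the new name.

# 1) Depool eventgate-logging-external in this DC first (via the usual
#    discovery/traffic tooling -- intentionally not spelled out here).

# 2) Delete the Service carrying the old name, which still holds port 4392.
kubectl -n eventgate-logging-external delete service \
    eventgate-logging-external-production-tls-service

# 3) Redeploy so the chart recreates the Service under its new name on the same port.
cd /srv/deployment-charts/helmfile.d/services/eventgate-logging-external  # path is an assumption
helmfile -e eqiad -i apply

# 4) Repool once the renamed Service passes health checks.
```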
[20:06:41] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) why rollback? we just make the same changes to eventstreams before going through the deployment
[20:17:00] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) I'm worried that in the meantime someone will need to make an emergency fix/change to eventgate and won't be a...
[20:21:49] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) oh, yeah. ok. up to you.
[22:42:44] legoktm: I think it's alive! The new egress rules are passing traffic to meta at least, which feels like a huge victory.
[22:44:19] awesome!!
[22:44:59] so what's next? :)
[22:45:47] I was hoping to fully auth, but it looks like I need to be much smarter about how I proxy into the staging cluster to make that work. The leg of exchanging the oauth2 authorization code for a token is failing. I think that is because the hostname is stuck in it on the server side, but I'm not 100% convinced of that yet.
[22:46:50] so I need to either figure out how to make the staging instance see itself as "toolhub.wikimedia.org" or I need to get a new grant for a name I can make it see itself as (like staging.svc.eqiad.wmnet:4011) if I'm going to test that in staging
[22:47:25] but maybe the better next step is actually to set up the real lvs service and do those bits from the prod cluster?
[22:49:10] bd808: we could create a custom TLS cert that includes toolhub.wm.o and use it in the staging cluster, but just doing it in prod will be easier
[22:49:45] *nod* it's not even really about certs, it's about not needing a port number I think
[22:50:03] Ohhh. Not sure how to do that
[22:50:09] so yeah, maybe trying to promote up is the more constructive next step
[22:51:28] I should probably double check the oauth secrets though. I'm still not sure that the 401 from meta is related to the hostname I'm under...
[22:51:44] * bd808 looks at things in /etc on deploy1002
[22:53:26] WIKIMEDIA_OAUTH2_SECRET matches what is stored in my password vault, so that's likely correct
[22:54:13] legoktm: ok, so what order of operations and assistance do I need to get the prod cluster ingress set up?
[22:55:06] Something needs to be deployed to the cluster that'll pass health checks
[22:55:24] And then I or another SRE can deploy the LVS
[22:57:00] ok. I can run `helmfile -e eqiad -i apply` and keep my fingers crossed for the first bit.
[22:58:37] if it doesn't work it should auto-revert after 5 or 10 minutes of working
[22:58:44] eh..waiting
[23:04:19] looks like something isn't quite right, possibly the mariadb user grant. /me looks for the ticket on that to check data
[23:05:57] bd808: in case it's needed, this is brand new and seemed pretty useful: https://upload.wikimedia.org/wikipedia/labs/0/07/Kubernetes_Troubleshooting_WMF.png
[23:07:34] thanks for the link mutante. that does look nice. I'm actually pretty used to kubectl things from Toolforge, but maybe I'll learn a new thing in the flowchart
[23:11:36] I need help from someone to check the grants for the 'toolhub' user on the m5-master db server. The toolhub-main container in the pod is crashing with `(1045, "Access denied for user 'toolhub'@'10.64.66.115' (using password: YES)")`. The same password is in the helmfile secrets for both staging (which worked) and eqiad (which is failing) so pretty sure this is a grant problem.
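A quick sketch of the kind of kubectl triage that surfaces an error like the one above. The container name toolhub-main is from the log; the namespace name and the pod-name placeholder are assumptions for illustration.

```bash
# Sketch: basic triage for a crashing container, assuming the service runs in
# a namespace named "toolhub" (an assumption; adjust to the real namespace).
NS=toolhub

# Find the pod that is crash-looping.
kubectl -n "$NS" get pods

# Events often reveal image, probe, or scheduling problems.
kubectl -n "$NS" describe pod <pod-name>

# The application log is where the MySQL "Access denied" error above shows up;
# --previous is useful when the container has already restarted.
kubectl -n "$NS" logs <pod-name> -c toolhub-main --previous
```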
[23:12:17] I'll look in a minute
[23:15:45] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10bd808) >>! In T271480#7354348, @Marostegui wrote: > Thanks for the update, if you need something from us let me know! Everything worked as expected in the Kubernetes "st...
[23:17:18] legoktm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709877/3/modules/role/templates/mariadb/grants/production-m5.sql.erb -- it's the grants. It looks like it at least needs a TO 'toolhub'@'10.64.66.%' variant.
[23:18:56] it needs a change in that file but then also a ping to DBA to deploy it
[23:19:03] they need to apply it
[23:19:31] *nod*
[23:21:00] $ host 10.64.66.115
[23:21:00] 115.66.64.10.in-addr.arpa domain name pointer kubernetes-pod-10-64-66-115.eqiad.wmnet.
[23:23:05] the eqiad range is 10.64.64.0/21 which means 10.64.64.% through 10.64.71.% I believe
[23:23:48] 128 ; Kubernetes pod records for eqiad
[23:23:48] 129 {% for z in range(64,72) -%}
[23:23:48] 130 {% for i in range(256) -%}
[23:23:48] 131 kubernetes-pod-10-64-{{ z }}-{{ i }} 1H IN A 10.64.{{ z }}.{{ i }}
[23:24:25] hmm.. I wonder if this is the first case where mysql prod grants from this are needed
[23:24:57] mutante: I think yes. I'm pretty sure the mediawiki grants are 10.%
[23:25:46] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Legoktm) Per https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type= the eqiad pod range is 10.64.64.0/21, which would require grants from 10.64.64.% through 10.6...
[23:26:03] bd808: add the GRANT request to https://phabricator.wikimedia.org/T271480 ?
[23:26:15] since he has it already anyways and would have to apply the grants
[23:26:29] I was too slow for lego :)
[23:47:00] 10serviceops, 10Anti-Harassment, 10IP Info, 10SRE, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) fwiw: While looking at this I found we have the email alias maxmind@wikimedia and it forwards to fr-tech@w...
[23:48:34] I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/723329 for anyone following along. I'll give a shout in the dba channel about it. Thanks for the help yet again legoktm and mutante
[23:48:50] :)
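Loosely sketching the grant expansion implied by the 10.64.64.0/21 pod range discussed above, i.e. eight /24-style host patterns. The database name and privilege list are assumptions; the actual change goes through the production-m5 grants template and is applied by a DBA.

```bash
# Sketch: generate one GRANT per /24 inside 10.64.64.0/21 (10.64.64.% .. 10.64.71.%).
# Database name and privilege set are assumptions -- the real change lives in the
# production-m5.sql.erb grants template and is applied by a DBA, not run ad hoc.
for octet in $(seq 64 71); do
  printf "GRANT ALL PRIVILEGES ON toolhub.* TO 'toolhub'@'10.64.%d.%%';\n" "$octet"
done
```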