[05:54:57] <_joe_> legoktm: the diffs only run for the charts IIRC
[07:01:18] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10Reedy) >>! In T271736#7335188, @tstarling wrote: > Reading https://github.com/ruflin/Elastica/issues/1913 , it looks like the way out of that infinite regression is to just use --ignore-platform-req=php, o...
[08:33:35] SyntaxHighlight (pygments) is using Shellbox on group0 wikis now: https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main&refresh=30s
[08:34:26] the load is going to be a bit higher than normal while all the new cache keys get populated, but so far looks fine
[08:35:03] nice!
[08:36:45] <_joe_> legoktm: great
[09:55:45] jelto: o/ around for a helm3 question?
[09:57:09] I am trying to deploy revscoring-editquality to ml-serve-eqiad for the first time with helm3, but I get
[09:57:12] User "revscoring-editquality" cannot list resource "secrets" in API group "" in the namespace "revscoring-editquality"
[09:57:32] I have set the helm3 flag as indicated previously of course
[09:58:15] I was checking clusterrole and related bindings but I'm a little confused about the -deploy user
[09:58:47] elukey: I can take a look. From which host are you trying to deploy?
[09:59:08] when I follow https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service do I need to add the new user with the -deploy suffix?
[09:59:23] this may be what I did wrong
[09:59:41] deploy1002, ml-services dir
[09:59:56] (I already synced admin-ng previously for the new namespace etc..)
[10:00:26] but while writing I realized that the kubeconfig points to the 'revscoring-editquality' user, not the -deploy one
[10:00:51] since previously tiller was in charge, so maybe the wikitech page doesn't work for helm3
[10:00:56] or something else :)
[10:03:53] yes I think you are using the "wrong" kubeconfig file. It references /etc/kubernetes/revscoring-editquality-ml-serve-eqiad.config but this is the user "revscoring-editquality". This user only has view permissions. So you need an additional -deploy user in ml-serve-eqiad and change the --kubeconfig to that kubeconfig file
[10:05:05] jelto: ack I suspected something like that - so in theory I just need to add another -deploy user + tokens to all configs, and update the helmfile config?
[10:09:27] elukey: Yes that should help. I'm still not 100% sure where the ml-serve-eqiad users and kubeconfig files come from. I would assume somewhere in private puppet. Do you know where to add the user? Otherwise I can look around in the puppet code and private puppet repo
[10:10:17] jelto: yes yes I am going to add the new user and report back if it works :) thanks!
[10:11:41] Ah I guess the deploy user should be added in hieradata/role/common/ml_k8s/master.yaml ;)
[10:12:33] yesyes otherwise the kube-api will get mad :D
[10:12:52] if it works I can add some details to the wikitech page
[10:15:26] jelto: do we need both users to be created though? (in this case, revscoring-editquality and revscoring-editquality-deploy)
[10:16:04] (probably yes since before we had tiller + regular user, but I want to be sure)
[10:16:31] elukey: yes the idea is that access with kubectl is "read-only". So if you are running kube_env it uses the non-deploy user, and the -deploy user is only used for helm3 deployments
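A minimal sketch of the permission check being discussed above: `kubectl auth can-i` shows what a given kubeconfig's user may do in the namespace. The read-only path is the one quoted in the log; the -deploy path is an assumption based on the naming convention mentioned here.

```bash
# Sketch: compare what the read-only user and the -deploy user are allowed to do.
# The -deploy kubeconfig path below is an assumption based on the naming above.
RO_CFG=/etc/kubernetes/revscoring-editquality-ml-serve-eqiad.config
DEPLOY_CFG=/etc/kubernetes/revscoring-editquality-deploy-ml-serve-eqiad.config
NS=revscoring-editquality

# Expected to print "no": the plain user only has view-style permissions.
kubectl --kubeconfig="$RO_CFG" auth can-i list secrets -n "$NS"

# Expected to print "yes" once the -deploy user and its token are in place.
kubectl --kubeconfig="$DEPLOY_CFG" auth can-i list secrets -n "$NS"
```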
[10:17:24] jelto: perfect, it makes sense
[10:29:48] <_joe_> we might need to work a bit on permissions for the different kubeconfigs :)
[10:31:08] I am checking the diff for deploy1002: https://puppet-compiler.wmflabs.org/compiler1001/31232/deploy1002.eqiad.wmnet/index.html
[10:31:43] the kube_env service list gets the additional -deploy entry as well, is that what we want?
[10:32:05] _joe_: +1 yes I was thinking the same
[10:32:23] <_joe_> elukey: no we do not
[10:33:04] one quick workaround could be to filter out -deploy users from that list
[10:33:42] we currently do __kube_env_services="<%= @all_service_names.uniq.join(" ") %>"
[10:34:31] the name is a little misleading now though
[10:35:27] <_joe_> elukey: let's try to do it correctly
[10:36:20] yes I was reasoning out loud, we deep merge services and tokens, so the -deploy user gets added in afaics
[10:39:42] I can report this back in the task, and then we can proceed from there
[10:40:28] <_joe_> let me try to make a patch
[10:43:26] <_joe_> elukey: uhm wait, why are you adding the "user" to the list of services?
[10:43:38] <_joe_> ok I see, that's quite bad tbh
[10:43:48] <_joe_> yeah we need a refactoring, onto it
[10:43:59] okok thanks!
[10:44:10] <_joe_> elukey: you can merge your patch anyways for now, but it will leave shit behind
[10:46:25] ahahahah nono I want to do things right, not hurry
[10:47:21] *no rush
[10:47:58] I think that I am the first one using helm3 for services so it makes sense that some issues come up :)
[10:52:26] <_joe_> the problem is basically how we do the deep merge
[10:52:57] <_joe_> my idea is to change the data structure for services to accept a list of usernames, and then merge in the tokens for those users
[10:53:08] <_joe_> in a hash that we can use everywhere
[10:53:29] <_joe_> so we don't need an additional "fake" service, just the additional token
[10:55:22] <_joe_> it's puppet, so it's a pain ofc
[13:17:53] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) To keep archives happy between Phabricator/IRC - I tried to deploy the new ml `revscoring-editquality` service and got: ` "revscoring-editquality" cannot list resource "secrets"...
[13:18:12] updated the helm3 task with what we discussed earlier on --^
[13:31:09] folks, if nobody opposes I'd go forward with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/720997, which adds the possibility to add labels to namespace declarations + the NamespaceDefaultLabelName feature gate (worked with Janis on this earlier on)
[13:31:31] it is a no-op for now, there is a follow-up patch for kfserving that will use it
[14:03:36] FYI I'm going to shepherd Petr's eventgate chart changes through...fingers crossed!
[14:05:02] ottomata: o/ can you give me 5 mins to sanity check one thing first?
[14:05:12] elukey: sure, i've already started merging but haven't done any deployments
[14:05:19] ack thanks :)
[14:05:33] i'll finish merging but will wait
[14:09:56] ottomata: green light!
[14:13:42] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Add label kubernetes.io/metadata.name to all namespaces - https://phabricator.wikimedia.org/T290476 (10elukey) @JMeybohm the code is deployed, feel free to take over and turn on NamespaceDefaultLabelName :)
[14:14:17] k ty!
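A small sketch (not from the log) of how one might verify the namespace labelling once NamespaceDefaultLabelName is turned on; the namespace name used in the selector is just an illustrative example.

```bash
# Sketch: with NamespaceDefaultLabelName enabled, the API server sets an
# immutable kubernetes.io/metadata.name label on every namespace.
# Print that label as a column for all namespaces to spot any missing it.
kubectl get namespaces -L kubernetes.io/metadata.name

# Or select a namespace by the label; the namespace name here is just an example.
kubectl get namespace -l kubernetes.io/metadata.name=shellbox-syntaxhighlight
```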
[14:15:25] hmm, interesting, because the resource names are changing, helm can't deploy because of port conflicts
[14:16:18] _joe_: advice? do I have to delete the existing deployment in order to deploy newly named e.g. Service resources that use the same port?
[14:16:54] e.g. in eventgate-logging-external, the Service is currently named eventgate-logging-external-production-tls-service and uses port 4392
[14:17:11] but after Petr's change, this Service has a new name "eventgate-production-tls-service" and uses port 4392
[14:17:12] so
[14:17:19] Invalid value: 4392: provided port is already allocated
[14:48:33] Pchelolo: FYI ^ looks maybe harder to just use the common templates than we thought
[14:51:07] <_joe_> ah damn
[14:51:14] oh...
[14:51:27] <_joe_> so yes, the way to do it would be to 1) depool eventgate from traffic in one dc
[14:51:33] <_joe_> 2) delete the old service
[14:51:35] <_joe_> 3) deploy
[14:51:46] <_joe_> or, you know
[14:51:52] <_joe_> kubectl edit the service itself
[14:52:11] <_joe_> sorry I'm between meetings so I don't have much time, maybe akosiaris ?
[14:52:51] I am in between 2 different meetings too
[14:57:09] for kubectl edit etc.., is there a risk of getting duplicated replica sets?
[14:57:22] (it happened to me in the past, this is why I am asking, curious)
[14:58:13] editing the service sounds risky, lots has changed
[14:58:29] right ok, so DC failover to handle the traffic and delete and redeploy
[14:58:32] ok
[14:59:12] ok i'm going to wait until we have some discussion about https://phabricator.wikimedia.org/T282148#7373078 and label names
[14:59:23] IMO the names are confusing, but it might just be a lack of docs for them
[14:59:47] (i think the eventgate names and labels are better :p (but would rather conform to common_templates))
[16:37:11] 10serviceops, 10Analytics, 10Platform Engineering, 10Wikibase change dispatching scripts to jobs: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (10Michael) Adding #platform_engineering #serviceops and #analytics as this is related to all three teams. I'm aware that...
[18:52:07] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) We tried to deploy this today, but ran into an issue: Since the k8s resources have been renamed, k8s thinks t...
[18:53:42] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) > To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repo...
[18:55:21] 10serviceops, 10Analytics, 10Platform Engineering, 10Wikibase change dispatching scripts to jobs: Better observability/visualization for jobs - https://phabricator.wikimedia.org/T291620 (10Ottomata) Data Eng (analytics) is in the process of [[ https://phabricator.wikimedia.org/T282033 | solving on a simila...
[20:05:23] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) Oof right. I've already merged the eventgate chart change, and I think to rollback we'd have to revert and th...
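A rough sketch of the depool/delete/redeploy sequence proposed above, assuming the namespace, old Service name, and port quoted in the log; the depool step and the helmfile directory are left as assumptions rather than exact commands.

```bash
# Sketch of the plan from the log: with the DC depooled, remove the Service
# that still holds the old name and port, then redeploy under the new name.

# 1) Depool eventgate-logging-external in this DC first (via the usual
#    discovery/traffic tooling -- intentionally not spelled out here).

# 2) Delete the Service carrying the old name, which still holds port 4392.
kubectl -n eventgate-logging-external delete service \
    eventgate-logging-external-production-tls-service

# 3) Redeploy so the chart recreates the Service under its new name on the same port.
cd /srv/deployment-charts/helmfile.d/services/eventgate-logging-external  # path is an assumption
helmfile -e eqiad -i apply

# 4) Repool once the renamed Service passes health checks.
```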
[20:06:41] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) why rollback? we just make the same changes to eventstreams before going through the deployment
[20:17:00] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Ottomata) I'm worried that in the meantime someone will need to make an emergency fix/change to eventgate and won't be a...
[20:21:49] 10serviceops, 10Analytics, 10Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (10Pchelolo) oh, yeah. ok. up to you.
[22:42:44] legoktm: I think it's alive! The new egress rules are passing traffic to meta at least, which feels like a huge victory.
[22:44:19] awesome!!
[22:44:59] so what's next? :)
[22:45:47] I was hoping to fully auth, but it looks like I need to be much smarter about how I proxy into the staging cluster to make that work. The leg of exchanging the oauth2 authorization code for a token is failing. I think that is because the hostname is stuck in it on the server side, but I'm not 100% convinced of that yet.
[22:46:50] so I need to either figure out how to make the staging instance see itself as "toolhub.wikimedia.org" or I need to get a new grant for a name I can make it see itself as (like staging.svc.eqiad.wmnet:4011) if I'm going to test that in staging
[22:47:25] but maybe the better next step is actually to set up the real lvs service and do those bits from the prod cluster?
[22:49:10] bd808: we could create a custom TLS cert that includes toolhub.wm.o and use it in the staging cluster, but just doing it in prod will be easier
[22:49:45] *nod* it's not even really about certs, it's about not needing a port number I think
[22:50:03] Ohhh. Not sure how to do that
[22:50:09] so yeah, maybe trying to promote up is the more constructive next step
[22:51:28] I should probably double check the oauth secrets though. I'm still not sure that the 401 from meta is related to the hostname I'm under...
[22:51:44] * bd808 looks at things in /etc on deploy1002
[22:53:26] WIKIMEDIA_OAUTH2_SECRET matches what is stored in my password vault, so that's likely correct
[22:54:13] legoktm: ok, so what order of operations and assistance do I need to get the prod cluster ingress set up?
[22:55:06] Something needs to be deployed to the cluster that'll pass health checks
[22:55:24] And then I or another SRE can deploy the LVS
[22:57:00] ok. I can run `helmfile -e eqiad -i apply` and keep my fingers crossed for the first bit.
[22:58:37] if it doesn't work it should auto-revert after 5 or 10 minutes of working
[22:58:44] eh..waiting
[23:04:19] looks like something isn't quite right, possibly the mariadb user grant. /me looks for the ticket on that to check data
[23:05:57] bd808: in case it's needed, this is brand new and seemed pretty useful: https://upload.wikimedia.org/wikipedia/labs/0/07/Kubernetes_Troubleshooting_WMF.png
[23:07:34] thanks for the link mutante. that does look nice. I'm actually pretty used to kubectl things from Toolforge, but maybe I'll learn a new thing in the flowchart
[23:11:36] I need help from someone to check the grants for the 'toolhub' user on the m5-master db server. The toolhub-main container in the pod is crashing with `(1045, "Access denied for user 'toolhub'@'10.64.66.115' (using password: YES)")`. The same password is in the helmfile secrets for both staging (which worked) and eqiad (which is failing) so pretty sure this is a grant problem.
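A quick sketch of the kind of kubectl triage that surfaces an error like the one above. The container name toolhub-main is from the log; the namespace name and the pod-name placeholder are assumptions for illustration.

```bash
# Sketch: basic triage for a crashing container, assuming the service runs in
# a namespace named "toolhub" (an assumption; adjust to the real namespace).
NS=toolhub

# Find the pod that is crash-looping.
kubectl -n "$NS" get pods

# Events often reveal image, probe, or scheduling problems.
kubectl -n "$NS" describe pod <pod-name>

# The application log is where the MySQL "Access denied" error above shows up;
# --previous is useful when the container has already restarted.
kubectl -n "$NS" logs <pod-name> -c toolhub-main --previous
```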
[23:12:17] I'll look in a minute
[23:15:45] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10bd808) >>! In T271480#7354348, @Marostegui wrote: > Thanks for the update, if you need something from us let me know! Everything worked as expected in the Kubernetes "st...
[23:17:18] legoktm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/709877/3/modules/role/templates/mariadb/grants/production-m5.sql.erb -- it's the grants. It looks like it at least needs a TO 'toolhub'@'10.64.66.%' variant.
[23:18:56] it needs a change in that file but then also a ping to DBA to deploy it
[23:19:03] they need to apply it
[23:19:31] *nod*
[23:21:00] $ host 10.64.66.115
[23:21:00] 115.66.64.10.in-addr.arpa domain name pointer kubernetes-pod-10-64-66-115.eqiad.wmnet.
[23:23:05] the eqiad range is 10.64.64.0/21 which means 10.64.64.% through 10.64.71.% I believe
[23:23:48] 128 ; Kubernetes pod records for eqiad
[23:23:48] 129 {% for z in range(64,72) -%}
[23:23:48] 130 {% for i in range(256) -%}
[23:23:48] 131 kubernetes-pod-10-64-{{ z }}-{{ i }} 1H IN A 10.64.{{ z }}.{{ i }}
[23:24:25] hmm.. I wonder if this is the first case where mysql prod grants from this are needed
[23:24:57] mutante: I think yes. I'm pretty sure the mediawiki grants are 10.%
[23:25:46] 10serviceops, 10DBA, 10Toolhub, 10database-backups: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Legoktm) Per https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type= the eqiad pod range is 10.64.64.0/21, which would require grants from 10.64.64.% through 10.6...
[23:26:03] bd808: add the GRANT request to https://phabricator.wikimedia.org/T271480 ?
[23:26:15] since he has it already anyways and would have to apply the grants
[23:26:29] I was too slow for lego :)
[23:47:00] 10serviceops, 10Anti-Harassment, 10IP Info, 10SRE, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) fwiw: While looking at this I found we have the email alias maxmind@wikimedia and it forwards to fr-tech@w...
[23:48:34] I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/723329 for anyone following along. I'll give a shout in the dba channel about it. Thanks for the help yet again legoktm and mutante
[23:48:50] :)
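Loosely sketching the grant expansion implied by the 10.64.64.0/21 pod range discussed above, i.e. eight /24-style host patterns. The database name and privilege list are assumptions; the actual change goes through the production-m5 grants template and is applied by a DBA.

```bash
# Sketch: generate one GRANT per /24 inside 10.64.64.0/21 (10.64.64.% .. 10.64.71.%).
# Database name and privilege set are assumptions -- the real change lives in the
# production-m5.sql.erb grants template and is applied by a DBA, not run ad hoc.
for octet in $(seq 64 71); do
  printf "GRANT ALL PRIVILEGES ON toolhub.* TO 'toolhub'@'10.64.%d.%%';\n" "$octet"
done
```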