[01:44:54] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (tstarling) I hadn't actually reviewed the workboard or even tried to run MW on PHP 8.0 locally before I vented above about running old versions. There is still some development work left to do befo...
[02:53:42] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Reedy) I presume we should be working towards 7.4 in the meantime?
[06:53:55] <_joe_> bd808: for now it's manual by using docker-registryctl delete-tags on deneb; but we will improve it once we have the image catalog stuff in place that will tell us if a specific tag of an image is in use or not.
[06:55:33] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Joe) We've talked about this in the serviceops meeting, and we agreed that we will take on upgrading to php 7.4 next quarter, but only work towards 8.x once we've fully migrated to kubernetes. This...
[06:56:36] hello everybody, if nobody opposes I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/720048 to refactor the tokens/secrets for helm
[06:56:41] helm/k8s
[06:57:01] <_joe_> -2 do not merge :P
[06:57:01] if people need to run helmfile please sync with me first just to coordinate :)
[06:57:21] <_joe_> elukey: you already have the private change ready too?
[06:58:03] _joe_ I disabled puppet on contint*, deploy* and release*, then I am planning to modify the hiera values in private
[06:58:16] then merge the change and run puppet one by one
[06:58:22] starting from codfw nodes first
[06:58:52] but I can modify private and let you inspect before pulling the trigger
[06:59:09] (I'd be happy for an extra pair of eyes)
[07:00:06] <_joe_> sure I can take a look but I also trust you
[07:00:35] ok ok :)
[07:35:40] I am still working on the private change, I inserted some tabs by mistake
[07:35:48] (without committing but saving)
[07:42:07] _joe_ just to be sure, would you mind checking the current git diff HEAD in private?
[07:42:17] <_joe_> sure
[07:43:01] thanks :) should be ok now but I have already risked a PEBCAK
[07:43:54] <_joe_> elukey: +1 it seems
[07:44:12] <_joe_> but yes let's run helmfile -e codfw -i apply on all charts afterwards
[07:45:04] yes yes definitely
[07:45:07] thanks a lot
[07:46:59] proceeding with the public change
[07:54:18] there is a puppet mistake (my bad) with new parent dirs for services not created, adding them to puppet
[07:59:48] the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/722818
[08:00:01] for the new private dirs I kept root:wikidev like the parent one, would that be ok?
[08:04:33] <_joe_> yes
[08:05:04] <_joe_> gave you a +2
[08:05:10] <_joe_> 🚢
[08:05:13] <_joe_> ship it
[08:05:59] thanks <3
[08:18:57] ok so all the tls secrets are now populated under /etc/helmfile-defaults/private/main_services/etc.. on deploy2002
[08:19:11] I forgot to add to the puppet change an entry for "services" related to admin_ng
[08:19:17] will fix now
[08:19:30] <_joe_> why is that needed?
[08:19:38] <_joe_> I mean it wasn't there before
[08:20:13] for kfserving and knative I need some private config to be deployed
[08:20:24] before that it was added to the services stuff
[08:21:45] <_joe_> oh ok
[08:22:00] <_joe_> go on do your thing :P
[08:24:27] I am trying to understand where to add it, forgot about this use case in the various changes
[08:34:02] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[08:34:24] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) Thank you @ssastry, I updated the task descr to include them
[08:36:53] _joe_ the only thing that seems to work is https://gerrit.wikimedia.org/r/c/operations/puppet/+/722824/2/hieradata/role/common/deployment_server.yaml, otherwise I can add a separate hiera config for admin_ng (to keep things more clear)
[08:37:19] <_joe_> elukey: that's why I added that comment to the task yesterday
[08:37:42] <_joe_> and yes, I think it should be separated
[08:37:57] okok makes sense
[08:37:57] <_joe_> also because this way you get private/admin_ng_services
[08:38:14] <_joe_> also please that _ng suffix 😢
[08:40:13] so to unblock the current maintenance I can leave things as they are, and change my current helmfile code review to use the old path for admin_ng. In this way I'll deploy the new service configs and I'll study in more detail how to properly add admin_ng
[08:40:18] does it sound good?
[08:42:53] <_joe_> sorry, what old path?
[08:42:58] <_joe_> but yes, please do
[08:44:12] _joe_ basically in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/722276 I'll change the admin configs to point to what they are using now
[08:44:42] (all the other ones except kfserving and knative use the wrong path anyway, but they don't need secrets afaics)
[08:49:32] <_joe_> ok now I get the source of our confusion
[08:49:47] <_joe_> $ ssh deploy1002.eqiad.wmnet ls /etc/helmfile-defaults/private/admin/
[08:49:50] <_joe_> ls: cannot access '/etc/helmfile-defaults/private/admin/': No such file or directory
[08:49:53] yeah :D
[08:49:54] <_joe_> that directory has never existed
[08:50:05] <_joe_> that's why I wasn't worried by it at all
[08:53:04] all right merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/722276 and then trying helmfile diff
[08:56:54] (running puppet on the other disabled nodes first)
[09:10:53] so I went on deploy1002:/srv/deployment-charts/helmfile.d/services/sessionstore and other dirs, tried helmfile -e eqiad diff and it shows a no-op
[09:10:59] same thing for the admin_ng dir
[09:11:31] if others want to double check I'd be grateful :)
[09:12:20] I'll leave the old private service configs there for a day, just in case, and remove them tomorrow
[09:18:30] <_joe_> uhm actually you should try to move them before testing I guess :)
[09:19:12] the new helmfiles don't really refer to them though
[09:19:30] but yes you are right, to be 100% sure
[09:19:36] I can move them to a "backup" dir
[09:22:40] tried with "apertium", the only diff that it shows is for a docker image (that I suppose is pending to be deployed)
[09:34:15] kube_env seems to work fine
[09:40:01] <_joe_> cool
[09:43:23] so I have done
[09:43:24] for dir in `ls | grep -v README`; do echo $dir; pushd $dir; helmfile -e eqiad diff; popd; done
[09:43:39] for all services under helmfile.d, and I got some of them with changes in secrets
[09:43:53] like cxserver, going to check
[09:46:03] checking
deployment-charts, surely a pebcak
[09:47:38] yep exactly, sending a fix
[10:00:11] fixed, now the diffs look all non-secrets-related
[10:00:13] :)
[10:16:44] serviceops, Lift-Wing, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Discussion: dedicated directory in the deployment-chart repository for ML services - https://phabricator.wikimedia.org/T286791 (elukey) Current status - part of the helmfile private dir refactoring is d...
[10:37:03] going afk for lunch, will be afk for a couple of hours, ping me if anything comes up with helmfile :)
[10:54:02] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki) >>! In T290536#7364817, @Joe wrote: > I have some alternative ideas. Specifically, right now we have a limited number of di...
[11:43:13] _joe_ or akosiaris, I need to shoot off, nemo-yiannis is waiting for a puppet run on deploy1002
[11:43:19] so he can deploy on codfw
[11:43:54] but currently puppet is disabled because john is doing some puppetdb work
[11:44:17] ah nvm
[11:44:21] maintenance is done
[12:09:44] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (akosiaris) >>! In T288375#7364858, @Joe wrote: > Sigh somehow I forgot to check the size of the geoip files :/ > > I love the idea of the microservice be...
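The per-service check run earlier in this log (a shell loop over the directories under helmfile.d/services, calling `helmfile -e eqiad diff` in each and skipping README) could be sketched in Python roughly as follows. This is only an illustration: the function names `service_dirs`, `diff_command` and `diff_all` are made up here, and the only command assumed is the `helmfile -e <env> diff` invocation quoted in the log.

```python
import subprocess
from pathlib import Path


def service_dirs(root):
    """Return the sub-directories of root sorted by name; plain files (e.g. README) are skipped."""
    return sorted(p for p in Path(root).iterdir() if p.is_dir())


def diff_command(environment):
    """Build the invocation quoted in the log: helmfile -e <environment> diff."""
    return ["helmfile", "-e", environment, "diff"]


def diff_all(root, environment="eqiad", runner=subprocess.run):
    """Run the diff inside every service directory and return {directory name: exit code}."""
    results = {}
    for directory in service_dirs(root):
        print(directory.name)
        completed = runner(diff_command(environment), cwd=directory)
        results[directory.name] = completed.returncode
    return results
```

Passing the runner in as a parameter is a design choice for this sketch only: it lets the loop be exercised without helmfile installed.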
[12:59:36] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) @jijiki I guess you want to rebuild the php base image to include the patches to optimize DOM performance before running the tests
[13:52:36] serviceops, Maps, Patch-For-Review, User-jijiki: Deploy tegola-vector-tiles to kubernetes - https://phabricator.wikimedia.org/T283159 (Jgiannelos) We tried testing codfw/eqiad by just reaching out to one of the kubernetes nodes but since there is no discovery in place we faced TLS issues (kinda e...
[14:00:44] serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Reedy)
[15:49:04] serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Legoktm) It would be nice if we could also get {T219279} finally finished before we get to this, which is still leftover from the previous PHP/ICU upgrade.
[16:12:55] serviceops, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (thcipriani)
[17:14:05] serviceops, Performance-Team, Developer Productivity: Update php-wmerrors page to include request ID - https://phabricator.wikimedia.org/T291192 (Krinkle) Open→Resolved
[19:06:04] serviceops, SRE, Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (Dzahn)
[20:37:28] serviceops: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (Ottomata) I have some thoughts about labels and template defines we use. This might not be the right ticket for these questions, but since they do relate to how canary releases work, I'll ask here anyway. --- Fi...
[21:32:22] Is there a host that I (a deployer, but otherwise a mortal) can ssh into that has the same sort of egress traffic restrictions as a k8s namespace with the egress rules enabled? I'm looking for a host where `curl -vI https://meta.wikimedia.org/` will fail and where `curl -vI --proxy http://url-downloader.eqiad.wikimedia.org:8080 https://meta.wikimedia.org/` will also fail.
[21:36:24] Both commands work from mwdebug1002 and mwmaint1002 which are my go-to "what happens when I do this in prod" testing locations.
[21:37:26] I don't think such an environment exists, however I thought url-downloader was in the default egress allowlist?
[21:39:25] oh, but I don't think you can use url-downloader to hit Meta
[21:39:35] you'll want a local envoy set up for that I think.
[21:40:42] * legoktm is catching up on the phab discussion
[21:43:12] legoktm: yeah. url-downloader (sometimes, it's weird but orthogonal to this discussion) disallows connections originating from the internal network segments to connect to hosts that ultimately resolve to the same set of internal network segments.
[21:43:51] so proxying through url-downloader is the right thing for Toolhub to do to reach say https://google.com/, but not to reach https://meta.wikimedia.org/.
[21:44:35] I have some ideas about how I can make this work in my code, but I'm having some difficulty making a local test rig that I trust to tell me if my tricks are the right tricks.
[21:45:13] I think that with a python repl on a host with the same network restrictions in place I could work out the POC.
[21:46:19] I don't think that exists outside of an actual k8s container...which brings us back to the exec permission request :/
[21:46:48] *nod* That would give me 100% same-same testing for sure :)
[21:47:44] Alex said that he'd have a better idea of how long that access will take tomorrow I think.
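The routing rule described above (send external destinations such as https://google.com/ through url-downloader, but not destinations like https://meta.wikimedia.org/ that resolve back to the internal network) might be sketched in Python like this. It is a sketch under stated assumptions, not Toolhub's actual code: the proxy URL is copied from the curl command in the log, while `INTERNAL_SUFFIXES`, `is_internal` and `proxy_for` are hypothetical names, and the suffix list is a stand-in for whatever really determines "resolves to internal segments". How internal hosts are then reached (for example the local envoy mentioned above) is outside the sketch.

```python
from urllib.parse import urlsplit

# Proxy address as quoted in the curl command in the log.
URL_DOWNLOADER = "http://url-downloader.eqiad.wikimedia.org:8080"

# Hypothetical stand-in for "hosts that resolve to internal network segments".
INTERNAL_SUFFIXES = ("wikimedia.org", "wikipedia.org")


def is_internal(url):
    """True when the URL's host equals, or is a subdomain of, an internal suffix."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in INTERNAL_SUFFIXES)


def proxy_for(url):
    """Return the proxy URL for external hosts, or None when the proxy must not be used."""
    return None if is_internal(url) else URL_DOWNLOADER
```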
[21:52:11] * bd808 goes back to working on the less ideal, but short term workable hack for all this
[21:58:38] hmm, the CI output for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/722972/ has no diff
[22:01:50] legoktm: `rake run_locally[helm_diffs[$CHART]]` might give you one.
[22:02:14] where CHART=mwdebug
[22:02:37] https://wikitech.wikimedia.org/wiki/User:BryanDavis/Helm#Run_tests_for_a_particular_chart_locally
[22:16:48] bd808: I assume there's some "rake install" type thing I need to run, do you know what it is? I'm currently getting "LoadError: cannot load such file -- git"
[22:19:53] legoktm: hmmm... a very good question. There is no Gemfile in the repo to give dependencies. I wonder if I looked up the CI job to figure out what ruby modules were needed? I remember having to fiddle with things to get it working, but not the steps I took.
[22:20:21] * legoktm does that
[22:20:49] it installs "ruby-git"
[22:22:45] which on Fedora is "rubygem-git". And now to disable selinux...
[22:23:22] bd808: hm, but would that give me a diff if the change I made is only to the helmfile.d and not the chart itself?
[22:24:40] legoktm: no, that's what it uses the `git` shell for, to find the N-1 config to diff against
[22:25:14] that diff shows the changes in the generated helmfile output from HEAD~1:HEAD
[22:25:40] when I do `rake run_locally[helm_diffs[mediawiki]]` it shows no diff (and appears to finish successfully)
[22:26:30] what about for mwdebug? mediawiki doesn't have a helmfile.d does it?
[22:27:23] then I get No such file or directory @ rb_sysopen - charts/mwdebug/Chart.yaml
[22:29:02] *grumble* how did I learn this and then forget it all in like 3 weeks?
[22:32:29] legoktm: even the much slower `rake run_locally[helm_diffs]` shows no diff (which matches what jerkins did). I'm confused...
[22:34:40] OH in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/710109 I also had to edit helmfile.d/services/mwdebug/values.yaml, but that section is gone now
[22:35:15] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/711101
[22:35:42] oh... right. Your fixture is acting at the /etc/helmfile/.... defaults from puppet but the chart needs to pick which things to apply
[22:36:09] and now that it gets the list from puppet, it can't diff it anymore
[22:37:19] well, actually it does not get the list from puppet in the CI environment, so you also need to add config in your .fixtures.yaml to mock what you expect in the /etc/helmfile-defaults/mediawiki/tlsproxy.yaml Puppet generated file
[22:37:36] right
[22:37:42] that was my discovery at https://wikitech.wikimedia.org/wiki/User:BryanDavis/Helm#Helmfile.d_test_fixtures
[22:39:05] so this actually makes sense. the diff is empty because you are adding mock values, but nothing is applying them to the templates
[22:42:20] got it
[22:42:41] the actual services_proxy config is at /etc/helmfile-defaults/general-eqiad.yaml
[22:42:57] and then /etc/helmfile-defaults/mediawiki/tlsproxy.yaml has the list of enabled ones for mwdebug
[22:43:27] I'm going to merge my change just for consistency and then file a bug about cleaning this up / documenting it
[22:43:35] thanks for walking through it with me bd808
[22:43:56] no problem! I'm going to be using you as a rubber duck today too I'm sure :)
[22:44:48] The CI setup for the charts is neat but also pretty mysterious
[22:44:54] so docs would be nice
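The explanation reached above (the CI diff is empty because mock values were added, but nothing applies them to the templates) can be illustrated with a toy model. This is not how helm or helmfile render charts; it only uses Python's `string.Template` and `difflib` to show that a value no template references cannot change the rendered output. The template text, the keys and the `tlsproxy_listeners` name are all made up for the example.

```python
import difflib
from string import Template


def render(template_text, values):
    """Substitute values into the template; unknown placeholders are left untouched."""
    return Template(template_text).safe_substitute(values)


def diff(before, after):
    """Return the unified diff between two rendered texts as a list of lines."""
    return list(difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm=""))


TEMPLATE = "image: $image\nreplicas: $replicas\n"
old_values = {"image": "mwdebug:1", "replicas": "2"}
# Hypothetical fixture key added by the change under review.
new_values = dict(old_values, tlsproxy_listeners="mock")

# No template line mentions the new key, so both renders are identical.
unused = diff(render(TEMPLATE, old_values), render(TEMPLATE, new_values))

# Once a template line references the key, the same change becomes visible.
USING_TEMPLATE = TEMPLATE + "listeners: $tlsproxy_listeners\n"
used = diff(render(USING_TEMPLATE, old_values), render(USING_TEMPLATE, new_values))
```

Here `unused` is an empty list, while `used` contains a changed `listeners:` line.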