[07:17:57] <_joe_> https://github.com/johanhaleby/kubetail is pretty slick and can be useful
[07:28:17] I have used https://github.com/wercker/stern for that mostly
[07:28:45] looks a bit abandoned, though
[07:29:19] <_joe_> jayme: what I like about kubetail is it's a simple bash script
[07:29:27] <_joe_> which is almost all you really need for this task
[07:29:27] indeed
[07:30:47] there even is a debian package for that :-o
[07:32:06] <_joe_> uhhh
[07:32:14] <_joe_> for buster?
[07:32:18] yeah
[07:32:21] <_joe_> lol
[07:38:47] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T293728 (10JMeybohm)
[07:41:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (10JMeybohm)
[09:11:10] I flipped graphite read traffic to codfw, what are good/relevant mw dashboards I could use to validate things are working as expected ?
[09:13:30] <_joe_> godog: https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1
[09:14:32] thank you _joe_
[09:14:34] LGTM
[09:14:47] I'll be moving write traffic shortly
[09:22:38] speaking of which, most of the change for writers (most notably mw) will come from https://gerrit.wikimedia.org/r/c/operations/puppet/+/731433
[09:22:53] what's a cumin selector I should use to force a puppet run ?
[09:26:46] e.g. C:mediawiki::packages, that also targets the various related mediawiki hosts like parsoid or labweb
[09:35:09] thank you moritzm ! appreciate it
[09:36:33] yep 366 hosts
[09:36:49] use some batch :-P
[09:40:26] <_joe_> oh this reminds me we don't have that in mediawiki on k8s
[09:52:34] volans: hehe yeah I'm batching 50 or so
[09:52:46] too many
[09:52:47] for puppet
[09:52:53] godog: ^^^
[09:53:26] mmhh I'll try again with 30
[09:53:39] where's the bottleneck ?
[09:53:46] https://phabricator.wikimedia.org/T280622
[09:54:12] we didn't come up with a safe number in the end :D
[09:54:15] so pick yours
[09:54:52] yeah 30 seems right
[09:56:00] but as noted in the task it does depend on which catalogs are targeted for sure
[11:14:56] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (10JMeybohm)
[14:40:33] 10serviceops, 10MW-on-K8s, 10SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm)
[15:28:25] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) {F34697825}
[16:35:18] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10Reedy)
[17:37:21] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10MoritzMuehlenhoff) There's no reason for T263437 to be a sub task? It's unrelated work and only needed when we move to a new OS (with a new ICU), but not when we merely migrate to a new PHP release.
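The batched puppet run discussed earlier (09:22–09:56) could be expressed with cumin roughly as in the sketch below. The C:mediawiki::packages selector and the batch size of 30 are taken from the conversation; the batch sleep of 300 seconds and the use of the run-puppet-agent wrapper are assumptions, not commands anyone confirmed running.

    # Sketch under assumptions: force a puppet run on hosts carrying the
    # mediawiki::packages class, 30 hosts per batch with a pause in between.
    # The batch sleep (-s 300) is a guess; tune it to puppetmaster load.
    sudo cumin -b 30 -s 300 'C:mediawiki::packages' 'run-puppet-agent'

As noted in T280622, a "safe" batch size depends on which catalogs are targeted, so the numbers above are illustrative only.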
[17:59:11] merging "gitlab: remove cas3 from external providers"
[18:00:15] brennen: deployed ^
[18:00:25] mutante: thx!
[18:00:34] Exec[Reconfigure GitLab] is running right now
[18:00:39] arnoldokoth: cc: ^
[18:00:51] wait for it, it did not finish yet
[18:01:17] GitLab]/returns: Chef Infra Client finished, 18/649 resources updated in 45 seconds
[18:01:23] Notice: /Stage[main]/Gitlab/Service[gitlab-ce]: Triggered 'refresh' from 1 event
[18:01:26] Notice: Applied catalog in 80.49 seconds
[18:01:34] Chef and Puppet but Ansible is gone, heh
[18:02:05] brennen: ok, now it was refreshed and still up
[18:04:03] mutante: looks good, confirming that the value is unset in config, will keep an eye on it and make sure we get the expected result for new logins.
[18:04:34] brennen: great, thanks!
[18:06:08] ah, the same done on gitlab2001 (replica) right now
[18:06:29] though that should not have logins, right
[18:38:32] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10jijiki) >>! In T271736#7441790, @MoritzMuehlenhoff wrote: > There's no reason for T263437 to be a sub task? It's unrelated work and only needed when we move to a new OS (with a new ICU), but not when we me...
[19:05:13] btw this is why systemd state on mwmaint is degraded. one of the mwmaint periodic jobs fails because of a Fatal in MediaWiki after some recent change
[19:05:18] https://phabricator.wikimedia.org/T293702
[19:05:26] but looks like Reedy already has the fix
[20:48:48] 10serviceops, 10Anti-Harassment, 10IP Info, 10SRE, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) @Joe So for the current/pre-k8s setup this is resolved, minus one Hiera flip to enable on all appservers w...
[21:00:00] on deploy1002 the "deploy to mwdebug" service failed
[21:00:16] trying to start it
[21:01:08] hrmm. deploy-mwdebug[17461]: ERROR:root:A previous deployment failed. Check the file at /var/lib/deploy-mwdebug/error and re-run manually with --force
[21:01:25] not sure I want to --force yet
[21:02:26] the error file above exists but there is only a timestamp in it
[21:13:43] mutante: basically if a deployment fails, it just marks it and requires manual re-intervention to go back to auto deploys
[21:14:35] I see some syncs from joe earlier today https://sal.toolforge.org/log/6hH5mHwB1jz_IcWurj3O
[21:14:41] legoktm: aha, thanks. so the normal procedure would be to delete the error file, like a lock file, and then start it ..without --force?
[21:15:11] it looks more CRIT than it probably should just because of the whole chain to an Icinga alert that says "systemd broken on deploy1002" you know
[21:15:21] then I looked why etc
[21:16:04] oh, looking at SAL some more
[21:17:43] ACK, nothing crit here, just maybe the question if it should have the prod alert already
[21:18:08] I think having it alert makes sense since someone needs to manually intervene
[21:18:59] it should clear soon now
[21:19:16] oh, did you --force it?
[21:20:00] I see it succeed. confirmed :)
[21:21:32] yeah :)
[21:22:14] well, OK, ACK:)
[21:45:54] in a completely different matter (but all Icinga alerts), there was "CRIT: large files in puppet client bucket" on mwmaint for some time. I ended up deleting files over 100MB from /var/lib/puppet/clientbucket/ using find and that cleared it. related ticket to that stuff that pops up every once in a while is https://phabricator.wikimedia.org/T165885 because we already started puppetizing
[21:46:00] crons for that but they are only active in cloud so far
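A minimal sketch of the clientbucket cleanup described at 21:45, assuming the 100MB threshold mentioned there; the exact flags are an assumption based on the description, not the command that was actually run.

    # Sketch under assumptions: inspect, then delete, backed-up files larger
    # than 100MB in the puppet client bucket on mwmaint (needs root).
    sudo find /var/lib/puppet/clientbucket/ -type f -size +100M -ls
    sudo find /var/lib/puppet/clientbucket/ -type f -size +100M -delete

Listing before deleting keeps a record of what cleared the Icinga "large files in puppet client bucket" alert.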