[07:17:57] <_joe_> https://github.com/johanhaleby/kubetail is pretty slick and can be useful
[07:28:17] I have used https://github.com/wercker/stern for that mostly
[07:28:45] looks a bit abandoned, though
[07:29:19] <_joe_> jayme: what I like about kubetail is it's a simple bash script
[07:29:27] <_joe_> which is almost all you really need for this task
[07:29:27] indeed
[07:30:47] there even is a debian package for that :-o
[07:32:06] <_joe_> uhhh
[07:32:14] <_joe_> for buster?
[07:32:18] yeah
[07:32:21] <_joe_> lol
[07:38:47] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T293728 (10JMeybohm)
[07:41:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (10JMeybohm)
[09:11:10] I flipped graphite read traffic to codfw, what are good/relevant mw dashboards I could use to validate things are working as expected ?
[09:13:30] <_joe_> godog: https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1
[09:14:32] thank you _joe_
[09:14:34] LGTM
[09:14:47] I'll be moving write traffic shortly
[09:22:38] speaking of which, most of the change for writers (most notably mw) will come from https://gerrit.wikimedia.org/r/c/operations/puppet/+/731433
[09:22:53] what's a cumin selector I should use to force a puppet run ?
[09:26:46] e.g. C:mediawiki::packages, that also targets the various related mediawiki hosts like parsoid or labweb
[09:35:09] thank you moritzm ! appreciate it
[09:36:33] yep 366 hosts
[09:36:49] use some batch :-P
[09:40:26] <_joe_> oh this reminds me we don't have that in mediawiki on k8s
[09:52:34] volans: hehe yeah I'm batching 50 or so
[09:52:46] too many
[09:52:47] for puppet
[09:52:53] godog: ^^^
[09:53:26] mmhh I'll try again with 30
[09:53:39] where's the bottleneck ?
[09:53:46] https://phabricator.wikimedia.org/T280622
[09:54:12] we didn't come up with a safe number in the end :D
[09:54:15] so pick yours
[09:54:52] yeah 30 seems right
[09:56:00] but as noted in the task it does depend on which catalogs are targeted for sure
[11:14:56] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (10JMeybohm)
[14:40:33] 10serviceops, 10MW-on-K8s, 10SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm)
[15:28:25] 10serviceops, 10MW-on-K8s, 10SRE, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) {F34697825}
[16:35:18] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10Reedy)
[17:37:21] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10MoritzMuehlenhoff) There's no reason for T263437 to be a sub task? It's unrelated work and only needed when we move to a new OS (with a new ICU), but not when we merely migrate to a new PHP release.
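The batched puppet run discussed earlier (09:22–09:56) could be expressed with cumin roughly as in the sketch below. The C:mediawiki::packages selector and the batch size of 30 are taken from the conversation; the batch sleep of 300 seconds and the use of the run-puppet-agent wrapper are assumptions, not commands anyone confirmed running.

    # Sketch under assumptions: force a puppet run on hosts carrying the
    # mediawiki::packages class, 30 hosts per batch with a pause in between.
    # The batch sleep (-s 300) is a guess; tune it to puppetmaster load.
    sudo cumin -b 30 -s 300 'C:mediawiki::packages' 'run-puppet-agent'

As noted in T280622, a "safe" batch size depends on which catalogs are targeted, so the numbers above are illustrative only.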
[17:59:11] merging "gitlab: remove cas3 from external providers"
[18:00:15] brennen: deployed ^
[18:00:25] mutante: thx!
[18:00:34] Exec[Reconfigure GitLab] is running right now
[18:00:39] arnoldokoth: cc: ^
[18:00:51] wait for it, it did not finish yet
[18:01:17] GitLab]/returns: Chef Infra Client finished, 18/649 resources updated in 45 seconds
[18:01:23] Notice: /Stage[main]/Gitlab/Service[gitlab-ce]: Triggered 'refresh' from 1 event
[18:01:26] Notice: Applied catalog in 80.49 seconds
[18:01:34] Chef and Puppet but Ansible is gone, heh
[18:02:05] brennen: ok, now it was refreshed and still up
[18:04:03] mutante: looks good, confirming that the value is unset in config, will keep an eye on it and make sure we get the expected result for new logins.
[18:04:34] brennen: great, thanks!
[18:06:08] ah, the same done on gitlab2001 (replica) right now
[18:06:29] though that should not have logins, right
[18:38:32] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10jijiki) >>! In T271736#7441790, @MoritzMuehlenhoff wrote: > There's no reason for T263437 to be a sub task? It's unrelated work and only needed when we move to a new OS (with a new ICU), but not when we me...
[19:05:13] btw this is why systemd state on mwmaint is degraded. one of the mwmaint periodic jobs fails because of a Fatal in MediaWiki after some recent change
[19:05:18] https://phabricator.wikimedia.org/T293702
[19:05:26] but looks like Reedy already has the fix
[20:48:48] 10serviceops, 10Anti-Harassment, 10IP Info, 10SRE, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) @Joe So for the current/pre-k8s setup this is resolved, minus one Hiera flip to enable on all appservers w...
[21:00:00] on deploy1002 the "deploy to mwdebug" service failed
[21:00:16] trying to start it
[21:01:08] hrmm. deploy-mwdebug[17461]: ERROR:root:A previous deployment failed. Check the file at /var/lib/deploy-mwdebug/error and re-run manually with --force
[21:01:25] not sure I want to --force yet
[21:02:26] the error file above exists but there is only a timestamp in it
[21:13:43] mutante: basically if a deployment fails, it just marks it and requires manual re-intervention to go back to auto deploys
[21:14:35] I see some syncs from joe earlier today https://sal.toolforge.org/log/6hH5mHwB1jz_IcWurj3O
[21:14:41] legoktm: aha, thanks. so the normal procedure would be to delete the error file, like a lock file, and then start it ..without --force?
[21:15:11] it looks more CRIT than it probably should just because of the whole chain to an Icinga alert that says "systemd broken on deploy1002" you know
[21:15:21] then I looked why etc
[21:16:04] oh, looking at SAL some more
[21:17:43] ACK, nothing crit here, just maybe the question if it should have the prod alert already
[21:18:08] I think having it alert makes sense since someone needs to manually intervene
[21:18:59] it should clear soon now
[21:19:16] oh, did you --force it?
[21:20:00] I see it succeed. confirmed :)
[21:21:32] yeah :)
[21:22:14] well, OK, ACK:)
[21:45:54] in a completely different matter (but all Icinga alerts), there was "CRIT: large files in puppet client bucket" on mwmaint for some time. I ended up deleting files over 100MB from /var/lib/puppet/clientbucket/ using find and that cleared it. related ticket to that stuff that pops up every once in a while is https://phabricator.wikimedia.org/T165885 because we already started puppetizing
[21:46:00] crons for that but they are only active in cloud so far
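A minimal sketch of the clientbucket cleanup described at 21:45, assuming the 100MB threshold mentioned there; the exact flags are an assumption based on the description, not the command that was actually run.

    # Sketch under assumptions: inspect, then delete, backed-up files larger
    # than 100MB in the puppet client bucket on mwmaint (needs root).
    sudo find /var/lib/puppet/clientbucket/ -type f -size +100M -ls
    sudo find /var/lib/puppet/clientbucket/ -type f -size +100M -delete

Listing before deleting keeps a record of what cleared the Icinga "large files in puppet client bucket" alert.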