[06:03:26] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Marostegui) That would work for me too @wkandek - thanks!
[06:24:19] serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (Joe) I have one doubt about the idea of using persistent local volumes... that would mean tying pods to specific no...
[08:21:46] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:22:12] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:48:28] serviceops, MW-on-K8s, SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (Joe) p:Triage→Medium a:Joe Access logs and other logs are easy to collect as well. See: - https://istio.io/latest/docs/tasks/observability/logs/ - https://istio.io/l...
[09:00:58] serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (JMeybohm) I think the basic idea would be, similar to hostPath, to basically have one PV per node. So this would no...
[09:01:48] serviceops, SRE, Thumbor: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (Jelto) During the refresh of old mw app servers in eqiad we noticed that thumbor machines `thumbor1001` and `thumbor1002` are renamed/reimaged mw hosts. As mentioned in T280203 and T233196 these machi...
[09:35:25] serviceops, SRE, Thumbor: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (jijiki) @Jelto this work is currently stalled, but T285477 is created to accommodate the oldest 2 thumbor hosts.
[13:05:46] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[13:06:33] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[13:24:09] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Joe)
[13:24:36] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Joe) As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplified helmfile using helm3, and the command `...
[13:50:00] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1293.eqiad.wmnet` - m...
[14:07:34] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1294.eqiad.wmnet` - m...
[14:17:48] serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (Joe) The problem is that we'd be forced to mount hostPath as 'read-write' in all pods and allow the first one that...
[14:27:56] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[14:33:42] serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (JMeybohm) I would agree that adding PV(C) stuff potentially makes things way more complicated than they would be us...
[14:37:33] should a canary jobrunner have a lower weight than regular jobrunners (because it's a canary) or a higher weight (because it's newer hardware and in codfw I see they have 25 vs 10)?
[14:50:10] <_joe_> mutante: neither, imho
[14:50:13] <_joe_> it should be the same
[14:55:04] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) >>! In T251305#7227228, @Joe wrote: > As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplifi...
[14:55:26] _joe_: ok, thanks, I'll do that with new servers in eqiad (and later check on codfw about the existing difference)
[14:56:29] also the special cases of dedicated jobrunner/dedicated videoscaler but that's separate
[14:56:56] <_joe_> mutante: please balance all server groups evenly across rows
[14:57:02] <_joe_> also jelto :)
[14:57:09] <_joe_> in codfw we had 40% of apis in row A
[14:57:16] <_joe_> which is what caused the issue the other day
[14:58:19] for row B that was done, for row A we might have to convert some appservers into APIs, but need to count the totals, not just new ones, will do
[14:58:49] all new ones in A3 are appservers, that might balance it out
[15:01:49] I sorted all new mw servers by rows in https://phabricator.wikimedia.org/T279309 to have a better overview of which appservers are running where.
Hope this helps a bit
[15:03:34] yes, I noticed that and it did help, checking the global numbers including old ones
[15:05:04] I can add the current status for all
[15:06:52] https://phabricator.wikimedia.org/P16841
[15:16:38] Counter({'B': 19, 'D': 18, 'C': 17, 'A': 9}) for the summary
[15:20:24] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Volans) FYI if that helps this is the current row-distribution of the API appservers in eqiad: ` {'B': 19, 'D': 18, 'C': 17, 'A': 9} ` Full details at P16841
[15:26:40] <_joe_> btw you should count not just servers, but their weight in the cluster :)
[15:27:00] <_joe_> volans: that is a pretty good distribution
[15:27:17] <_joe_> we can lose any row and it won't be more than ~ 30% of all servers
[16:08:28] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1421.eqiad.wmnet ` The log can be found in `/var/log/wmf-...
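(Editor's note) The row-balance check discussed between 15:16 and 15:27 can be sketched roughly as follows. This is only an illustration, not the script behind P16841: the per-row counts are the ones quoted in the log for the eqiad API appservers, while the helper name `row_shares` and the weight of 1 per server are assumptions made here. Real pool weights would be substituted to follow _joe_'s advice to count weight rather than servers.

```python
from collections import Counter


def row_shares(servers):
    """Return each row's fraction of the total cluster weight.

    `servers` is an iterable of (row, weight) pairs. The helper name and
    input shape are hypothetical, chosen for this sketch only.
    """
    totals = Counter()
    for row, weight in servers:
        totals[row] += weight
    grand_total = sum(totals.values())
    return {row: weight / grand_total for row, weight in totals.items()}


# Per-row server counts as quoted in the log (eqiad API appservers).
counts = {"B": 19, "D": 18, "C": 17, "A": 9}

# Assumption: every server has weight 1; replace with real pool weights.
servers = [(row, 1) for row, n in counts.items() for _ in range(n)]

shares = row_shares(servers)
worst_row = max(shares, key=shares.get)
print(f"largest row: {worst_row} ({shares[worst_row]:.1%} of total weight)")
```

With equal weights the largest row is B at 19 of 63 servers, which matches the "~ 30%" figure in the log. With unequal weights the result can differ from a plain server count, which is the point of the 15:26:40 remark.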
[16:09:30] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1421.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1421.eqiad.wmnet'] `
[16:30:28] I've updated the above paste with the weights too
[16:31:30] https://phabricator.wikimedia.org/P16842 is codfw
[16:44:01] serviceops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn) ACKed some more today, gerrit2001.mgmt, wdqs2002.mgmt
[16:51:36] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Volans) FYI I've updated the pastes for eqiad and codfw with some more detailed data, all yours now :)
[19:07:41] serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) {F34558993} Mcrouter instances in codfw are connecting directly to memcached hosts in eqiad
[19:08:38] serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[19:08:50] serviceops, SRE, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[19:09:49] serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) Open→Resolved a:jijiki
[20:11:58] Can anyone tell me what I need to do to move https://phabricator.wikimedia.org/T280881 from "externally blocked" to "In
progress"?
[20:12:17] That's the "deploy Toolhub" task
[20:37:28] * legoktm looks
[20:52:56] bd808: I added the checklist to the task and asked some questions
[20:53:10] thanks legoktm!
[22:08:00] serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (dduvall) >>! In T286952#7226319, @Joe wrote: > I have one doubt about the idea of using persistent local volumes......
[22:32:43] serviceops, SRE, docker-pkg, Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (RLazarus) p:Triage→Medium