[06:03:26] <wikibugs>	 serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Marostegui) That would work for me too @wkandek - thanks!
[06:24:19] <wikibugs>	 serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (Joe) I have one doubt about the idea of using persistent local volumes... that would mean tying pods to specific no...
[08:21:46] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:22:12] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:48:28] <wikibugs>	 serviceops, MW-on-K8s, SRE: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (Joe) p:Triage→Medium a:Joe Access logs and other logs are easy to collect as well. See:  - https://istio.io/latest/docs/tasks/observability/logs/ - https://istio.io/l...
[09:00:58] <wikibugs>	 serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (JMeybohm) I think the basic idea would be, similar to hostPath, to basically have one PV per node. So this would no...
[09:01:48] <wikibugs>	 serviceops, SRE, Thumbor: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (Jelto) During the refresh of old mw app servers in eqiad we noticed that thumbor machines `thumbor1001` and `thumbor1002` are renamed/reimaged mw hosts. As mentioned in T280203 and T233196 these machi...
[09:35:25] <wikibugs>	 serviceops, SRE, Thumbor: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (jijiki) @Jelto this work is currently stalled, but T285477 is created to accommodate the oldest 2 thumbor hosts.
[13:05:46] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[13:06:33] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[13:24:09] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Joe)
[13:24:36] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Joe) As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplified helmfile using helm3, and the command  `...
[13:50:00] <wikibugs>	 serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13  (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1293.eqiad.wmnet` - m...
[14:07:34] <wikibugs>	 serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13  (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1294.eqiad.wmnet` - m...
[14:17:48] <wikibugs>	 serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (Joe) The problem is that we'd be forced to mount hostPath as 'read-write' in all pods and allow the first one that...
[14:27:56] <wikibugs>	 serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13  (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[14:33:42] <wikibugs>	 serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (JMeybohm) I would agree that adding PV(C) stuff potentially makes things way more complicated than they would be us...
[14:37:33] <mutante>	 should a canary jobrunner have a lower weight than regular jobrunners (because it's a canary) or a higher weight (because it's newer hardware and in codfw I see they have 25 vs 10)
[14:50:10] <_joe_>	 mutante: neither, imho
[14:50:13] <_joe_>	 it should be the same
[14:55:04] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) >>! In T251305#7227228, @Joe wrote: > As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplifi...
[14:55:26] <mutante>	 _joe_: ok, thanks, i'll do that with new servers in eqiad (and later check on codfw about the existing difference)
[14:56:29] <mutante>	 also the special cases of dedicated jobrunner/dedicated videoscaler but that's separate
[14:56:56] <_joe_>	 mutante: please balance all server groups evenly across rows
[14:57:02] <_joe_>	 also jelto  :)
[14:57:09] <_joe_>	 in codfw we had 40% of apis in row A
[14:57:16] <_joe_>	 which is what caused the issue the other day
[14:58:19] <mutante>	 for row B that was done, for row A we might have to convert some appservers into APIs, but need to count the totals, not just new ones, will do
[14:58:49] <mutante>	 all new ones in A3 are appservers, that might balance it out
[15:01:49] <jelto>	 I sorted all new mw servers by rows in https://phabricator.wikimedia.org/T279309 to have a better overview what appservers are running where. Hope this helps a bit
[15:03:34] <mutante>	 yes, I noticed that and it did help, checking the global numbers including old ones
[15:05:04] <volans>	 I can add the current status for all
[15:06:52] <volans>	 https://phabricator.wikimedia.org/P16841
[15:16:38] <volans>	 Counter({'B': 19, 'D': 18, 'C': 17, 'A': 9}) for the summary
[15:20:24] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Volans) FYI if that helps this is the current row-distribution of the API appservers in eqiad: ` {'B': 19, 'D': 18, 'C': 17, 'A': 9} `  Full details at P16841
[15:26:40] <_joe_>	 btw you should count not just servers, but their weight in the cluster :)
[15:27:00] <_joe_>	 volans: that is a pretty good distribution
[15:27:17] <_joe_>	 we can lose any row and it won't be more than ~ 30% of all servers
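Editorial note: the "~ 30%" figure above can be checked directly from the row counts volans pasted. What follows is a minimal sketch, not anything run in the channel. The function name `row_shares` and the per-server weights in the second half are hypothetical; the real weights live in the pastes (P16841, P16842) and are not reproduced here.

```python
from collections import Counter

# Row counts for the API appservers in eqiad, as pasted in the channel.
api_servers_per_row = Counter({"B": 19, "D": 18, "C": 17, "A": 9})


def row_shares(per_row):
    """Return each row's fraction of the total (server counts or summed weights)."""
    total = sum(per_row.values())
    return {row: value / total for row, value in per_row.items()}


# Share of servers lost if a single row goes down, by plain server count.
shares = row_shares(api_servers_per_row)
worst_row, worst_share = max(shares.items(), key=lambda item: item[1])
print(f"largest row: {worst_row} at {worst_share:.1%}")  # largest row: B at 30.2%

# Weighted variant, following the remark about counting weights rather than servers.
# These weights are made up for illustration and are NOT taken from the pastes.
hypothetical_weights = {"A": [25, 25], "B": [10, 10, 10]}
weighted = row_shares({row: sum(w) for row, w in hypothetical_weights.items()})
print(weighted)  # row A holds fewer servers here but the larger share of weight
```

With the pasted counts the largest row is B with 19 of 63 servers, which matches the "~ 30%" estimate. The weighted variant shows why counting servers alone can mislead: a row with fewer but heavier servers can still carry most of the traffic.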
[16:08:28] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1421.eqiad.wmnet ` The log can be found in `/var/log/wmf-...
[16:09:30] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1421.eqiad.wmnet'] `  Of which those **FAILED**: ` ['mw1421.eqiad.wmnet'] `
[16:30:28] <volans>	 I've updated the above paste with the weights too
[16:31:30] <volans>	 https://phabricator.wikimedia.org/P16842 is codfw
[16:44:01] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn) ACKed some more today, gerrit2001.mgmt,  wdqs2002.mgmt
[16:51:36] <wikibugs>	 serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Volans) FYI I've updated the pastes for eqiad and codfw with some more detailed data, all yours now :)
[19:07:41] <wikibugs>	 serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) {F34558993}  Mcrouter instances in codfw are connecting directly to memcached hosts in eqiad
[19:08:38] <wikibugs>	 serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki)
[19:08:50] <wikibugs>	 serviceops, SRE, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (jijiki)
[19:09:49] <wikibugs>	 serviceops, SRE, Patch-For-Review, Performance-Team (Radar), User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (jijiki) Open→Resolved a:jijiki
[20:11:58] <bd808>	 Can anyone tell me what I need to do to move https://phabricator.wikimedia.org/T280881 from "externally blocked" to "In progress"?
[20:12:17] <bd808>	 That's the "deploy Toolhub" task
[20:37:28] * legoktm looks
[20:52:56] <legoktm>	 bd808: I added the checklist to the task and asked some questions
[20:53:10] <bd808>	 thanks legoktm!
[22:08:00] <wikibugs>	 serviceops, MW-on-K8s, Release-Engineering-Team (Next): Perform l10n cache rebuild using initContainers instead of including it in the image - https://phabricator.wikimedia.org/T286952 (dduvall) >>! In T286952#7226319, @Joe wrote: > I have one doubt about the idea of using persistent local volumes......
[22:32:43] <wikibugs>	 serviceops, SRE, docker-pkg, Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (RLazarus) p:Triage→Medium