[05:03:19] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (Marostegui) p:Triage→Medium
[05:08:35] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Marostegui) p:Triage→Medium
[05:09:11] serviceops, SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Marostegui) p:Triage→Medium
[05:10:10] serviceops, SRE, VPS-project-Codesearch, HTTPS: Codesearch main page redirect uses http instead of https - https://phabricator.wikimedia.org/T290819 (Marostegui) p:Triage→Medium
[05:25:57] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[07:33:40] I've uploaded the packages with the PHP backport of the DOM parsing fix (T291052), can someone from ServiceSRE take over with the rollout to prod?
[07:33:54] _joe_: and ^ for the rebuild of the PHP image
[07:34:07] <_joe_> moritzm: sure
[07:45:43] I could do the rollout with arnoldokoth
[07:59:28] <_joe_> effie: thanks for volunteering yourself and arnold :)
[08:24:29] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Joe) I think that if we need to add the maxmind database (I'm not sure if we also need an extension), probably the best two options I can think of are: *...
[08:41:55] <_joe_> moritzm: I was proposing we do a relatively quick rollout of the new php package on parsoid
[08:42:04] <_joe_> where it's supposed to reap big benefits
[08:42:11] I will roll out as usual: 1 day on canaries/mwdebug + let's say 5/20 parsoid servers?
[08:42:22] <_joe_> while being more careful with the rest of the cluster
[08:42:28] <_joe_> effie: that seems ok to me
[08:42:57] ok sounds good
[08:47:51] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe) I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity o...
[08:49:12] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe) I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split...
[08:54:27] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (akosiaris) I would prefer the 2nd option as well (configmap), but alas, configmaps have a size limit of 1MB[1], due to etcd having that size limit (there...
[09:02:29] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Joe) Sigh somehow I forgot to check the size of the geoip files :/ I love the idea of the microservice because it would solve the problem of geoip lookup...
[11:49:55] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (jijiki) We'll first roll out on our canaries and 5 parsoid servers, and continue with the full rollout tomorrow.
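For context on the geoip-lookup microservice idea floated in T288375 above, a minimal sketch of what such a lookup service could look like, assuming the MaxMind geoip2 Python library and a City database mounted into the container; the path, port, endpoint name and response shape are illustrative guesses, not anything decided in the task.

```python
# Hypothetical sketch of a geoip-lookup service (not the design agreed in T288375).
# Assumes the MaxMind geoip2 library and a City .mmdb mounted into the container.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

import geoip2.database  # pip install geoip2
import geoip2.errors

DB_PATH = "/usr/share/GeoIP/GeoIP2-City.mmdb"  # illustrative mount point
reader = geoip2.database.Reader(DB_PATH)

class LookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests like /lookup?ip=203.0.113.7 (endpoint name is made up)
        query = parse_qs(urlparse(self.path).query)
        ip = query.get("ip", [None])[0]
        try:
            city = reader.city(ip)
            body = json.dumps({
                "country": city.country.iso_code,
                "city": city.city.name,
                "coordinates": [city.location.latitude, city.location.longitude],
            }).encode()
            self.send_response(200)
        except (geoip2.errors.AddressNotFoundError, ValueError, TypeError):
            body = b'{"error": "not found"}'
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8090), LookupHandler).serve_forever()
```

A service along these lines would keep the .mmdb files (well over the 1MB ConfigMap limit akosiaris mentions) out of the MediaWiki containers entirely, which is presumably why the microservice option is attractive in that thread.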
[12:08:23] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[12:19:03] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) @ssastry we have done some benchmarks, but none of those were parsoid URLs, it would be great if you would provide a couple of par...
[13:45:49] serviceops, SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Reedy) Open→Stalled
[13:53:36] serviceops, Infrastructure-Foundations, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (jijiki)
[14:09:37] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (jijiki)
[14:13:59] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (cmooney) Thanks Effie. I think as well as the microbursts / drops you observed at the server-side, on the 1G interfaces, performance is probably impacted by on...
[14:29:23] do we still have services running in stretch containers in production? if so they might need a rebuild for https://phabricator.wikimedia.org/T283165#7365637
[14:33:04] <_joe_> moritzm: yes I think so
[14:38:58] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (cmooney) Ok so looking at the results from the two hosts in question I'm not sure we can make any definitive conclusions. Following the switchover back to eqia...
[16:29:29] _joe_ moritzm I will roll out to 5 parsoid servers today, and do canaries tomorrow
[16:29:40] <_joe_> effie: ack
[16:29:51] schedule went a little behind today
[16:31:14] effie: ack, sounds good
[17:50:24] Quick question: how often does LVS health-check run?
[17:53:15] Pchelolo: max-delay: 300 seems to indicate 5 minutes
[17:53:26] ok, thank you mutante!
[17:53:28] np
[17:56:06] Pchelolo: it's configurable per service, but all are set to 300 individually, except labweb is 30
[18:01:57] there are two kinds of health checking, right? there's sending an http request to a healthchecking endpoint, and there's also the 'idlechannel' healthchecking to notice a process that terminated, I think?
[18:05:50] this one is the request to https://localhost/healthz
[18:07:49] sure, just wanted to note in case it was important context
[18:07:50] https://phabricator.wikimedia.org/diffusion/ODCB/browse/1.15/pybal/monitors/idleconnection.py
[18:10:06] then there is also a check_https_lvs_on_port to a discovery.wmnet name. check_https_lvs_on_port does not have special icinga config for the duration. the default value for command_check_interval is -1, which is supposed to mean "as often as possible"
[18:12:21] <_joe_> cdanis and mutante are talking about two different things
[18:12:35] <_joe_> there is the icinga health check that runs every 5 minutes
[18:12:47] <_joe_> and the pybal health checks that run every second or so
[18:13:01] <_joe_> in kubernetes, you also have liveness and readiness probes happening
[18:13:24] <_joe_> (that's why I want to remove all url fetching from pybal checks for k8s services, btw)
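To make the distinction above concrete, here is a rough sketch of what a pybal-style URL health check boils down to: fetch a health endpoint on a short interval, independently of the much slower Icinga checks. This is not pybal's actual implementation (pybal is Twisted-based, and the idle-connection monitor linked above instead watches a long-lived TCP connection to notice a dead process); the URL, interval, timeout and depooling behaviour are illustrative only.

```python
# Simplified illustration of a ProxyFetch-style health check loop.
# Real pybal is asynchronous (Twisted) and depools/repools servers in LVS;
# this only shows the shape of the check.
import ssl
import time
import urllib.request

HEALTH_URL = "https://localhost/healthz"   # endpoint mentioned in the discussion
INTERVAL = 1.0                             # pybal-style: roughly every second
TIMEOUT = 0.5                              # illustrative timeout

def check_once(url: str) -> bool:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False             # localhost cert won't match; illustration only
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        healthy = check_once(HEALTH_URL)
        print(f"{time.strftime('%H:%M:%S')} healthy={healthy}")
        # A real monitor would depool the backend after N consecutive failures.
        time.sleep(INTERVAL)
```

Kubernetes liveness/readiness probes cover the same ground per pod, which is the motivation _joe_ gives above for dropping URL fetching from the pybal checks for k8s services.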
[18:18:28] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle) p:Triage→Medium
[18:19:44] what's the latest on k8s lb, btw? still NodePort?
[19:33:16] _joe_: just in theory, if we add access logging to tls proxy envoy that only logs on HTTP 503/504 - would you think that's a crazy idea?
[19:33:33] and only logs locally into the container
[19:35:58] cause it seems like all node services running in k8s are returning 503 from time to time, for no apparent reason. node doesn't log anything
[19:40:07] I think I've seen this on eventgate as well
[19:42:14] yeah, eventgate is how we got to this
[19:42:56] but it's happening for all services
[19:43:22] and we've poked at it from node side, and from MW side, and the problem is somewhere in the middle
[19:46:06] Pchelolo: have you tried https://phabricator.wikimedia.org/T287288#7265748 ?
[20:19:30] Does anyone have a suggestion on who I could gently nudge about https://phabricator.wikimedia.org/T290357? (wanting a way to `kubectl exec` on the prod k8s clusters)
[20:50:40] bd808: it briefly came up during the serviceops meeting today, a.kosiaris is working on a refactor of users/groups in k8s, partially with yours and dancy's requests in mind, so you should nudge him :)
[21:01:07] <_joe_> Pchelolo: we are already logging 503s/504s from envoy
[21:04:45] _joe_: oh, where do I find them?
[21:08:34] legoktm: ack. that sounds like a slow train to ride, but I will try to find ak.osiaris online tomorrow and see what he thinks it is going to take.
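On the "only log 503/504" idea: envoy's access log does support status-code filters, so the filtering could live in the proxy configuration itself rather than in post-processing. As a quick way to dig the 503/504 entries _joe_ mentions out of a JSON access log, something like the sketch below would work; the log path and the "response_code" field name are assumptions for illustration, not the actual production setup.

```python
# Minimal sketch: pull 503/504 entries out of an envoy JSON access log.
# Assumes a JSON log format where %RESPONSE_CODE% was mapped to "response_code"
# and a made-up log path; both are assumptions, not the production config.
import json
import sys

LOG_PATH = "/var/log/envoy/access.log"     # hypothetical path
WANTED = {503, 504}

def interesting_entries(path: str):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue                    # skip non-JSON lines (startup output etc.)
            if entry.get("response_code") in WANTED:
                yield entry

if __name__ == "__main__":
    for entry in interesting_entries(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH):
        print(json.dumps(entry))
```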