[05:03:19] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (Marostegui) p:Triage→Medium
[05:08:35] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Marostegui) p:Triage→Medium
[05:09:11] serviceops, SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Marostegui) p:Triage→Medium
[05:10:10] serviceops, SRE, VPS-project-Codesearch, HTTPS: Codesearch main page redirect uses http instead of https - https://phabricator.wikimedia.org/T290819 (Marostegui) p:Triage→Medium
[05:25:57] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[07:33:40] I've uploaded the packages with the PHP backport of the DOM parsing fix (T291052), can someone from ServiceSRE take over with the rollout to prod?
[07:33:54] _joe_: and ^ for the rebuild of the PHP image
[07:34:07] <_joe_> moritzm: sure
[07:45:43] I could do the rollout with arnoldokoth
[07:59:28] <_joe_> effie: thanks for volunteering yourself and arnold :)
[08:24:29] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Joe) I think that if we need to add the maxmind database (I'm not sure if we also need an extension), probably the best two options I can think of are: *...
[08:41:55] <_joe_> moritzm: I was proposing we do a relatively quick rollout of the new php package on parsoid
[08:42:04] <_joe_> where it's supposed to reap big benefits
[08:42:11] I will roll out as usual: 1 day on canaries/mwdebug + let's say 5/20 parsoid servers?
[08:42:22] <_joe_> while being more careful with the rest of the cluster
[08:42:28] <_joe_> effie: that seems ok to me
[08:42:57] ok sounds good
[08:47:51] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe) I have some alternative ideas. Specifically, right now we have a limited number of different clusters, due to the complexity o...
[08:49:12] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Joe) I forgot to add: offering the beta feature would be nice, and given it only regards logged-in users, it would not need a split...
[08:54:27] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (akosiaris) I would prefer the 2nd option as well (configmap), but alas, configmaps have a size limit of 1MB[1], due to etcd having that size limit (there...
[09:02:29] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Joe) Sigh somehow I forgot to check the size of the geoip files :/ I love the idea of the microservice because it would solve the problem of geoip lookup...
[11:49:55] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (jijiki) We'll first roll out on our canaries and 5 parsoid servers, and continue with the full rollout tomorrow.
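For context on the geoip-lookup microservice idea floated in T288375 above, a minimal sketch of what such a lookup service could look like, assuming the MaxMind geoip2 Python library and a City database mounted into the container; the path, port, endpoint name and response shape are illustrative guesses, not anything decided in the task.

```python
# Hypothetical sketch of a geoip-lookup service (not the design agreed in T288375).
# Assumes the MaxMind geoip2 library and a City .mmdb mounted into the container.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

import geoip2.database  # pip install geoip2
import geoip2.errors

DB_PATH = "/usr/share/GeoIP/GeoIP2-City.mmdb"  # illustrative mount point
reader = geoip2.database.Reader(DB_PATH)

class LookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests like /lookup?ip=203.0.113.7 (endpoint name is made up)
        query = parse_qs(urlparse(self.path).query)
        ip = query.get("ip", [None])[0]
        try:
            city = reader.city(ip)
            body = json.dumps({
                "country": city.country.iso_code,
                "city": city.city.name,
                "coordinates": [city.location.latitude, city.location.longitude],
            }).encode()
            self.send_response(200)
        except (geoip2.errors.AddressNotFoundError, ValueError, TypeError):
            body = b'{"error": "not found"}'
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8090), LookupHandler).serve_forever()
```

A service along these lines would keep the .mmdb files (well over the 1MB ConfigMap limit akosiaris mentions) out of the MediaWiki containers entirely, which is presumably why the microservice option is attractive in that thread.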
[12:08:23] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[12:19:03] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) @ssastry we have done some benchmarks, but none of those were parsoid URLs, it would be great if you would provide a couple of par...
[13:45:49] serviceops, SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (Reedy) Open→Stalled
[13:53:36] serviceops, Infrastructure-Foundations, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (jijiki)
[14:09:37] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (jijiki)
[14:13:59] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (cmooney) Thanks Effie. I think as well as the microbursts / drops you observed at the server-side, on the 1G interfaces, performance is probably impacted by on...
[14:29:23] do we still have services running in stretch containers in production? if so they might need a rebuild for https://phabricator.wikimedia.org/T283165#7365637
[14:33:04] <_joe_> moritzm: yes I think so
[14:38:58] serviceops, Infrastructure-Foundations, SRE, netops: TCP retransmissions in eqiad and codfw - https://phabricator.wikimedia.org/T291385 (cmooney) Ok so looking at the results from the two hosts in question I'm not sure we can make any definitive conclusions. Following the switchover back to eqia...
[16:29:29] _joe_ moritzm I will roll out to 5 parsoid servers today, and do canaries tomorrow
[16:29:40] <_joe_> effie: ack
[16:29:51] schedule went a little behind today
[16:31:14] effie: ack, sounds good
[17:50:24] Quick question: how often does LVS health-check run?
[17:53:15] Pchelolo: max-delay: 300 seems to indicate 5 minutes
[17:53:26] ok, thank you mutante!
[17:53:28] np
[17:56:06] Pchelolo: it's configurable per service, but all are set to 300 individually, except labweb is 30
[18:01:57] there are two kinds of health checking, right? there's sending an http request to a healthchecking endpoint, and there's also the 'idlechannel' healthchecking to notice a process that terminated, I think?
[18:05:50] this one is the request to https://localhost/healthz
[18:07:49] sure, just wanted to note in case it was important context
[18:07:50] https://phabricator.wikimedia.org/diffusion/ODCB/browse/1.15/pybal/monitors/idleconnection.py
[18:10:06] then there is also a check_https_lvs_on_port to a discovery.wmnet name. check_https_lvs_on_port does not have special icinga config for the duration. the default value for command_check_interval is -1, which is supposed to mean "as often as possible"
[18:12:21] <_joe_> cdanis and mutante are talking about two different things
[18:12:35] <_joe_> there is the icinga health check that runs every 5 minutes
[18:12:47] <_joe_> and the pybal health checks that run every second or so
[18:13:01] <_joe_> in kubernetes, you also have liveness and readiness probes happening
[18:13:24] <_joe_> (that's why I want to remove all url fetching from pybal checks for k8s services, btw)
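To make the distinction above concrete, here is a rough sketch of what a pybal-style URL health check boils down to: fetch a health endpoint on a short interval, independently of the much slower Icinga checks. This is not pybal's actual implementation (pybal is Twisted-based, and the idle-connection monitor linked above instead watches a long-lived TCP connection to notice a dead process); the URL, interval, timeout and depooling behaviour are illustrative only.

```python
# Simplified illustration of a ProxyFetch-style health check loop.
# Real pybal is asynchronous (Twisted) and depools/repools servers in LVS;
# this only shows the shape of the check.
import ssl
import time
import urllib.request

HEALTH_URL = "https://localhost/healthz"   # endpoint mentioned in the discussion
INTERVAL = 1.0                             # pybal-style: roughly every second
TIMEOUT = 0.5                              # illustrative timeout

def check_once(url: str) -> bool:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False             # localhost cert won't match; illustration only
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        healthy = check_once(HEALTH_URL)
        print(f"{time.strftime('%H:%M:%S')} healthy={healthy}")
        # A real monitor would depool the backend after N consecutive failures.
        time.sleep(INTERVAL)
```

Kubernetes liveness/readiness probes cover the same ground per pod, which is the motivation _joe_ gives above for dropping URL fetching from the pybal checks for k8s services.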
[18:18:28] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle) p:Triage→Medium
[18:19:44] what's the latest on k8s lb, btw? still NodePort?
[19:33:16] _joe_: just in theory, if we add access logging to tls proxy envoy that only logs on HTTP 503/504 - would you think that's a crazy idea?
[19:33:33] and only logs locally into the container
[19:35:58] cause it seems like all node services running in k8s are returning 503 from time to time, for no apparent reason. node doesn't log anything
[19:40:07] I think I've seen this on eventgate as well
[19:42:14] yeah, eventgate is how we got to this
[19:42:56] but it's happening for all services
[19:43:22] and we've poked at it from node side, and from MW side, and the problem is somewhere in the middle
[19:46:06] Pchelolo: have you tried https://phabricator.wikimedia.org/T287288#7265748 ?
[20:19:30] Does anyone have a suggestion on who I could gently nudge about https://phabricator.wikimedia.org/T290357? (wanting a way to `kubectl exec` on the prod k8s clusters)
[20:50:40] bd808: it briefly came up during the serviceops meeting today, a.kosiaris is working on a refactor of users/groups in k8s, partially with yours and dancy's requests in mind, so you should nudge him :)
[21:01:07] <_joe_> Pchelolo: we are already logging 503s/504s from envoy
[21:04:45] _joe_: oh, where do I find them?
[21:08:34] legoktm: ack. that sounds like a slow train to ride, but I will try to find ak.osiaris online tomorrow and see what he thinks it is going to take.
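On the "only log 503/504" idea: envoy's access log does support status-code filters, so the filtering could live in the proxy configuration itself rather than in post-processing. As a quick way to dig the 503/504 entries _joe_ mentions out of a JSON access log, something like the sketch below would work; the log path and the "response_code" field name are assumptions for illustration, not the actual production setup.

```python
# Minimal sketch: pull 503/504 entries out of an envoy JSON access log.
# Assumes a JSON log format where %RESPONSE_CODE% was mapped to "response_code"
# and a made-up log path; both are assumptions, not the production config.
import json
import sys

LOG_PATH = "/var/log/envoy/access.log"     # hypothetical path
WANTED = {503, 504}

def interesting_entries(path: str):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue                    # skip non-JSON lines (startup output etc.)
            if entry.get("response_code") in WANTED:
                yield entry

if __name__ == "__main__":
    for entry in interesting_entries(sys.argv[1] if len(sys.argv) > 1 else LOG_PATH):
        print(json.dumps(entry))
```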