[00:55:28] serviceops, Performance-Team, Release-Engineering-Team (Radar): Create warmup procedure for MediaWiki app servers - https://phabricator.wikimedia.org/T230037 (Krinkle) Open→Resolved a:Krinkle We basically have this, and used for dc-switchovers. If and when we need it elsewhere (e.g. for P...
[01:26:02] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (RLazarus) FWIW, this error message comes from En...
[05:14:19] serviceops, DBA, Toolhub, Patch-For-Review, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) @bd808 please review this patch when you get a chance: https://gerrit.wikimedia.org/r/709877 For now I have granted the Pods IP directly...
[05:38:29] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (MoritzMuehlenhoff) p:Triage→Low
[05:56:18] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Ladsgroup) This might be helpful: {T113114} I th...
[07:25:58] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (ema) >>! In T287983#7257682, @RLazarus wrote: >...
[08:40:14] hello folks
[08:40:34] I started to write down some ideas about the kafka-main topic rebalancing from https://phabricator.wikimedia.org/T225005#7255978 onward
[08:40:54] when you have a moment lemme know your thoughts
[08:41:08] I think that it should be doable incrementally, and the end result could be godo
[08:41:11] *good
[09:05:24] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm)
[09:49:57] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm) I've just reverted the CR adding systemd-sysusers to the dragonfly packages (https://gerrit.wikimedia.org/r/c/operations/debs/drag...
[09:50:06] looks like there's been some disk pressure on kubernetes2004 which lead to some evictions
[09:52:46] hnowlan: can you check what created the disk pressure?
[10:02:18] aiui when there's a diskpressure event k8s cleans up containers that are evicted to save the space so I'm not certain, but it looks like linkrecommendation is using the most disk on two nodes I looked at (~5gb of logs)
[10:19:44] logs in the container or on-disk?
[10:20:21] I mean are those logs created by the docker daemon or written inside the containers?
[10:31:34] they're the stdout logs from the container's application that end up on disk via docker
[10:58:54] looks like we had that a couple of times in the last days https://logstash.wikimedia.org/goto/9a49f087412a13d415c11b2b78c41176
[10:59:09] especially on 2004
[11:03:17] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Jelto)
[11:03:26] serviceops, GitLab, Infrastructure-Foundations: request service IP / DNS name for gitlab-replica, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Jelto) Open→Resolved
[14:42:46] joe: if you get a sec sometime, let me know if my reasoning on T287983 sounds reasonable? we ought to be able to do what ema says there, if so
[14:45:02] rzl: sure, taking a look soon
[15:39:33] rzl: yeah I saw it this morning, didn't have the time to look at options for handling errors in envoy
[15:39:55] I would *prefer* if we were able to serve proper error pages from envoy
[15:40:13] esp in cases like this one when its local backend is saturated
[15:41:00] hm, okay! you wouldn't rather keep it all in one place?
[15:41:14] or, well, as many places as it's already spread out in
[15:55:05] rzl: it depends on a few factors. but I need to take a closer look indeed
[15:57:08] okay cool
[15:57:21] I found the config option in Envoy so it looks like it's easy to do one way or the other, will drop a quick note on the task
[16:00:14] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (RLazarus) @ema That makes sense, thanks for the...
[16:03:13] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm) The test with 73 nodes max shows a pull time of 00:01:46 with standard deviation of 00:00:33 [1] which is pretty close to the numb...
[16:44:34] serviceops, SRE, decommission-hardware, ops-eqiad, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Cmjohnson)
[16:49:41] side note, it turns out envoy sends content-type: plain/text for those hardcoded errors, rather than the standard text/plain -- I'd be a little tempted to send a patch, except that I'm sure by now somebody's relying on it
[16:53:45] I was about to say it seems the perfect foundational bug we could rely on for our error handling
[16:53:56] and you should send a patch to our ats lua instead
[16:54:01] ahaha
[16:54:25] can you deny that would work quite well?
[16:54:42] no it'd work great until they fix the bug someday
[16:54:52] sure one day the latest hire in serviceops will scratch their heads when they do the routine envoy upgrade
[16:54:58] which we can guarantee they'll do, simply by relying on it
[16:54:58] and that breaks the error pages
[16:55:05] ahahaha
[16:55:12] "wait, how did this EVER work"
[16:55:19] that will be a great fireside story for them to tell
[16:55:34] where we'll have the role of the old crazy ducttapers who ruined their lives
[16:55:48] "oh, is THIS why they were all giggling when they asked me to do a 'routine upgrade'"
[16:56:46] nah are you kidding? by then we'll have forgot of this "clever hack"
[16:57:25] that's also why we should document it in an unlinked page on wikitech, so we can later tell the newbie "If you read the wiki...."
[16:57:33] "it's all there"
[17:05:51] perfect
[17:51:30] I love to see this mentoring ;)
[17:53:47] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (dancy)
[17:53:57] serviceops, MW-on-K8s, SRE: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (dancy) Open→Resolved This is fixed now. You can firm by testing with this image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-04-173113-...
[18:49:12] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, and 2 others: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (dancy) Open→Resolved This is done. Now whenever a docker-registry.discovery.wmnet/restricted/medi...
[18:49:18] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (dancy)
[18:51:07] bd808: I'm keeping up with how you and the other old farts mentored me
[18:51:53] I was but a fresh faced child when you arrived, but I probably helped shovel bad ideas your way ;)
[19:10:10] serviceops, Peek, Security-Team, user-sbassett: Disable peek for the Security Team - https://phabricator.wikimedia.org/T284090 (sbassett) Stalled→Resolved
[19:30:13] serviceops, DBA, Toolhub, Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (bd808) a:bd808→Marostegui
[19:50:28] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and profile work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle)
[19:50:37] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle)
[19:51:06] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle)
[19:51:12] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, and 2 others: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (Krinkle)
[19:51:20] serviceops, MW-on-K8s, Performance-Team, SRE, WikimediaDebug: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle) a:dpifke
[20:22:11] serviceops, Analytics, Prod-Kubernetes, SRE, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata)
[20:22:29] serviceops, Analytics, Analytics-Kanban, Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata)
[20:23:40] serviceops, Analytics, Analytics-Kanban, Prod-Kubernetes, and 3 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (Ottomata) I think that will do it. helm template looks good locally. @JMeybohm is it ok that I moved the debug ports to their own Service?...
[20:29:55] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm)
[21:24:26] serviceops, SRE, Traffic, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm)
[22:28:00] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Legoktm) >>! In T287983#7257682, @RLazarus wrote...
[22:52:20] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Legoktm) bawolff and I discussed this a bit more yesterday and think that something is probably going wrong with `nowait:` / nested locks. In theo...
[23:01:26] Does anyone know if the prod k8s cluster supports any version of the CronJob spec? I'm thinking about T276405 and really a CronJob would be easier at this point than all the work for celery with only one job to handle.
[23:07:37] bd808: based on the existence of https://gerrit.wikimedia.org/g/operations/deployment-charts/+/f8ec447b0a72dd9ebbd1747b4a722b0024f2148a/charts/linkrecommendation/templates/cronjob.yaml I'd say yes
[23:08:13] oh sweet. :) A new thing to add to my charts!
[23:09:01] thanks for looking that up legoktm
[23:11:23] serviceops, DBA, Toolhub, Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (bd808) >>! In T271480#7252204, @bd808 wrote: >>>! In T271480#7251102, @Marostegui wrote: >> @bd808 one more question I thought I asked before, but I didn't (sorry!), what...
[23:15:34] serviceops, SRE, Services, Toolhub, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808)
[23:18:59] serviceops, SRE, Services, Toolhub, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808) I have edited the description to remove celery and redis from the initial deployment requirements. There would only be one celery job to run with th...