[05:06:43] serviceops, DBA, Toolhub, Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (Marostegui) Thanks for the investigation! The database is now created; as soon as we sort out the GRANTs code review, we should be ready to go.
[07:18:00] legoktm: hey, is it fine to deploy shellbox constraints to testwikidatawiki?
[07:18:13] gogogo
[07:18:20] ooooh noice
[07:19:18] on my list for tomorrow is to edit the Shellbox dashboard so it also supports looking at shellbox-constraints, I've never done something like that before so feel free to beat me to it
[07:19:44] also if shellbox-constraints goes down, it'll alert in icinga but *won't* page anyone yet
[07:20:19] it's test, it can fully go down. Don't worry
[07:20:27] so when we do want to expand its usage (and consider it stable), we should enable paging
[07:20:58] noted. Is there another k8s service you were looking at for inspiration for the grafana dashboard?
[07:22:27] I just wanted to add a selector/dropdown so you could switch the current dashboard to look at the shellbox-constraints metrics instead
[07:22:55] I actually just copied it from j.oe's one for mwdebug-on-k8s and K.rinkle has edited it since
[07:23:16] grafana is a weak spot in my knowledge
[07:24:06] I can do it, it's the least I can do to help
[07:26:37] joe: running a 73 node test with the most recent restricted image now for your convenience before cleaning up the abused appservers. Shout if you'd like to test something else
[07:27:54] jayme: I think that's ok
[08:01:06] joe: avg 0:01:25 with std. dev. of 0:00:35
[08:01:32] so download speed stays roughly constant as we add more nodes
[08:01:45] I guess we can get to ~150 per supernode easily
[08:02:05] now we have the big issue of swift replication being very slow cross-dc
[08:02:19] so we still might need to call the registry cross-dc
[08:02:35] yeah...supernode will not be an issue (as we're not using the CDN component of it)
[08:02:51] btw
[08:03:07] yesterday I had to enlarge the tmpfs for /var/lib/nginx on the registries
[08:03:14] we should open a task
[08:03:18] ah, for uploading right?
[08:03:28] we do have one...more or less
[08:03:30] let me check
[08:03:46] yes
[08:03:53] we had a layer larger than 1 GB apparently
[08:05:38] we had a discussion about that in https://phabricator.wikimedia.org/T264209#6547152
[08:06:13] I still think offloading that to disk instead of tmpfs is fine, but we haven't tested that yet I guess
[08:06:45] given the small memory footprint of everything else, I think it's ok to just make the tmpfs 2gb
[08:07:11] ...until we hit that limit :)
[08:09:02] so you did that manually on registry2* I guess?
[08:09:07] yes
[08:09:15] I was between meetings at 7 pm :P
[08:11:10] hm, yeah...given the VMs have 4GB of ram, increasing it to 2GB is probably fine (and means we don't have to do testing)
[08:13:29] I'll split out the discussion from the task into a separate one to make it more visible and we can then add the needed hiera keys
[08:15:36] legoktm: https://grafana-rw.wikimedia.org/d/ftVv2pM7z/shellbox-constraint?orgId=1
[08:16:25] can we just have one dashboard that lets you flip between shellboxes? like how the mysql dashboards let you switch between db hosts...
[08:16:37] (idk what the grafana term for that is)
[08:16:53] hmm, sure, I can give it a try
[08:17:21] legoktm: yes, we need to add a variable to the graph, and then use it everywhere in a selector
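For readers who, like legoktm above, don't know the Grafana term: the "variable ... in a selector" is a dashboard template variable. A minimal sketch of the relevant fields, written as YAML for readability (Grafana stores this as JSON in the dashboard model); the metric and label names below are placeholders, not the actual Shellbox metric names:

```yaml
# Sketch of a dashboard template variable backing a "service" dropdown.
# Metric/label names are illustrative placeholders only.
templating:
  list:
    - name: service                 # becomes the dropdown in the dashboard UI
      type: query
      datasource: thanos
      # label_values() is Grafana's Prometheus helper for enumerating label values
      query: label_values(shellbox_http_requests_total, service)
      refresh: 2                    # re-query when the time range changes
      includeAll: false
# Panels then filter on the selection, e.g.:
#   sum(rate(shellbox_http_requests_total{service="$service"}[5m]))
```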
[08:19:05] serviceops, SRE, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (ema) >>! In T287983#7261627, @Legoktm wrote: > I...
[08:20:20] Amir1: I'm editing the main shellbox dashboard
[08:20:26] so don't touch it now :)
[08:24:41] sure
[08:26:13] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (JMeybohm) p: Triage→High
[08:36:45] joe: if you have a minute https://gerrit.wikimedia.org/r/c/operations/puppet/+/710218
[08:37:03] jayme: not now
[08:37:11] np, anytime
[08:44:20] Amir1: can you take a look at https://grafana-rw.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1 ?
[08:44:36] it should be possible to see the data for shellbox-constraints by selecting it from the dropdown
[08:44:52] I can show legoktm how to do that when it's a more appropriate time of his day
[08:45:30] joe: looks good, it gets 5 reqs/s. We haven't deployed anything yet. Is it the health check?
[08:45:50] oh super cool
[08:45:53] yes
[08:45:58] prolly yes
[08:46:59] joe: because reasons, I had 2.5 boba teas today, so I'm fully caffeinated and going to be up for at least another hour
[08:47:36] legoktm: I don't spend my insomnia hours working though :P
[08:47:40] go relax :)
[08:47:40] joe: did you depool registry2004
[08:48:49] 5 reqs/sec for health checks seems a bit excessive :D
[08:49:09] 5 pods
[08:50:07] jayme: yes sorry
[08:50:15] Amir1: so 1 check per sec
[08:50:30] we have the same from pybal + monitoring on appservers
[08:51:05] joe: ack. From the SAL log (https://sal.toolforge.org/log/5Eb4EXsB1jz_IcWuJs-L) I assume we should decrease the size of the on-disk cache pool for nginx as well?
[08:51:47] jayme: no, that was me seeing the root partition almost full after realizing the problem was "no space left on device"
[08:51:58] ah, okay
[08:52:00] I was still figuring out what was wrong
[08:52:10] and then realized /var/lib/nginx was a tmpfs
[08:52:28] got it, thanks. Will re-pool registry2004 after applying the tmpfs patch
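The "tmpfs patch" here, and the "needed hiera keys" mentioned earlier, would presumably make the /var/lib/nginx mount size configurable per host rather than hardcoded. A purely hypothetical sketch of such a hiera override (the key name is made up for illustration, not taken from the puppet tree):

```yaml
# Hypothetical hiera data for the docker-registry hosts; the real key name
# depends on how the registry profile exposes the mount (see T288198).
# Intent: bump the /var/lib/nginx tmpfs to 2G, matching the manual remount,
# instead of relying on the previous fixed size.
profile::docker_registry::nginx_tmpfs_size: 2g
```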
[09:24:48] joe: regarding cross-dc traffic of the registry: I did a test run with the latest image pulling cross-dc (from one registry only, as 2004 was depooled) and the numbers are 0:01:35, std dev 0:00:29 ... redoing now with both registry nodes pooled but I don't think we'll have an issue there
[09:25:06] that's great
[09:26:04] indeed. Also the fact that one registry node is pretty capable of handling that is promising
[09:27:49] jayme: just got a timeout trying to pull the images in codfw, FWIW; I think we need to move to dragonfly there too ASAP
[09:29:40] joe: Yeah. I'll do the cleanup of appservers first. After that I need to add some monitoring and then we can roll out to codfw I think
[09:30:24] it's currently trying to download it to kubernetes2004
[09:33:52] jayme: so let me restrict the pods again to a single server :/
[09:34:03] but tbh I would say it's ok even without more monitoring
[09:34:11] anyways, I'll do that
[09:34:49] I'll do it in parallel. Need to provision a ganeti VM first anyways
[09:36:24] joe: given the metrics from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=dragonfly-supernode1001&var-datasource=thanos&var-cluster=misc&from=now-6h&to=now I'd decrease the memory requirement from 4 to 2gb but keep the number of CPUs at 2 to have some room for more nodes. What do you think?
[09:43:13] can we have more supernodes? I thought that didn't work well
[09:43:22] but otherwise yes, sure
[09:43:50] Apparently the new image fails to download within the timeout on eqiad as well
[09:44:04] I guess because we just have 2 pods?
[09:44:09] so not enough nodes pulling it
[09:45:33] joe: we can just have one active supernode per DC. With >1 supernodes, clients get split between them (which ofc might be okay when we have a lot of them, but currently it's bad)
[09:45:57] one supernode per DC ensures that P2P traffic stays DC-local
[09:46:28] well, >1 per DC as well ofc. But we should not have peers in one DC use a supernode in the other
[09:46:37] actually on eqiad what's failing is helmfile diff
[09:47:03] and I have no idea how to debug that
[09:48:14] I see the tiller pod had 4 restarts
[09:48:23] that smells suspicious
[09:49:53] and ofc as soon as I'm looking at tiller logs, it's not failing
[09:50:17] eheh
[09:50:35] and it crashed again while trying to apply
[09:50:52] the diff worked, but it restarted again, no relevant logs
[09:50:56] I'll look at k8s
[09:50:57] with two registry nodes (cross-dc) the numbers go down a bit again - no significant difference though
[09:51:18] 0:01:31, std dev 0:00:27
[09:51:56] Normal Killing 4m35s (x2 over 8m25s) kubelet, kubernetes1017.eqiad.wmnet Container tiller failed liveness probe, will be restarted
[09:51:59] sigh.
[09:52:09] we might need to give more resources to tiller
[09:52:26] we might need to get rid of it :)
[09:53:14] yeah ok, I have a problem to solve *this morning* :P
[09:53:43] is there a way to give more resources to tiller? I don't know where those are defined
[09:55:03] just looked and it does not have any resource limits applied
[09:55:20] the definition is part of helmfile_namespaces.yaml
[09:58:42] ok
[09:58:58] so I should just investigate how to bump resources there?
[09:59:55] you can just add a resources block to the definition in helmfile_namespaces.yaml I suppose
[10:00:12] if it is actually starving, that might help
[10:01:34] yeah it looks like it fails to respond to the liveness probe, sigh
[10:01:56] it looks like codfw's deployment was left in an unstable state because of these restarts
[10:03:24] hey folks, the kfserving yaml from upstream still uses CRD APIs with v1beta1, which helm marks as ERROR in the version that we have (3.4.1), and of course I cannot deploy. I opened a gh issue upstream to move to v1 (but it requires some changes to the yaml), so in the meantime if I could use helm 3.5.0+ (containing https://github.com/helm/helm/pull/8608) I'd be unblocked. I see that we have our
[10:03:30] nice deb, ok if I open a task to upgrade it to say 3.5.0 or more and send a code change?
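For context on the upstream change being asked for: CustomResourceDefinitions move from apiextensions.k8s.io/v1beta1 to apiextensions.k8s.io/v1, which among other things requires a structural schema under each served version. A heavily trimmed sketch of that migration, with made-up group/kind names rather than kfserving's actual CRDs:

```yaml
# Illustrative only: minimal shape of a CRD after moving to apiextensions v1.
apiVersion: apiextensions.k8s.io/v1          # was: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.example.org
spec:
  group: serving.example.org
  names:
    kind: InferenceService
    plural: inferenceservices
    singular: inferenceservice
  scope: Namespaced
  versions:
    - name: v1beta1
      served: true
      storage: true
      schema:                                # v1 requires a schema per served version;
        openAPIV3Schema:                     # v1beta1 allowed a single spec.validation block
          type: object
          x-kubernetes-preserve-unknown-fields: true
```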
[10:03:47] joe: yeah https://grafana-rw.wikimedia.org/d/hyl18XgMk/jayme-container-details?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mwdebug&var-pod=tiller-6bc98dbc7c-j2bjc
[10:04:11] seems like defaults get applied and the thing is heavily throttling (+ OOMK potentially)
[10:05:12] elukey: be my guest - steps for updating all the things are outlined in https://wikitech.wikimedia.org/wiki/Helm
[10:06:01] jayme: ack thanks, opening a task :)
[10:06:14] elukey: you'd also drive-by fix https://phabricator.wikimedia.org/T274493 :)
[10:06:34] * jayme out for lunc
[10:06:36] +h
[10:08:06] the latest upstream seems to be 3.6.3, maybe it is good to target that
[10:26:30] so most of the shellbox deployment to testwikidatawiki is done, I have to leave for my vaccine, I'll be back and finish it off
[10:27:45] \o/
[12:15:37] elukey: yes, please target the latest upstream release
[12:22:33] serviceops, MW-on-K8s, Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (JMeybohm) Open→Resolved tmpfs resized to 2GB on all registry nodes
[12:35:48] jayme: I am following the guide, but I am wondering what I have to do before the first pushes to gerrit + rebuild + etc.
[12:36:01] is there a way to test helm3 beforehand?
[12:36:30] I can build it locally and test it via minikube
[12:36:31] elukey: I usually build it locally once
[12:37:16] yes. I think that's as good as it gets tbh
[12:38:11] then CI to see if anything explodes and finally kubeflow!
[12:38:27] so we can test it via these ML people always doing weird things
[12:44:17] yeah. I usually test locally, then update the CI images and bug hashar about rebuilding them, then run CI with new helm3, then deploy to deploy*
[13:15:49] finished the build, I tried to test it locally with minikube (lint/template/upgrade)
[13:15:54] looks ok
[13:17:35] I'll take a break and then if nobody opposes I'll start the work to upgrade apt + CI + deploy1002
[13:26:00] \o/
[13:30:42] deployed, afk for a bit
[13:37:31] serviceops, SRE, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) Open→Stalled this is only open due to a single remaining server, the mwmaint servers in codfw. this will be upgraded after we switch D...
[13:37:37] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Dzahn)
[13:54:27] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm) FTR: I also did some tests pulling the images cross-dc (which we do usually because of the current active/passive nature of docker...
[14:50:48] new helm3 uploaded to apt!
[14:50:54] going to file a change for CI
[14:58:40] https://gerrit.wikimedia.org/r/c/integration/config/+/710276 for CI :)
[15:43:56] CI was updated, and helm lint now works fine for my kubeflow changes
[16:06:42] nice
[16:26:01] joe: can I deploy shellbox for 1% of wikidata now? testwikidata seems fine, even xhgui shows the calls
[16:26:54] 1% would be at most 200 reqs/sec
[16:27:38] no, that's 200 reqs/min
[16:40:42] legoktm: if you're around ^
[16:41:31] What's the failure behavior if shellbox-constraints goes down / is overloaded?
[16:43:26] it gives "constraint violation" basically
[16:43:38] which is not nice but acceptable
[16:44:04] it wouldn't ruin anything nor cause user-facing issues
[16:44:40] OK, seems fine to me
[16:45:22] awesome, deploying
[17:44:02] helm3 upgrade completed!
[17:45:07] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (elukey)
[17:46:34] \o/
[17:58:35] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Jelto) I restored the backup of `gitlab1001` to `gitlab2001` using the [restore instructions](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/gitlab-ansible/+/refs/heads/master/RESTORE.md) of S&F. I enhanc...
[18:35:07] legoktm: it was working on wdqs, so failure is clearly tolerable
[18:37:00] legoktm: joe: the cpu usage is adorable. 21% and it's 0.03
[18:37:07] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=constraintsRunCheck&from=now-1h&to=now
[19:49:10] I see the job duration went down by 50%
[20:20:05] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson)
[20:20:48] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) The only thing remaining on most of these is the idrac setup. This will happen tomorrow (Friday 6 Aug)
[20:34:17] serviceops, Peek, Security-Team: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (sbassett)
[20:34:34] serviceops, Peek, Security-Team, user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (sbassett)
[20:35:23] serviceops, Peek, Security-Team, user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (sbassett)
[21:55:34] serviceops, SRE: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy)
[21:57:07] serviceops, SRE, docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy)
[21:58:17] serviceops, SRE, docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy)
[22:34:39] serviceops, SRE, docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (dancy)
[22:38:17] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm)
[22:41:22] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm) Open→Resolved a: Legoktm I'm going to close this as resolved as I believe everyth...
[23:00:22] I started fleshing out https://wikitech.wikimedia.org/wiki/Shellbox
[23:01:29] nice
[23:57:52] serviceops, Shellbox: php-fpm for shellbox slow log error failed to ptrace(ATTACH) - https://phabricator.wikimedia.org/T288315 (Legoktm)
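On that last task: the details are not in this log, but a common reason for php-fpm's slow-request log to fail with ptrace(ATTACH) inside containers is that the default capability set does not allow ptrace, so php-fpm cannot attach to the worker to dump a backtrace. If that turned out to be the cause here, the usual shape of a fix is a capability grant on the container; a sketch under that assumption, not the change actually made for T288315:

```yaml
# Illustrative only -- php-fpm's request_slowlog_timeout handler uses ptrace()
# on the slow worker to capture a backtrace, which needs CAP_SYS_PTRACE inside
# the container. The container name below is a placeholder.
containers:
  - name: shellbox-constraints-php
    securityContext:
      capabilities:
        add:
          - SYS_PTRACE
```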