[05:06:43] <wikibugs>	 10serviceops, 10DBA, 10Toolhub, 10Patch-For-Review: Setup production database for Toolhub - https://phabricator.wikimedia.org/T271480 (10Marostegui) Thanks for the investigation! The database is now created, as soon as we sort the GRANTs code review, we should be ready to go.
[07:18:00] <Amir1>	 legoktm: hey, is it fine to deploy shellbox constraints to testwikidatawiki?
[07:18:13] <legoktm>	 gogogo
[07:18:20] <Amir1>	 ooooh noice
[07:19:18] <legoktm>	 on my list for tomorrow is to edit the Shellbox dashboard so it also supports looking at shellbox-constraints, I've never done something like that before so feel free to beat me to it
[07:19:44] <legoktm>	 also if shellbox-constraints goes down, it'll alert in icinga but *won't* page anyone yet
[07:20:19] <Amir1>	 it's test, it can fully go down. Don't worry
[07:20:27] <legoktm>	 so when we do want to expand its usage (and consider it stable), we should enable paging
[07:20:58] <Amir1>	 noted. Is there another k8s service you were looking for inspiration for the grafana dashboard?
[07:22:27] <legoktm>	 I just wanted to add a selector/dropdown so you could switch the current dashboard to look at the shellbox-constraints metrics instead
[07:22:55] <legoktm>	 I actually just copied it from j.oe's one for mwdebug-on-k8s and K.rinkle has edited since
[07:23:16] <legoktm>	 grafana is a weak spot in my knowledge
[07:24:06] <Amir1>	 I can do it, the least to help 
[07:26:37] <jayme>	 joe: running a 73 node test with the most recent restriced image now for your convenience before cleaning up the abused appservers. Shout if you'd like to test something else
[07:27:54] <joe>	 jayme: i think that's ok
[08:01:06] <jayme>	 joe: avg 0:01:25 with std. dev. of 0:00:35
[08:01:32] <joe>	 so speed of download is overall constant adding more nodes
[08:01:45] <joe>	 I guess we can get to ~ 150 per supernode easily
[08:02:05] <joe>	 now we have the big issue of swift replication being very slow cross-dc
[08:02:19] <joe>	 so we still might need to call the registry cross-dc
[08:02:35] <jayme>	 yeah...supernode will not be an issue (as we're not using the CDN component of it)
[08:02:51] <joe>	 btw
[08:03:07] <joe>	 yesterday I had to enlarge the tmpfs for /var/lib/nginx on the registries
[08:03:14] <joe>	 we should open a task
[08:03:18] <jayme>	 ah, for upoading right?
[08:03:28] <jayme>	 we do have one...more or less
[08:03:30] <jayme>	 let me check
[08:03:46] <joe>	 yes
[08:03:53] <joe>	 we had a layer larger than 1 GB apparently
[08:05:38] <jayme>	 we had a discussion about that in https://phabricator.wikimedia.org/T264209#6547152
[08:06:13] <jayme>	 I still think offloading that to disk instead of tmpfs is fine, but we have not tested that by now I guess
[08:06:45] <joe>	 given the small memory footprint of everything else, I think it's ok to just make the tmpfs 2gb
[08:07:11] <jayme>	 ...until we hit that limit :)
[08:09:02] <jayme>	 so you did that manually on registry2* I guess?
[08:09:07] <joe>	 yes
[08:09:15] <joe>	 I was between meetings at 7 pm :P
[08:11:10] <jayme>	 hm, yeah...given the VMs have 4GB of ram increasing it to 2GB probably fine (and means we don't have to do testing)
[08:13:29] <jayme>	 I'll split out the discussion from the task into a seperate one to make it more visible and we can then add the needed hiera keys
[08:15:36] <Amir1>	 legoktm: https://grafana-rw.wikimedia.org/d/ftVv2pM7z/shellbox-constraint?orgId=1
[08:16:25] <legoktm>	 can we just have one dashboard that lets you flip between shellboxes? like how the mysql dashboards let you switch between db hosts...
[08:16:37] <legoktm>	 (idk what the grafana term for that is)
[08:16:53] <Amir1>	 hmm, sure, I can give it a try
[08:17:21] <joe>	 legoktm: yes, we need to add a variable to the graph, and then use it everywhere in a selector
[08:19:05] <wikibugs>	 10serviceops, 10SRE, 10Traffic, 10envoy, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10ema) >>! In T287983#7261627, @Legoktm wrote: > I...
[08:20:20] <joe>	 Amir1: I'm editing the main shellbox dashboard
[08:20:26] <joe>	 so don't touch it now :)
[08:24:41] <Amir1>	 sure
[08:26:13] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (10JMeybohm) p:05Triage→03High
[08:36:45] <jayme>	 joe: if you have a minute https://gerrit.wikimedia.org/r/c/operations/puppet/+/710218
[08:37:03] <joe>	 jayme: not now
[08:37:11] <jayme>	 np, anytime
[08:44:20] <joe>	 Amir1: can you take a look at https://grafana-rw.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1 ?
[08:44:36] <joe>	 it shoudl be possible to see the data for shellbox-constraints by selecting it from the dropdown
[08:44:52] <joe>	 I can show legoktm how to do that when it's a more appropriate time of his day
[08:45:30] <Amir1>	 joe: looks good, it gets 5 reqs/s. We haven't deployed anything yet. Is it health check?
[08:45:50] <legoktm>	 oh super cool
[08:45:53] <legoktm>	 yes
[08:45:58] <joe>	 prolly yes
[08:46:59] <legoktm>	 joe: beacuse reasons, I had 2.5 boba teas today, so I'm fully caffinated and going to be up for at least another hour
[08:47:36] <joe>	 legoktm: I don't spend my insomnia hours working though :P
[08:47:40] <joe>	 go relax :)
[08:47:40] <jayme>	 joe: did you depool registry2004
[08:48:49] <Amir1>	 5 reqs/sec for health check seems a bit excessive :D 
[08:49:09] <Amir1>	 5 pods
[08:50:07] <joe>	 jayme: yes sorry
[08:50:15] <joe>	 Amir1: so 1 check per sec
[08:50:30] <joe>	 we have the same from pybal + monitoring on appservers
[08:51:05] <jayme>	 joe: ack. From SAL log (https://sal.toolforge.org/log/5Eb4EXsB1jz_IcWuJs-L) I assume we should decrease the size of the on-disk cache pool for nginx as well?
[08:51:47] <joe>	 jayme: no, that was me seeing the root partition almost full after realizing the problem was "no space left on device"
[08:51:58] <jayme>	 ah, okay
[08:52:00] <joe>	 I was still figuring out what was wrong
[08:52:10] <joe>	 and then realized /var/lib/nignx was a tmpfs
[08:52:28] <jayme>	 got it, thanks. Will re-pool registry2004 after applying the tmpfs patch
[09:24:48] <jayme>	 joe: regarding cross-dc traffic of the registry: I did a test run with the latest image pulling cross-dc (from one registry only, as 2004 was depooled) and the numbers are 0:01:35, std dev 0:00:29 ... redoing now with both registry nodes pooled but I don't think we'll have an issue there
[09:25:06] <joe>	 that's great
[09:26:04] <jayme>	 indeed. Also the fact that one registry node is pretty capable of handling that is promising
[09:27:49] <joe>	 jayme: just got a timeout trying to pull the images in codfw, FWIW; I think we need to move to dragonfly there too ASAP
[09:29:40] <jayme>	 joe: Yeah. I'll do the cleanup of appservers first. After that I need to add some monitoring and then we can roll out to codfw I think
[09:30:24] <joe>	 it's currently trying to download it to kubernetes2004
[09:33:52] <joe>	 jayme: so let me restrict the pods again to a single server :/
[09:34:03] <joe>	 but tbh I would say it's ok even without more monitoring
[09:34:11] <joe>	 anyways, I'll do that
[09:34:49] <jayme>	 I'll do in parallel. Need to provision a genati vm first anyways
[09:36:24] <jayme>	 joe: given the metrics from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=dragonfly-supernode1001&var-datasource=thanos&var-cluster=misc&from=now-6h&to=now I'd decrease the memory requirement from 4 to 2gb but keep the number of CPUs at 2 to have some room for more nodes. What do you think?
[09:43:13] <joe>	 can we have more supernodes? I thought that didn't work well
[09:43:22] <joe>	 but otherwise yes, sure
[09:43:50] <joe>	 Apparently the new image fails to download within the timeout on eqiad as well
[09:44:04] <joe>	 I guess because we just have 2 pods?
[09:44:09] <joe>	 so not enough nodes pulling it
[09:45:33] <jayme>	 joe: we can just have one active supernode per DC. With >1 supernodes, clients get split between them (which ofc. might be okay when we have a lots of them, but currently it's bad)
[09:45:57] <jayme>	 one supernode per DC ensures that P2P traffic stays DC local
[09:46:28] <jayme>	 well > 1 per DC as well ofc. But we should not have peers in one DC use a supernode in the other
[09:46:37] <joe>	 actually on eqiad what's failing is helmfile diff
[09:47:03] <joe>	 and I have no idea how to debug that
[09:48:14] <joe>	 I see the tiller pod had 4 restarts
[09:48:23] <joe>	 that smells suspicious
[09:49:53] <joe>	 and ofc as soon as I'm looking at tiller logs, it's not failing
[09:50:17] <jayme>	 eheh
[09:50:35] <joe>	 and it crashed again while trying to apply
[09:50:52] <joe>	 the diff worked, but it restarted again, no relevant logs
[09:50:56] <joe>	 I'll look at k8s
[09:50:57] <jayme>	 with two registry nodes (cross-dc) numbers go down a bit again - not on significant difference
[09:51:18] <jayme>	 0:01:31, std dev 0:00.27
[09:51:56] <joe>	   Normal   Killing    4m35s (x2 over 8m25s)  kubelet, kubernetes1017.eqiad.wmnet  Container tiller failed liveness probe, will be restarted
[09:51:59] <joe>	 sigh.
[09:52:09] <joe>	 we might need to give more resources to tiller
[09:52:26] <jayme>	 we might need to get rid of it :)
[09:53:14] <joe>	 yeah ok, I have a problem to solve *this morning* :P
[09:53:43] <joe>	 is there a way to give more resources to tiller? I don't know where those are defined
[09:55:03] <jayme>	 just looked and it does not have any resource limits applied
[09:55:20] <jayme>	 the definition is part of helmfile_namespaces.yaml
[09:58:42] <joe>	 ok
[09:58:58] <joe>	 so I should just investigate how to bump resources there?
[09:59:55] <jayme>	 you can just add a resources block to the definition in helmfile_namespaces.yaml I suppose
[10:00:12] <jayme>	 if it is actually starving, that might help
[10:01:34] <joe>	 yeah it looks like it fails to respond to the liveness probe, sigh
[10:01:56] <joe>	 it looks like codfw's deployment was left in an unstable state because of these restarts
[10:03:24] <elukey>	 hey folks, the kfserving yaml from upstream still uses CRD apis with v1beta1, that helm marks as ERROR in the version that we have (3.4.1) and of course I cannot deploy. I opened a gh issue to upstream to move to v1 (but it requires some change to the yaml), so in the meantime if could use helm 3.5.0+ (containing https://github.com/helm/helm/pull/8608) I'd be unblocked. I see that we are our 
[10:03:30] <elukey>	 nice deb, ok if I open a task to upgrade it to say 3.5.0 or more and send a code change?
[10:03:47] <jayme>	 joe: yeah https://grafana-rw.wikimedia.org/d/hyl18XgMk/jayme-container-details?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=mwdebug&var-pod=tiller-6bc98dbc7c-j2bjc
[10:04:11] <jayme>	 seems like defaults get applied and the thing is heavily throttling (+ OOMK potentially)
[10:05:12] <jayme>	 elukey: be my guest - steps on updating all the things are outlined in https://wikitech.wikimedia.org/wiki/Helm
[10:06:01] <elukey>	 jayme: ack thanks, opening a task :)
[10:06:14] <jayme>	 elukey: you'd also drive-by fix https://phabricator.wikimedia.org/T274493 :)
[10:06:34] * jayme out for lunc
[10:06:36] <jayme>	 +h
[10:08:06] <elukey>	 the last upstream seems to be 3.6.3, maybe it is good to target it
[10:26:30] <Amir1>	 so most of shellbox deployment to testwikidatawiki is done, I have to leave for vaccine, I'll be back and finish it off
[10:27:45] <joe>	 \o/
[12:15:37] <jayme>	 elukey: yes, please target latest upstream release
[12:22:33] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Release Pipeline: Pushes to docker-registry fail for images with compressed layers of size >1GB - https://phabricator.wikimedia.org/T288198 (10JMeybohm) 05Open→03Resolved tmpfs resized to 2GB on all registry nodes
[12:35:48] <elukey>	 jayme: I am following the guide, but I am wondering what do I have to do before the first pushes to gerrit + rebuild + etc..
[12:36:01] <elukey>	 is there a way to test helm3 before hand?
[12:36:30] <elukey>	 I can build it locally and test it via minikube
[12:36:31] <jayme>	 elukey: I use to build it locally once
[12:37:16] <jayme>	 yes. I think that's as good as it gets tbh
[12:38:11] <elukey>	 then CI to see if anything explodes and finally kubeflow! 
[12:38:27] <elukey>	 so we can test it via these ML people always doing weird things
[12:44:17] <jayme>	 yeah. I usually test locally, the update the CI images and bug hashar about rebuilding them, then run CI with new helm3, then deploy to deploy*
[13:15:49] <elukey>	 finished the build, I tried to test it locally with minikube (lint/template/upgrade)
[13:15:54] <elukey>	 looks ok
[13:17:35] <elukey>	 I'll take a break and then if nobody opposes I'll start the work to upgrade apt + CI + deploy1002
[13:26:00] <jayme>	 \o/
[13:30:42] <Amir1>	 deployed, afk for a bit
[13:37:31] <wikibugs>	 10serviceops, 10SRE, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) 05Open→03Stalled this is only open due to a single remaining server, the mwmaint servers in codfw. this will be upgraded after we switch D...
[13:37:37] <wikibugs>	 10serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (10Dzahn)
[13:54:27] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Release Pipeline, 10Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (10JMeybohm) FTR: I also did some tests pulling the images cross-dc (which we do usually because of the current active/passive nature of docker...
[14:50:48] <elukey>	 new helm3 uploaded to apt!
[14:50:54] <elukey>	 going to file a change for CI
[14:58:40] <elukey>	 https://gerrit.wikimedia.org/r/c/integration/config/+/710276 for CI :)
[15:43:56] <elukey>	 CI was updated, and helm lint now works fine for my kubeflow changes
[16:06:42] <joe>	 nicde
[16:26:01] <Amir1>	 joe: can I deploy shellbox for 1% of wikidata now? testwikidata seems fine, even xhgui shows that it calls 
[16:26:54] <Amir1>	 1% would be at most 200 reqs/sec
[16:27:38] <Amir1>	 no, that's 200 reqs/min
[16:40:42] <Amir1>	 legoktm: if you're around ^
[16:41:31] <legoktm>	 What's the failure behavior if shellbox-constraints goes down / is overloaded?
[16:43:26] <Amir1>	 it gives "constraint violation" basically
[16:43:38] <Amir1>	 which is not nice but acceptable
[16:44:04] <Amir1>	 it wouldn't ruin anything nor cause user facing issues
[16:44:40] <legoktm>	 OK, seems fine to me
[16:45:22] <Amir1>	 awesome, deploying
[17:44:02] <elukey>	 helm3 upgrade completed!
[17:45:07] <wikibugs>	 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey)
[17:46:34] <rzl>	 \o/
[17:58:35] <wikibugs>	 10serviceops, 10GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (10Jelto) I restored the backup of `gitlab1001` to `gitlab2001` using the [restore instructions](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/gitlab-ansible/+/refs/heads/master/RESTORE.md) of S&F.  I enhanc...
[18:35:07] <joe>	 legoktm: it was working on wdqs, so failure is clearly tolerable
[18:37:00] <Amir1>	 legoktm: joe the cpu usage is adorable. 21% and it's 0.03
[18:37:07] <Amir1>	 https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-job=constraintsRunCheck&from=now-1h&to=now
[19:49:10] <joe>	 I see the jobs duration went down by 50%
[20:20:05] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson)
[20:20:48] <wikibugs>	 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Cmjohnson) The only remaining on most of these is the idrac setup, This will happen tomorrow (Friday 6 AUG)
[20:34:17] <wikibugs>	 10serviceops, 10Peek, 10Security-Team: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10sbassett)
[20:34:34] <wikibugs>	 10serviceops, 10Peek, 10Security-Team, 10user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10sbassett)
[20:35:23] <wikibugs>	 10serviceops, 10Peek, 10Security-Team, 10user-sbassett: Decommission peek2001 VM - https://phabricator.wikimedia.org/T288290 (10sbassett)
[21:55:34] <wikibugs>	 10serviceops, 10SRE: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (10dancy)
[21:57:07] <wikibugs>	 10serviceops, 10SRE, 10docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (10dancy)
[21:58:17] <wikibugs>	 10serviceops, 10SRE, 10docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (10dancy)
[22:34:39] <wikibugs>	 10serviceops, 10SRE, 10docker-pkg: Add docker-pkg init subcommand - https://phabricator.wikimedia.org/T288302 (10dancy)
[22:38:17] <wikibugs>	 10serviceops, 10SRE, 10Services, 10Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm)
[22:41:22] <wikibugs>	 10serviceops, 10SRE, 10Services, 10Wikibase-Quality-Constraints, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) 05Open→03Resolved a:03Legoktm I'm going to close this as resolved as I believe everyth...
[23:00:22] <legoktm>	 I started fleshing out https://wikitech.wikimedia.org/wiki/Shellbox
[23:01:29] <rzl>	 nice
[23:57:52] <wikibugs>	 10serviceops, 10Shellbox: php-fpm for shellbox slow log error failed to ptrace(ATTACH) - https://phabricator.wikimedia.org/T288315 (10Legoktm)