[00:07:13] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[00:25:41] 10serviceops, 10Patch-For-Review: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus) 05Open→03Resolved >>! In T323707#8419423, @Joe wrote: > Maybe if the page we're trying to fetch is that cumbersome, we should switch to a different, light...
[00:29:33] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye completed: -...
[00:33:30] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[00:34:31] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) 05Open→03Resolved This is done
[07:18:57] 10serviceops, 10Patch-For-Review: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10Joe) 05Open→03In progress
[07:19:03] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe)
[08:03:32] 10serviceops, 10Patch-For-Review: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f3f4f962-02d1-4ce5-ba6c-e8c63e34958f) set by oblivian@cumin1001 for 3:00:00 on 42 host(s) and their services with reason: Appservers `...
[08:59:35] hi -- FYI I'm planning to fail over to graphite1005 tomorrow, mw-config change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/861361 and I'll grab a deployment window after the UTC morning backport, anything else I should know?
[09:07:04] <_joe_> godog: just to check, did you verify we're allowing connections to graphite1005 from k8s?
[09:07:16] <_joe_> we might need to add it to our default network policy I guess
[09:08:27] _joe_: I did, you were even on the review :P https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/859575
[09:08:37] <_joe_> ahah sorry
[09:09:19] <_joe_> 1 week ago, yeah my brain has already evicted that cache
[09:09:57] fair enough, anything I should do to verify the change or once deployed that's it?
[09:11:05] <_joe_> that's it
[09:11:18] <_joe_> we don't really need to verify it I would say
[09:11:43] <_joe_> we can get into a mw pod and verify we can telnet to graphite1005
[09:12:13] SGTM, happy to do that (== copy/paste commands!)
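Before the debugging walkthrough that follows, one way to double-check the network-policy side of godog's deployment-charts change. This is a hedged sketch only: the policy name is a placeholder, and the real egress rules list IPs and ports rather than the graphite hostname.

    # On the deployment server, point kubectl at the mw-debug release in eqiad
    kube_env mw-debug eqiad
    # List the NetworkPolicy objects in the namespace and inspect the egress rules,
    # looking for graphite1005's IP and the statsd/carbon ports added by the change
    kubectl get networkpolicy
    kubectl describe networkpolicy <policy-name>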
[09:12:36] in other words: I have no idea what I'm doing
[09:12:42] <_joe_> ahah don't worry, I can do that, but also I'm happy to guide you through it
[09:13:00] <_joe_> if you want to learn how to debug stuff on k8s, this is a good start :P
[09:13:15] yeah might as well try, if you have five minutes for sure
[09:13:59] if not and you'd rather just do it that's fine too
[09:14:57] <_joe_> so, the process is, in a nutshell:
[09:15:03] <_joe_> * find a pod you want to debug
[09:15:09] <_joe_> * ssh to the node where it's running
[09:15:26] <_joe_> * find the PID of the container you want to inspect on that host
[09:15:51] <_joe_> * use nsenter -t -n to execute commands in the network namespace of that container
[09:16:51] <_joe_> so step 1: ssh to the deployment server, kube_env mw-debug eqiad
[09:17:05] <_joe_> to get the config for the service mw-debug in the eqiad cluster
[09:17:17] <_joe_> kubectl get pods gives you the list of pods there
[09:17:23] thank you, yeah I was about to say, I got to https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#Managing_pods,_jobs_and_cronjobs re: kube_env
[09:18:33] <_joe_> kubectl describe pod mediawiki-pinkunicorn-758f67f5c4-bvhgf | grep ^Node: will tell you where this pod runs
[09:18:51] <_joe_> you also want to write down the docker id of a container from kubectl describe
[09:19:34] <_joe_> so now I picked 5f4842f85887c7b97b47d49464fb747b1e1dafdadb0de34ba2cb474fbd04674e, which is the httpd container
[09:20:11] *nod* makes sense
[09:20:14] <_joe_> on the node I found, kubernetes1019, I run sudo docker top to find the PID on the host
[09:20:29] <_joe_> that pid I'll use in nsenter
[09:22:03] ok I think I got it! thank you
[09:22:49] ok of course this is udp/statsd so "who knows"
[09:23:05] <_joe_> nc is on the server
[09:24:46] yeah but it isn't like I'm getting stuff back if the "connection" isn't working
[09:26:14] that's ok though I'll check graphite1005
[09:28:40] but yeah works as expected, we're good
[09:28:48] _joe_: thank you for the hand holding
[09:29:21] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10WMDE-Fisch)
[09:30:17] <_joe_> godog: tbh I should've worked more on https://github.com/lavagetto/k8sh which automates most of this stuff
[09:30:37] <_joe_> when I feel like it I'll spend another weekend on it
[09:30:57] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10WMDE-Fisch)
[09:31:06] heheh nice
[09:38:14] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10WMDE-Fisch)
[09:41:04] Can we close this https://phabricator.wikimedia.org/T276994 or do we think there's still things to do to replicate standard mwdebug?
[09:53:36] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) I've prepared the change. Tell me when you prod on meta and I'll merge it.
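Putting _joe_'s 09:14-09:20 walkthrough together into one copy/paste-able sequence. This is a hedged sketch: the statsd port (8125/udp) and the graphite FQDN are assumptions, not taken from the chat.

    # On the deployment server: point kubectl at mw-debug in eqiad and pick a pod
    kube_env mw-debug eqiad
    kubectl get pods
    kubectl describe pod mediawiki-pinkunicorn-758f67f5c4-bvhgf | grep ^Node:
    # ...and note one of its container IDs from the same describe output.

    # On that node (kubernetes1019 in this case): map the container ID to its PID on the host
    sudo docker top 5f4842f85887c7b97b47d49464fb747b1e1dafdadb0de34ba2cb474fbd04674e

    # Enter only that PID's network namespace and send a test statsd metric.
    # Being UDP, you get no reply either way, which is why godog confirms receipt
    # on the graphite1005 side above.
    sudo nsenter -t <PID> -n -- nc -u -w1 graphite1005.eqiad.wmnet 8125 <<< 'test.k8s.connectivity:1|c'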
[10:32:31] <_joe_> claime: "engineers can deploy experimental code without impeding others"
[10:32:41] <_joe_> that's still needed :)
[10:33:52] ah, right
[10:54:17] 10serviceops, 10Maps, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban), 10User-jijiki: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (10hnowlan)
[11:03:45] <_joe_> hnowlan: I think we'll have to roll-restart changeprop
[11:04:04] <_joe_> because I am adding new jobrunners and removing old ones
[11:04:08] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10JMeybohm) > However, according to graphs memory usage is looking pretty meager - something is missing. Keep in m...
[11:06:31] _joe_: ack - when?
[11:07:21] <_joe_> hnowlan: I'd wait until the afternoon to see if things normalize naturally
[11:08:17] cool
[11:49:31] 10serviceops: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10Joe) 05In progress→03Resolved
[11:49:38] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe)
[11:51:24] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) @RLazarus you can proceed with the decommissioning steps whenever you're ready. The servers are still in rotation as of now, and will need to be depooled first. I m...
[12:07:42] hiccup with thumbor - firejail would previously at least limit memory use when doing something like an STL conversion using xvfb+3d2png, but on k8s the subprocess just tries to use a lot of memory and gets OOM-killed by the kernel on the kube node. if you give a worker a memory limit of 2GB it works, but that's not realistic and means one worker per pod
[12:13:39] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Priority Backlog 📥): mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes - https://phabricator.wikimedia.org/T324003 (10Joe)
[12:14:55] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Priority Backlog 📥): mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes - https://phabricator.wikimedia.org/T324003 (10Joe) p:05Triage→03Medium
[13:14:16] <_joe_> hnowlan: so you want to limit the memory used by a subprocess
[13:14:22] <_joe_> ugh that is indeed tricky
[13:14:46] <_joe_> jayme, akosiaris ^^ any ideas?
[13:23:50] I'm not sure I understand why this did not happen in firejail. Wouldn't the process have been killed there as well?
[13:30:25] I'm not hugely familiar with firejail myself but it doesn't seem so
[13:31:01] if we could, raising per-pod limits to allow for spikes would be the path of least work
[14:00:17] hm. maybe some application is aware of the limit in the firejail context but not in k8s, and just tries to allocate a bunch of memory because it does not see the limit
[14:00:31] although I had assumed firejail would enforce that via cgroups as well
[14:11:18] <_joe_> jayme: maybe the user we run as in k8s doesn't have the ability to find memory limits properly, or something like that
[14:12:45] maybe...or they read from /proc/meminfo or vmstat instead of cgroups memory.stat
[14:13:17] does firejail maybe provide correct values in meminfo/vmstat?
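A minimal sketch of the mismatch jayme is hypothesising: inside a container, /proc/meminfo still reports the node's total memory, while the enforced limit only shows up in the cgroup files. Which of the two cgroup paths applies depends on the node's cgroup version.

    # Run from inside the container/pod:
    grep MemTotal /proc/meminfo                        # reports the whole node's RAM, not the pod limit
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # cgroup v1: the limit the kernel will OOM-kill at
    cat /sys/fs/cgroup/memory.max                      # cgroup v2 equivalent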
[14:13:34] <_joe_> I don't think so
[14:13:54] <_joe_> we need to run the process with strace at the very least
[14:14:07] yes
[14:14:11] <_joe_> hnowlan: I'd start looking at the syscalls this thing does
[14:15:30] sigh, sounds reasonable
[14:16:39] the kill in k8s is a legit cgroup kill, seems perfectly reasonable: "[Mon Nov 28 17:45:27 2022] memory+swap: usage 921600kB, limit 921600kB, failcnt 6756"
[14:17:07] there's also a chance that there's a hugely increased memory use in xvfb in buster for some reason
[14:23:47] <_joe_> so yeah I'd first test with firejail in buster heh :)
[14:24:08] oh ffs
[14:24:23] so thumbor doesn't have a memory limit of 900MB, imagemagick does and I just completely misread the variables
[14:24:54] we specify a MemoryLimit in the systemd service based on the memory available on the host which effectively amounts to *9GB*
[14:25:30] interestingly on metal we still have an occasional kill for gifsicle exceeding that but that's not relevant here
[14:26:11] so ehhh... might need to limit the number of instances per pod, or make some exceptions for limits maybe
[14:26:42] I had the 3d2png command succeed in a pod of a single instance running with a 2GB memory limit but that new information seems to indicate that even that would potentially be constrained
[14:27:53] 9GB on the old instances, 18GB on the newer ones (because we do `(@memorysize_mb.to_i * 0.15).round`)
[14:27:59] ouch
[14:28:30] this is per thumbor instance then, right?
[14:28:51] so per thumbor container in the pod if translated to k8s
[14:29:20] yeah
[14:29:23] <_joe_> I think we might need to revise how we run thumbor then
[14:29:39] <_joe_> it makes sense if you can share that space between runners I guess
[14:29:45] I mean you practically *can* set limits that high (with reasonable requests) but you won't be guaranteed to have that memory available
[14:30:33] the space is practically shared joe
[14:30:46] <_joe_> ah so you mean just raise limits not requests
[14:30:49] <_joe_> yeah...
[14:31:06] might still fail from time to time then ofc.
[14:31:13] 10serviceops, 10Release-Engineering-Team, 10Scap: scap fails to sync some new hosts - https://phabricator.wikimedia.org/T324023 (10Jdforrester-WMF) Probably not a scap bug but a new server config issue, then.
[14:31:44] it does kinda make me want to jump ahead to have a single instance per pod with no haproxy though
[14:32:24] that would ofc. make the distribution way easier indeed
[14:32:42] 10serviceops, 10Release-Engineering-Team, 10Scap: scap fails to sync some new hosts - https://phabricator.wikimedia.org/T324023 (10Joe) Please note these are not new hosts, just one of the servers listed is. So I would say either a scap bug or a network issue.
[14:32:56] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Joe)
[14:33:48] hnowlan: but IIRC single instance per pod without haproxy wasn't an option because of the blocking nature of thumbor, right?
[14:36:03] could we haproxy over N single instances instead?
[14:38:16] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (10hnowlan) 05Open→03Resolved
[14:38:20] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[14:45:30] jayme: yeah not without some more aggressive health checking I guess.
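For reference, the MemoryLimit arithmetic hnowlan quotes at 14:27 (`(@memorysize_mb.to_i * 0.15).round`, i.e. 15% of the host's RAM) works out roughly as follows; the host sizes below are illustrative, not measured.

    # ~64 GB bare-metal host -> roughly the "9GB" limit on the old instances
    memorysize_mb=64000
    echo $(( memorysize_mb * 15 / 100 ))    # 9600 MB
    # ~128 GB host -> roughly the "18GB" limit on the newer ones
    memorysize_mb=128000
    echo $(( memorysize_mb * 15 / 100 ))    # 19200 MB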
[14:46:21] but if we keep haproxy? and just haproxy over multiple pods instead of instances in the pod?
[14:50:03] as in have the haproxy running outside of k8s?
[14:50:32] could be still inside...but we would be having a deployment of haproxy and a deployment of thumbor
[14:59:35] ahh so 1:1, yeah, that's fine and doable
[15:01:05] not sure I understand. I was thinking a deployment of N haproxy pods fronting a deployment of the same N thumbor pods
[15:02:08] this is still confusing. let me try again: I was thinking a deployment of N haproxy pods fronting a deployment of the same set of thumbor pods
[15:04:34] aha. So haproxy instances are in their own pod, and each thumbor instance is a pod, in the same namespace?
[15:04:43] yes
[15:05:46] tbh I do not remember if that was ruled out initially
[15:12:21] It wasn't afaik :)
[15:19:24] 10serviceops, 10Infrastructure-Foundations: Load IP ranges in reverse-proxy.php from Netbox/Puppet network module - https://phabricator.wikimedia.org/T324020 (10Marostegui) I think this might be more specific for these two teams.
[15:21:11] if haproxy can discover backends via DNS, this should be doable by creating a headless service for the thumbor pods (by setting clusterIP: None). DNS will then return the IPs of the actual pods backing that service instead of a service IP
[15:22:02] ofc. that needs to be re-evaluated often in case a pod dies etc.
[16:36:13] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) When deploying the updated calico chart to my test cluster I realized that the fixed dependency to the CRD chart does not mean tha...
[16:38:42] 10serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Upgrade jwt-authorizer on all registry hosts - https://phabricator.wikimedia.org/T324037 (10dduvall)
[16:46:35] <_joe_> jayme: I'd rather have a couple haproxies fronting like 10 thumbor pods
[16:46:48] <_joe_> haproxy can scale 1Gx more than thumbor
[16:47:01] <_joe_> not sure that would do what you want though
[16:47:07] Yeah I was wondering why the 1-1 pattern was needed
[16:47:12] that's what I said, no?
[16:47:13] <_joe_> it's not
[16:47:25] <_joe_> jayme: I understood the exact opposite
[16:47:29] And how haproxy alleviates thumbor calls being blocking
[16:47:53] Yeah, same as _joe_, I understood 1 haproxy for 1 thumbor instance
[16:47:53] "I was thinking a deployment of N haproxy pods fronting a deployment of the same set of thumbor pods"
[16:48:23] N haproxies fronting all thumbor pods
[16:48:28] <_joe_> jayme: ok, no, I was saying you need 2 haproxies for N (with N > 100) thumbors
[16:48:46] <_joe_> "the same set" in this context seems to mean "the same number as"
[16:48:59] <_joe_> but you wanted to mean "the same set as previously sized"
[16:49:01] <_joe_> right?
[16:50:08] ok. then I was not able to communicate that clearly. I wanted to say N (N being > 2) haproxy instances should front all of the replicas we have for thumbor :)
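A hedged sketch of the headless Service jayme describes at 15:21: with clusterIP: None there is no virtual service IP, so a DNS lookup of the service name returns the individual thumbor pod IPs, which haproxy-style DNS service discovery could consume. Names, namespace, and port here are illustrative, not the real chart values.

    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: thumbor-headless
    spec:
      clusterIP: None        # headless: DNS returns the pod IPs directly
      selector:
        app: thumbor         # must match the thumbor pods' labels
      ports:
        - name: http
          port: 8800         # illustrative thumbor listener port
    EOF
    # From inside the cluster, the A records now point at the backing pods
    # (and need re-resolving as pods come and go, per jayme's caveat above):
    #   dig +short thumbor-headless.<namespace>.svc.cluster.local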
[16:50:37] <_joe_> so an haproxy ingress :P
[16:51:05] I wanted to make sure that there is no relationship between a haproxy instance and a subset of thumbors anymore in that case
[16:51:08] yes, exactly
[16:51:23] <_joe_> i mean it could make sense
[16:51:37] a pretty dedicated ingress but yea
[16:51:49] <_joe_> https://haproxy-ingress.github.io/ :P
[16:51:59] I wonder why we did not discuss that option initially
[16:52:06] yes yes, I know
[16:52:21] <_joe_> i wonder if there's an option in istio to do the same as haproxy does for thumbor
[16:52:23] we could also absolutely do that with istio ingress...just to be said
[16:52:45] <_joe_> basically we need to ensure we have 1 connection per backend pod
[16:53:26] <_joe_> thumbor is critical but not high volume, but I fear there's all kinds of tunables that are not there for thumbor in our istio
[16:53:31] What does haproxy do for thumbor, actually? Hold the backend-side connection while thumbor completes the job so the caller can move on?
[16:53:51] <_joe_> and avoid queueing requests on a busy thumbor worker
[16:54:30] That screams "pull-based workflow needed" (I know it's not the discussion but still)
[16:54:32] bit of header mangling as well
[16:56:00] <_joe_> claime: you mean queueing the thumb onto a queue and have a pool of thumbors pulling from there?
[16:56:10] _joe_: yeah
[16:56:12] yep, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/thumbor/templates/_haproxy.tpl it's very simple but very particular
[16:56:18] <_joe_> there's arguments for and against it
[16:56:45] for things like the gifsicle and 3d2png jobs in particular a queue based system makes a lot of sense
[16:57:07] hnowlan: because they're so long and resource-intensive?
[16:57:40] comparatively yeah. tbh *all* jobs could fit into a queue based system imo, as larger images can be the same
[16:57:43] <_joe_> hnowlan: meh, it's part of a sync request flow
[16:57:46] and if workers block when doing work anyway
[16:58:04] <_joe_> it's very hard to get it right in rewrite.py
[16:58:28] <_joe_> basically, swift expects a sync response
[16:58:51] <_joe_> so the right way to do this is indeed a system that sends back a sync response
[16:59:10] <_joe_> so the enqueueing can happen in haproxy or in a queue, doesn't change much tbh
[16:59:27] yeah fair
[16:59:30] <_joe_> because the queue should be behind haproxy and do classical transaction-queue work
[16:59:35] <_joe_> which haproxy already does
[16:59:38] both are bandages on thumbor's execution model
[16:59:41] fair enough
[16:59:42] <_joe_> yes
[17:00:34] as regards the istio path - is doing things like header mangling something we'd want to enable or is that offering too much complexity for what we want it to accomplish?
[17:01:06] I *think* that can already be done with istio CRDs
[17:01:33] things like maxconnection (per pod) def. can, using a DestinationRule
[17:02:22] and headers can be manipulated in the virtualservice
[17:02:43] ofc. that's nothing I have implemented in the basic ingress module
[17:02:55] so it's technically possible - how do we feel about it ideologically? I'm not aware of any services that do anything really complex using ingress at the moment
[17:03:19] and we would probably have to do extensive testing to see if it behaves like haproxy does then...
[17:03:59] that's the other thing... atm we only have a handful of services behind ingress
[17:04:03] nothing in prod path
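To make jayme's 17:01 point concrete, this is roughly the shape of an Istio DestinationRule exposing the connection-pool knobs in question. Whether these settings actually reproduce haproxy's one-in-flight-request-per-backend behaviour is exactly the open question in the chat; host, namespace, and values here are illustrative.

    cat <<'EOF' | kubectl apply -f -
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: thumbor
    spec:
      host: thumbor.<namespace>.svc.cluster.local
      trafficPolicy:
        connectionPool:
          tcp:
            maxConnections: 1              # cap concurrent connections to the destination
          http:
            http1MaxPendingRequests: 1     # avoid queueing requests on a busy worker
            maxRequestsPerConnection: 1    # one request per connection (no keep-alive reuse)
    EOF
    # Header manipulation, if needed, would live in the matching VirtualService
    # (spec.http[].headers.request.set / .remove), not in the DestinationRule.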
[17:05:30] yeah, seems like it'd need some time dedicated to it either way
[17:06:56] indeed...whereas haproxy is kind of ready (and we know how it behaves).
[17:07:09] we could also do both...
[17:07:46] as we're in the lucky situation of having a metal fallback anyways
[17:08:30] <_joe_> look, the modules/ingress directory has space for a haproxy module :P
[17:09:09] ok before we add a full-fledged haproxy ingress, we should do it via istio :D
[17:09:55] Yeah, let's maybe not do another full catalog of available proxying solutions in our k8s ingress :D
[17:11:06] for the immediate term, is increasing limits (but not requests) to something higher off the table?
[17:11:26] if you're interested hnowlan we could probably forge an istio config in a couple of hours. you have time on fridays as I understood? :p
[17:12:00] haha
[17:12:17] whisky and ingress
[17:12:52] increasing the limits should be fine as well for the time being I would say
[17:13:03] so you don't get blocked by this
[17:14:09] but with a gazillion thumbor instances, we could maneuver ourselves into a bad situation scheduling-wise
[17:14:48] especially with those super-fat mw pods
[17:15:10] yeah I think the limits are a stopgap at best
[17:15:53] based on profiling current traffic those kinds of requests are quite rare, I had to make up my own calls to get stuff not in cache, asking for stuff like STL files as GIFs
[17:16:43] Stop mediawiki-shaming, they're not fat, they just have a lot of layers. Like an ogre. Or an onion.
[17:17:12] I meant resource-wise ... let's call it rich then. mw pods are resource-rich :)
[17:17:59] ;)
[17:25:02] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) CC @BTullis & @bking: This might be relevant for operators as well.
[17:35:28] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10BTullis) Thanks @JMeybohm - I'll definitely bear that in mind. From my work so far with the spark-operator it seems that the operator //its...
[21:11:23] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Urbanecm) This happened again during the UTC late B&C deploy window: ` 21:10:16 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deploy2002.codfw...
[22:52:49] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10dancy) >>! In T324023#8428647, @Jdforrester-WMF wrote: > Probably not a scap bug but a new server config issue, then. A little of both in the end.
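For completeness, the "raise limits but not requests" stopgap discussed at 17:11 would look something like the resources stanza below: the scheduler places the pod based on the (modest) request, and the container can burst toward the higher limit for the occasional heavy render. The figures and key names are illustrative, not the real thumbor chart values.

    # illustrative helm values override; the real key names in the thumbor chart may differ
    cat > thumbor-memory-stopgap.yaml <<'EOF'
    resources:
      requests:
        memory: 1Gi      # what the scheduler reserves per thumbor container
        cpu: 1
      limits:
        memory: 4Gi      # burst headroom; the cgroup OOM-kill threshold moves up to here
        cpu: 1
    EOF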