[02:28:23] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#9057445, @Reedy wrote: > None of this is helping move the discussion forward. > > Timo's comment in T27...
[06:19:35] hello folks!
[06:20:07] if you like the idea of upgrading changeprop to buster + nodejs10, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943037
[06:20:20] then I also have:
[06:20:22] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943038
[06:20:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943039
[06:20:39] for the exporter's resources
[06:31:23] <_joe_> elukey: we did discuss this yesterday in our team meeting - we are ok with doing the OS upgrade but not with moving to a newer nodejs version
[06:31:50] <_joe_> that should be done once we have proper support on the software side
[06:32:14] <_joe_> so let me review your changes :)
[06:39:05] ack, understood!
[06:39:12] ok, I'll start with the OS upgrade :)
[06:41:08] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Also another question. You mention that mediawiki would have to access cache keys here, am I reading it right that we're making shared use of...
[06:42:05] will do changeprop first, then in a couple of days the job queues as well
[06:48:10] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Answering myself: the model seems to imply that is the case. I strongly urge you to reconsider that model mostly because using a database (e...
[07:04:28] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Thinking a bit more about this: how do we ensure that the sharding of keys across servers is the same in the two systems? One way to do it is...
[07:10:03] claime: https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[07:10:12] last upgrade, same pattern
[07:10:41] in a cluster that handles 130k msg/s
[07:17:43] changeprop moved to buster, will wait a bit and then deploy the prometheus exporter change too
[07:26:17] <_joe_> elukey: ack, thanks
[07:50:01] all deployed, will watch for metrics
[08:00:47] so far all good
[08:01:49] no throttling in the new pods afaics
[08:14:59] _joe_ proceeding with jobqueues, ok?
[08:15:17] <_joe_> elukey: yes
[08:15:29] <_joe_> sorry, I went down a rabbithole with mcrouter
[08:17:00] super, I also created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/944155/ for the jobqueue's exporter as well
[08:36:22] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Xover) Over the last few days (early Sunday UTC was when I consciously noted it, bu...
[08:44:13] jq-codfw done, the only weird thing is backlog timings for a parsoid prewarm job:
[08:44:16] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-1h&to=now
[08:44:37] given how it is recovering it seems related to the deployment (I mean the re-creation of pods etc..)
[08:45:39] <_joe_> what is wrong specifically?
[08:46:45] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-1h&to=now&viewPanel=63
[08:47:26] not sure if it is wrong or not, but the p90 and p99 look weird
[08:48:12] happened also for the avg, but it recovered
[09:05:22] <_joe_> it's below 10 seconds
[09:05:24] <_joe_> ignore it
[09:06:43] super
[09:06:54] I'll redeploy codfw in a bit with the exporter changes
[09:06:56] and then eqiad
[09:47:20] all good, deployed all changeprop instances
[09:47:44] in eqiad I see some jobs' p90/p99 at ~10-15s, but trending down (like codfw)
[09:47:55] the rest looks good, no more throttling for the exporters
[11:38:34] Checked the throttling status with:
[11:38:35] sum(irate(container_cpu_cfs_throttled_seconds_total{pod=~"changeprop.*"}[5m])) by (pod, container)
[11:38:49] if you check in thanos for the past hours it is clear that there was a problem
[11:39:18] should we have a dashboard for it? Or maybe a dedicated panel somewhere
[11:39:29] https://grafana.wikimedia.org/goto/xCCxVXqVz?orgId=1
[11:39:33] :)
[11:40:00] It's a user dashboard, but feel free to copy the panel
[11:41:31] there is always a Janis dashboard, I should know it by now
[11:41:52] :D
[11:42:06] we should probably make it more production ready, it is very useful
[11:42:43] Maybe integrate it into https://grafana.wikimedia.org/d/000000519/kubernetes-overview?orgId=1 somehow
[11:42:48] Or in the service dashboards
[11:44:42] I like the drill-down that Janis did, having various angles (sum by namespace, etc..)
[11:47:50] I'll work on a user dashboard as well, then we can decide
[11:48:04] I am pretty sure that I have some throttling happening on ml-serve as well
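As a rough sketch of what making the ad-hoc Thanos query above "production ready" could look like, the same expression could be captured as Prometheus recording rules, keeping the per-pod/container view alongside the "sum by namespace" angle from the dashboard. The rule names and file layout here are hypothetical, not taken from any existing WMF rules file; recording rules normally use rate() rather than irate(), since irate() only looks at the last two samples and is better suited to interactive graphs.

```yaml
# Hypothetical recording rules for container CPU throttling (illustrative only,
# not an existing WMF rule file).
groups:
  - name: container_cpu_throttling.rules
    rules:
      # Per-pod/container throttled CPU seconds per second, mirroring the ad-hoc query
      - record: pod_container:container_cpu_cfs_throttled_seconds:rate5m
        expr: >
          sum by (pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total[5m])
          )
      # Per-namespace aggregate, matching the "sum by namespace" dashboard drill-down
      - record: namespace:container_cpu_cfs_throttled_seconds:rate5m
        expr: >
          sum by (namespace) (
            rate(container_cpu_cfs_throttled_seconds_total[5m])
          )
```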
[11:50:30] 10serviceops, 10sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342761 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Alert has resolved
[12:35:07] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) >>! In T297815#9057808, @Joe wrote: > Also another question. You mention that mediawiki would have to acces...
[12:39:37] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) >>! In T297815#9057831, @Joe wrote: > Thinking a bit more about this: > how do we ensure that the sharding...
[12:42:32] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) That's exactly what I wanted to do, add a new entry to `wgObjectCaches` with a specific `routingPrefix`; I wasn't sure...
[13:05:27] hi folks, https://gerrit.wikimedia.org/r/c/operations/puppet/+/942692 is changing redirects.dat, is there anything I should be doing post-merge?
[13:18:53] or said another way: does the above have any chance of causing havoc?
[13:27:47] (created https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling as an alternative to Janis' dashboard)
[13:30:33] we could think about having per-namespace alerts
[13:30:49] see https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?orgId=1&var-dc=thanos&var-site=eqiad&var-prometheus=k8s&var-sum_by=namespace&var-service=changeprop-jobqueue&var-ignore_container_regex=&from=now-12h&to=now
[13:31:11] if we constantly have $seconds of throttling in aggregate, something is probably off
[13:33:24] and this is maybe me not understanding throttling well, but I see some services with an aggregate over all the pods in the namespace crossing the 1-second mark every datapoint:
[13:33:28] https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?orgId=1&var-dc=thanos&var-site=eqiad&var-prometheus=k8s&var-sum_by=namespace&var-service=All&var-ignore_container_regex=&from=now-3h&to=now
[13:33:41] thumbor, tegola, mobileapps, linkrec
[13:43:37] thumbor's throttling is a bit of a lost cause :P
[13:43:51] ahhahah okok
[13:44:48] hnowlan: to learn the background - is it because thumbor just uses too many resources? And latency-wise it is fine to get throttled?
[13:56:43] it uses a huge amount yeah
[13:56:53] we've bumped it up a lot over time but it'll still happen
[13:57:02] ack makes sense
[13:57:03] It's not ideal that it gets throttled though
[13:58:36] It's forking a lot, so we can't get away with what we're probably going to do for mw (remove limits)
[13:58:42] iiuc
[13:59:12] Maybe reducing the timeslot for the cpu quota calculation would help a tad, but with such a huge amount of throttling I'm not even sure it'd make a dent
[14:00:29] didn't know about the no-limits approach, interesting
[14:01:05] elukey: yeah basically the throttling is because of the way the cpu quota calculation is done in the cgroup
[14:01:19] yep yep that part I know
[14:01:22] No limit, no cgroup, no waiting 100ms for the quota calculation period to end, basically
[14:01:29] the whole millicore every 100ms etc..
[14:01:46] ahhhh wow
[14:01:48] But you can only do that if you're hard bound on your number of threads/processes, like with php-fpm for instance
[14:02:04] else you run the risk of runaway resource consumption
[14:02:09] how do you enable this behavior?
[14:02:17] is there a special limits value?
[14:02:21] (ignorant about it)
[14:02:23] Just don't give any limits
[14:02:29] Just requests
[14:02:32] ack thanks
[14:02:46] I wondered what happened with that setting, good to know :D
[14:02:52] it can easily backfire as you said :D
[14:03:00] Then if you know you have static workloads, you can play with CPU pinning as well
[14:03:07] but you pay the elasticity cost
[14:03:47] https://danluu.com/cgroup-throttling/
[14:04:21] (I've had my head buried in this for the past few workdays)
[14:04:39] will read, thanks!
[14:04:41] (I think I'm going slightly insane because of it)
[14:07:17] (that's fine, it is all related to k8s)
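To illustrate the "just requests, no limits" idea discussed above, a minimal container resources stanza might look like the sketch below; the container name and values are made up and do not reflect any actual chart. Without a CPU limit there is no CFS quota to wait on at each 100ms period, but this only stays safe when the number of threads/processes is hard-bounded (e.g. a fixed-size php-fpm pool), which is why it would not fit a heavily forking service like Thumbor.

```yaml
# Minimal sketch, assuming a workload with a bounded worker pool; names and
# values are illustrative, not taken from any existing deployment chart.
containers:
  - name: example-app            # hypothetical container
    resources:
      requests:
        cpu: "1"                 # used for scheduling and fair-share weighting
        memory: 512Mi
      limits:
        memory: 512Mi            # keep a memory limit to bound worst-case usage
        # no "cpu:" entry on purpose: with no CPU limit there is no CFS quota,
        # hence no per-period throttling; the trade-off is the risk of runaway
        # CPU use if the process/thread count is not bounded
```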
[14:24:42] 10serviceops, 10MW-on-K8s, 10noc.wikimedia.org, 10Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Clement_Goubert) I saw today that the mw-misc deployment on which noc on kubernetes relies is not updated by scap. The image is rebuilt and updated, but th...
[15:18:41] any opinion re: redirects.dat changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/942692? will it set the air on fire?
[15:39:49] <_joe_> godog: let me take a look
[15:40:09] thank you _joe_
[15:40:56] <_joe_> godog: just one doubt. graphite-labs doesn't get a response from the mediawiki cluster, which is where redirects.dat gets used
[15:41:27] <_joe_> so I'm not sure what the fit is there.
[15:41:59] <_joe_> I would ask traffic for advice about where to set up such a redirect
[15:42:14] _joe_: more or less the same doubt I had in the comment I left heh
[15:44:19] <_joe_> godog: yeah I mean, this is how we historically used those redirect files, but it's typically for mediawiki-related stuff. It really doesn't seem like a good fit.
[15:44:40] <_joe_> but maybe we don't have anything better, in that case, I rest my case :)
[15:45:23] yeah I see what you are saying, I don't think it'll be there for long tbh, I think the intention is just not to blackhole the vhost
[17:33:51] 10serviceops, 10SRE, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) 05Open→03Declined Closing in favor of {T292707} as it makes little sense at this point to consider putting Wikitech into legacy production hosting.