[02:28:23] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#9057445, @Reedy wrote: > None of this is helping move the discussion forward. > > Timo's comment in T27...
[06:19:35] hello folks!
[06:20:07] if you like the idea of upgrading changeprop to buster + nodejs10, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943037
[06:20:20] then I also have:
[06:20:22] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943038
[06:20:31] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943039
[06:20:39] for the exporter's resources
[06:31:23] <_joe_> elukey: we did discuss this yesterday in our team meeting - we are ok with doing the OS upgrade but not with moving to a newer nodejs version
[06:31:50] <_joe_> that should be done once we have proper support on the software side
[06:32:14] <_joe_> so let me review your changes :)
[06:39:05] ack, understood!
[06:39:12] ok, I'll start with the OS upgrade :)
[06:41:08] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Also another question. You mention that mediawiki would have to access cache keys here, am I reading it right that we're making shared use of...
[06:42:05] will do changeprop first, then in a couple of days the job queues as well
[06:48:10] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Answering myself: the model seems to imply that is the case. I strongly urge you to reconsider that model mostly because using a database (e...
[07:04:28] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) Thinking a bit more about this: how do we ensure that the sharding of keys across servers is the same in the two systems? One way to do it is...
[07:10:03] claime: https://grafana.wikimedia.org/d/000000027/kafka?forceLogin&from=now-3h&orgId=1&to=now&var-datasource=thanos&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&var-disk_device=All&viewPanel=75
[07:10:12] last upgrade, same pattern
[07:10:41] in a cluster that handles 130k msg/s
[07:17:43] changeprop moved to buster, will wait a bit and then deploy the prometheus exporter change too
[07:26:17] <_joe_> elukey: ack, thanks
[07:50:01] all deployed, will watch for metrics
[08:00:47] so far all good
[08:01:49] no throttling in the new pods afaics
[08:14:59] _joe_ proceeding with jobqueues, ok?
[08:15:17] <_joe_> elukey: yes
[08:15:29] <_joe_> sorry, I went down a rabbithole with mcrouter
[08:17:00] super, I also created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/944155/ for the jobqueue's exporter as well
[08:36:22] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Xover) Over the last few days (early Sunday UTC was when I consciously noted it, bu...
[08:44:13] jq-codfw done, the only weird thing is backlog timings for a parsoid prewarm job:
[08:44:16] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-1h&to=now
[08:44:37] given how it is recovering it seems related to the deployment (I mean the re-creation of pods etc..)
[08:45:39] <_joe_> what is wrong specifically?
[08:46:45] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-1h&to=now&viewPanel=63
[08:47:26] not sure if it is wrong or not, but the p90 and p99 look weird
[08:48:12] happened also for the avg, but it recovered
[09:05:22] <_joe_> it's below 10 seconds
[09:05:24] <_joe_> ignore it
[09:06:43] super
[09:06:54] I'll redeploy codfw in a bit with the exporter changes
[09:06:56] and then eqiad
[09:47:20] all good, deployed all changeprop instances
[09:47:44] in eqiad I see some jobs' p90/p99 at ~10-15s, but trending down (like codfw)
[09:47:55] the rest looks good, no more throttling for the exporters
[11:38:34] Checked the throttling status with:
[11:38:35] sum(irate(container_cpu_cfs_throttled_seconds_total{pod=~"changeprop.*"}[5m])) by (pod, container)
[11:38:49] if you check in thanos for the past hours it is clear that there was a problem
[11:39:18] should we have a dashboard for it? Or maybe a dedicated panel somewhere
[11:39:29] https://grafana.wikimedia.org/goto/xCCxVXqVz?orgId=1
[11:39:33] :)
[11:40:00] It's a user dashboard, but feel free to copy the panel
[11:41:31] there is always a Janis dashboard, I should know it by now
[11:41:52] :D
[11:42:06] we should probably make it more production ready, it is very useful
[11:42:43] Maybe integrate it into https://grafana.wikimedia.org/d/000000519/kubernetes-overview?orgId=1 somehow
[11:42:48] Or in the service dashboards
[11:44:42] I like the drill-down that Janis did, having various angles (sum by namespace, etc..)
[11:47:50] I'll work on a user dashboard as well, then we can decide
[11:48:04] I am pretty sure that I have some throttling happening on ml-serve as well
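As a rough sketch of what making the ad-hoc Thanos query above "production ready" could look like, the same expression could be captured as Prometheus recording rules, keeping the per-pod/container view alongside the "sum by namespace" angle from the dashboard. The rule names and file layout here are hypothetical, not taken from any existing WMF rules file; recording rules normally use rate() rather than irate(), since irate() only looks at the last two samples and is better suited to interactive graphs.

```yaml
# Hypothetical recording rules for container CPU throttling (illustrative only,
# not an existing WMF rule file).
groups:
  - name: container_cpu_throttling.rules
    rules:
      # Per-pod/container throttled CPU seconds per second, mirroring the ad-hoc query
      - record: pod_container:container_cpu_cfs_throttled_seconds:rate5m
        expr: >
          sum by (pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total[5m])
          )
      # Per-namespace aggregate, matching the "sum by namespace" dashboard drill-down
      - record: namespace:container_cpu_cfs_throttled_seconds:rate5m
        expr: >
          sum by (namespace) (
            rate(container_cpu_cfs_throttled_seconds_total[5m])
          )
```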
[11:50:30] 10serviceops, 10sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342761 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Alert has resolved
[12:35:07] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) >>! In T297815#9057808, @Joe wrote: > Also another question. You mention that mediawiki would have to acces...
[12:39:37] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) >>! In T297815#9057831, @Joe wrote: > Thinking a bit more about this: > how do we ensure that the sharding...
[12:42:32] 10serviceops, 10Abstract Wikipedia team, 10Patch-For-Review, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) That's exactly what I wanted to do, add a new entry to `wgObjectCaches` with a specific `routingPrefix`; I wasn't sure...
[13:05:27] hi folks, https://gerrit.wikimedia.org/r/c/operations/puppet/+/942692 is changing redirects.dat, is there anything I should be doing post-merge?
[13:18:53] or said another way: does the above have any chance of causing havoc?
[13:27:47] (created https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling as an alternative to Janis' dashboard)
[13:30:33] we could think about having per-namespace alerts
[13:30:49] see https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?orgId=1&var-dc=thanos&var-site=eqiad&var-prometheus=k8s&var-sum_by=namespace&var-service=changeprop-jobqueue&var-ignore_container_regex=&from=now-12h&to=now
[13:31:11] if we constantly have $seconds of throttling in aggregate, something is probably off
[13:33:24] and this is maybe me not understanding throttling well, but I see some services with an aggregate over all the pods in the namespace crossing the 1-second mark every datapoint:
[13:33:28] https://grafana-rw.wikimedia.org/d/Q1HD5X3Vk/elukey-k8s-throttling?orgId=1&var-dc=thanos&var-site=eqiad&var-prometheus=k8s&var-sum_by=namespace&var-service=All&var-ignore_container_regex=&from=now-3h&to=now
[13:33:41] thumbor, tegola, mobileapps, linkrec
[13:43:37] thumbor's throttling is a bit of a lost cause :P
[13:43:51] ahhahah okok
[13:44:48] hnowlan: to learn the background - is it because thumbor just uses too many resources? And latency-wise it is fine to get throttled?
[13:56:43] it uses a huge amount yeah
[13:56:53] we've bumped it up a lot over time but it'll still happen
[13:57:02] ack makes sense
[13:57:03] It's not ideal that it gets throttled though
[13:58:36] It's forking a lot, so we can't get away with what we're probably going to do for mw (remove limits)
[13:58:42] iiuc
[13:59:12] Maybe reducing the timeslot for the cpu quota calculation would help a tad, but with such a huge amount of throttling I'm not even sure it'd make a dent
[14:00:29] didn't know about the no-limits approach, interesting
[14:01:05] elukey: yeah basically the throttling is because of the way the cpu quota calculation is done in the cgroup
[14:01:19] yep yep that part I know
[14:01:22] No limit, no cgroup, no waiting 100ms for the quota calculation period to end, basically
[14:01:29] the whole millicore every 100ms etc..
[14:01:46] ahhhh wow
[14:01:48] But you can only do that if you're hard bound on your number of threads/processes, like with php-fpm for instance
[14:02:04] else you run the risk of runaway resource consumption
[14:02:09] how do you enable this behavior?
[14:02:17] is there a special limits value?
[14:02:21] (ignorant about it)
[14:02:23] Just don't give any limits
[14:02:29] Just requests
[14:02:32] ack thanks
[14:02:46] I wondered what happened with that setting, good to know :D
[14:02:52] it can easily backfire as you said :D
[14:03:00] Then if you know you have static workloads, you can play with CPU pinning as well
[14:03:07] but you pay the elasticity cost
[14:03:47] https://danluu.com/cgroup-throttling/
[14:04:21] (I've had my head buried in this for the past few workdays)
[14:04:39] will read, thanks!
[14:04:41] (I think I'm going slightly insane because of it)
[14:07:17] (that's fine, it is all related to k8s)
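To illustrate the "just requests, no limits" idea discussed above, a minimal container resources stanza might look like the sketch below; the container name and values are made up and do not reflect any actual chart. Without a CPU limit there is no CFS quota to wait on at each 100ms period, but this only stays safe when the number of threads/processes is hard-bounded (e.g. a fixed-size php-fpm pool), which is why it would not fit a heavily forking service like Thumbor.

```yaml
# Minimal sketch, assuming a workload with a bounded worker pool; names and
# values are illustrative, not taken from any existing deployment chart.
containers:
  - name: example-app            # hypothetical container
    resources:
      requests:
        cpu: "1"                 # used for scheduling and fair-share weighting
        memory: 512Mi
      limits:
        memory: 512Mi            # keep a memory limit to bound worst-case usage
        # no "cpu:" entry on purpose: with no CPU limit there is no CFS quota,
        # hence no per-period throttling; the trade-off is the risk of runaway
        # CPU use if the process/thread count is not bounded
```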
[14:24:42] 10serviceops, 10MW-on-K8s, 10noc.wikimedia.org, 10Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (10Clement_Goubert) I saw today that the mw-misc deployment on which noc on kubernetes relies is not updated by scap. The image is rebuilt and updated, but th...
[15:18:41] any opinion re: redirects.dat changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/942692? will it set the air on fire?
[15:39:49] <_joe_> godog: let me take a look
[15:40:09] thank you _joe_
[15:40:56] <_joe_> godog: just one doubt. graphite-labs doesn't get a response from the mediawiki cluster, which is where redirects.dat gets used
[15:41:27] <_joe_> so I'm not sure what the fit is there.
[15:41:59] <_joe_> I would ask traffic for advice about where to set up such a redirect
[15:42:14] _joe_: more or less the same doubt I had in the comment I left heh
[15:44:19] <_joe_> godog: yeah I mean, this is how we historically used those redirect files, but it's typically for mediawiki-related stuff. It really doesn't seem like a good fit.
[15:44:40] <_joe_> but maybe we don't have anything better, in that case, I rest my case :)
[15:45:23] yeah I see what you are saying, I don't think it'll be there for long tbh, I think the intention is just not to blackhole the vhost
[17:33:51] 10serviceops, 10SRE, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) 05Open→03Declined Closing in favor of {T292707} as it makes little sense at this point to consider putting Wikitech into legacy production hosting.