[06:28:55] hello folks!
[06:31:41] I have some changeprop things to propose
[06:32:22] 1) Upgrade to Buster - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943037. The next step will be to upgrade to nodejs12/bullseye, but as an intermediate step migrating to buster and keeping nodejs10 seems ok as well.
[06:34:31] 2) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/943038/1 and next - afaics the prometheus statsd exporter in changeprop is heavily throttled (almost constantly), I fear that this is why we lose metrics
[06:56:51] for example, https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=changeprop&var-pod=changeprop-production-76447bc6bf-nwdds&var-container=All is one of the pods most affected
[07:20:09] ah wow afaics eventgate runs on buster + nodejs 10 sigh
[07:20:25] <_joe_> elukey: sigh
[07:20:46] <_joe_> elukey: now a lot of service owners will have an interesting wake-up call regarding that
[07:21:22] <_joe_> buster, old versions of nodejs
[07:21:58] <_joe_> elukey: apart from that, I don't know if I agree with upgrading to buster or anything else before there's someone with ownership
[07:22:14] <_joe_> actually, I disagree but I want to take this to the team
[07:23:37] _joe_ I think a bare minimum upgrade is needed, I am just a worried "customer" since ML is planning on using it for streams.. I think that upgrading to buster + fixing the metrics is overdue, all the rest can become something to discuss for sure
[07:23:54] (well ML is already using it for streams)
[07:23:58] <_joe_> elukey: I agree it's overdue
[07:24:09] <_joe_> and please tell your manager that changeprop is unowned
[07:24:14] <_joe_> and that it's blocking you
[07:24:24] <_joe_> see why things never get properly fixed here?
[07:24:39] <_joe_> you do some work that's not on you out of goodwill/need to unblock yourself
[07:24:49] <_joe_> and the problem gets kicked down the road by a year
[07:28:50] sure sure I 100% agree
[09:05:01] https://usercontent.irccloud-cdn.com/file/1EG9jNo7/7u9wk4.jpg
[09:12:40] <_joe_> ahahahah
[09:14:09] <_joe_> now the other patches require a bit more scrutiny I think
[09:14:20] <_joe_> although I tested all I could locally
[09:17:50] is it possible to load them in mwdebug and test them?
[09:31:16] <_joe_> I guess so, but we also need to add the vhost for noc
[09:31:54] <_joe_> Amir1: tell me when you want to run such a test
[09:32:06] <_joe_> I added tests for most things I could think of at least
[09:32:21] sounds good
[09:36:46] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Clement_Goubert)
[09:38:00] Amir1: lmao
[09:38:55] xD
[10:25:34] just discovered that somebody outside the WMF runs a changeprop instance (with UA SampleChangePropInstance, the default in the repo's config)
[10:25:45] and hits the /precache ORES URI as well
[10:25:46] lol
[10:26:07] O_O
[10:26:09] lmao
[10:26:18] I wonder how old it is
[10:26:57] hnowlan: https://logstash.wikimedia.org/goto/d616b4f5a215b46be5d14f2baa37e121
[10:27:02] same IP
[10:28:43] no idea how to reach out
[10:28:55] maybe I can add a requestctl rule
[10:29:12] That's a lot more requests than I was expecting.
[10:31:18] interestingly, it seems that it started around the 21st
[10:31:24] and ramped up since then
[10:37:51] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Clement_Goubert) 05Open→03Resolved Resolving for now, we will reopen if issues reappear.
[11:09:22] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Fuzzy) >>! In T275319#9054277, @stjn wrote: > Wikisource editors can absolutely split pages into smaller ones, since those longer...
[11:29:38] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10MoritzMuehlenhoff) There were some hosts still on 1.13 (cloudweb, mwmaint, deployment servers, scandium, snapshot) and parse1002 (which was down during...
[11:31:34] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (10Clement_Goubert) >>! In T340087#9054939, @MoritzMuehlenhoff wrote: > There were some hosts still on 1.13 (cloudweb, mwmaint, deployment servers, scandiu...
[12:12:06] <_joe_> elukey: yeah let's block this?
[12:42:35] _joe_ back sorry, I am thinking of a requestctl rule targeting the IP, does it make sense?
[12:50:41] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Alexey_Skripnik) >>! In T275319#9054275, @stjn wrote: > For the record, I don't think that the need to be able to build even long...
[12:57:41] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10stjn) ‘Readers expect us to dump everything on one page’ is just your opinion, and so is ‘from usability standpoint, it’s better...
[13:08:14] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Alexey_Skripnik) >>! In T275319#9055247, @stjn wrote: > From usability standpoint, it’s better to have a page that doesn’t weigh...
[13:29:39] <_joe_> elukey: the UA actually
[13:29:57] <_joe_> create a per-IP throttle for people calling us with that generic UA
[13:30:09] <_joe_> that was my idea
[13:30:13] <_joe_> I can get on it later
[13:30:33] yep yep, I'll use Superset's requestctl rule generator later on
[13:52:33] <_joe_> elukey: oh no that's verbose and creates rules that are not properly linted
[13:52:45] <_joe_> it shows the author has put no spicerack in that implementation
[13:54:55] you don't have to follow it to the letter, aggregation is left for the user as an exercise
[13:55:13] as for linting... what's the issue?
patches are welcome you know ;)
[14:00:15] Joe still has it, I cannot catch Riccardo like this
[14:07:47] <_joe_> the difference is you don't first make him feel like he's on the wrong side of a linter
[14:13:14] x)
[14:38:43] Ok so this is the baseline: https://superset.wikimedia.org/requestctl-generator?q=d7VvNGwB1X0
[14:39:10] one more note - this particular UA acts as changeprop and hits /v3/precache, which in turn warms up the Redis cache
[14:39:30] a complete block may also be ok
[14:39:36] but we can start with throttling
[14:40:23] I can stage the above if people are ok, then some soul can review and I'll apply
[14:58:03] <_joe_> elukey: seems legit
[15:08:33] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) p:05Triage→03Medium
[15:52:48] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) We have already procured the servers for this work, and they're set up already.
[17:19:59] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) >>! In T297815#9056015, @Joe wrote: > We have already procured the servers for this work, and they're set up already. Yup, I ju...
[18:30:15] 10serviceops, 10Abstract Wikipedia team, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Joe) >>! In T297815#9056348, @Jdforrester-WMF wrote: >>>! In T297815#9056015, @Joe wrote: >> We have already procured the servers for this work, a...
[21:23:24] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#9054277, @stjn wrote: > Wikisource editors can absolutely split pages into smaller ones, since those lon...
[22:02:35] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10stjn) >>! In T275319#9055301, @Alexey_Skripnik wrote: > Could you elaborate on why serving 2.3 Mb of HTML is bad from a usability...
[22:49:36] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Alexey_Skripnik) >>! In T275319#9057235, @stjn wrote: > Because heavy pages load worse for readers, especially on poorer connecti...
[23:15:15] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#9057235, @stjn wrote: > (@Vladis13 please keep in mind https://www.mediawiki.org/wiki/Bug_management/Pha...
[23:44:29] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Reedy) None of this is helping move the discussion forward. Timo's comment in T275319#7947012 is still relevant. And at the sam...
[23:45:58] 10serviceops, 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Vladis13) >>! In T275319#9057297, @Alexey_Skripnik wrote: > Readers don't care directly about the weight of a webpage's HTML. Wha...