[05:30:45] duesen: This should be it https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=x2&var-role=All&from=now-2d&to=now
[05:58:32] <_joe_> duesen: thanks to Amir1's work, we can happily move more of parsoid to use the warmup job without too much worry
[07:05:08] Amir1: thank you!
[07:05:49] <3
[07:30:04] Amir1: is there a way to get disk utilization too?
[07:39:41] duesen: yeah find the x2 master (in eqiad.json) then go to "host overview" graph. Sorry on phone
[07:40:09] Amir1: no rush! thank you!
[08:10:17] <_joe_> duesen: can we talk here please? so we won't interfere with deployments
[08:11:03] <_joe_> so, from https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=parsoid-php&viewPanel=4&from=now-90d&to=now I would expect around 500 jobs/s at most
[08:11:26] <_joe_> I would expect the number of jobs to be similar to the number of calls from changeprop to parsoid nowadays
[08:12:12] <_joe_> err from restbase I meant
[08:12:57] <_joe_> so yeah, we definitely will need a separate changeprop config
[08:13:23] <_joe_> and this is also why I told you we would need to move over servers :)
[08:15:32] <_joe_> rn we have about 800 idle workers on the jobrunners; about 400 busy on the parsoid servers
[08:16:32] <_joe_> so we should be ok to move most midsized wikis, actually anything but enwiki probably?
[08:16:43] <_joe_> enwiki/dewiki maybe, given its edit traffic
[08:23:40] _joe_: small and medium wikis already have it. we just added frwikis
[08:24:01] <_joe_> which by edit count I guess is a large wiki?
[08:24:14] i am going by dblist
[08:24:17] <_joe_> I have a better grasp on the numbers of visits
[08:24:21] <_joe_> yeah it makes sense
[08:24:25] frwiki is in large.dblist
[08:27:00] _joe_: note we are excluding Commons and wikidata regardless
[08:27:17] So the biggest firehoses should be out
[08:27:23] <_joe_> Amir1: I was aware of wikidata yeah
[08:27:31] <_joe_> that was the case with restbase too btw IIRC
[08:28:12] <_joe_> duesen: out of curiosity, when did you enable frwiki?
[08:28:17] <_joe_> I want to check one thing
[08:28:34] _joe_: one hour ago, this deployment window
[08:29:22] scap finished 7:55 UTC
[08:30:56] _joe_: you can see the bump here: https://grafana.wikimedia.org/goto/HTJJwjwVz?orgId=1
[08:31:04] <_joe_> yeah it didn't have that big of an effect on the latency of the parsoid cluster
[08:31:10] <_joe_> which is slightly surprising
[08:53:39] fyi frwiki averages 27k to 30k new revisions each day
[08:53:57] so whatever that works out to in terms of edit speed
[09:32:12] arturo: FYI I don't think that DHCP requests from cloudcontrol2004-dev are making it to the install server
[09:32:17] for your reimage
[09:32:42] volans: yeah, we saw that. Cathal suspects missing firmware updates or other cabling issues
[09:33:00] k
[10:45:16] <_joe_> duesen: sorry, I'm missing one piece - why are we also generating jobs from codfw?
[10:46:23] _joe_: we generate jobs on page view when we detect that the cache is stale. We do the same for the old parser's output
[10:46:27] <_joe_> I'm sure you already told me and it slipped my mind
[10:46:35] <_joe_> yeah ok, that was it :)
[10:47:05] _joe_: can we work on the changeprop config together?
[10:47:06] <_joe_> and we do the check independently, correct?
[10:48:05] we are doing the check for the traditional parser output. we just assume that, if that is stale, we also need to re-parse with parsoid. which is going to be true in 99% of cases, I think
[10:48:07] <_joe_> I'll explain what worries me slightly - right now, the parsoid parsercache is probably empty for a lot of objects?
[10:48:20] <_joe_> ah ok, good
[10:48:39] no, it should be already filled for most pages, at least the ones that are actively read and edited
[10:48:58] it's being populated when restbase asks for updated copies of these pages
[10:48:59] <_joe_> so if the cache is stale, that triggers a sync re-parse with the traditional parser, and an async one with parsoid
[10:49:01] <_joe_> ok
[10:49:18] the job is only needed to ensure that the cache stays up to date when we turn off parsoid updates via restbase
[10:49:26] <_joe_> yep
[10:50:38] <_joe_> so the changeprop change - as a quick pointer - you need to edit operations/deployment-charts:helmfile.d/services/changeprop-jobqueue/values.yaml
[10:51:00] <_joe_> add a configuration for this job to high_traffic_jobs_config
[10:51:45] <_joe_> I would suggest we start relatively low with the concurrency, something similar to cdnPurge, maybe
[10:52:01] <_joe_> have you checked the backlog for the job right now?
[10:52:27] I wouldn't know how
[10:52:52] <_joe_> let me take a look
[10:54:29] <_joe_> so the backlog is typically ok it seems
[10:54:32] <_joe_> and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=74
[10:54:42] <_joe_> tells me the concurrency is low enough right now
[10:55:34] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5 this is the mean backlog time
[11:03:48] _joe_: so we don't need to do anything with changeprop before we enable this on enwiki?
[11:04:03] <_joe_> duesen: I would not expect it
[11:04:18] ok cool, thank you!
[11:04:22] <_joe_> duesen: I'm curious what will happen when we turn off pregeneration via restbase on frwiki
[11:04:33] <_joe_> it's possible more of these jobs are actual parses
[11:04:49] the number of jobs will stay the same, but the load will go up, because more of these jobs will end up doing actual work
[11:04:55] this will also drive up concurrency
[11:05:13] * duesen has to go and pick up his daughter from school
[11:26:06] <_joe_> duesen: yes exactly, that's why I said "let's see what happens in that case"
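As a rough illustration of _joe_'s pointer above (10:50-10:51): adding the job to high_traffic_jobs_config in helmfile.d/services/changeprop-jobqueue/values.yaml might look roughly like the sketch below. The surrounding keys and the concurrency number are assumptions modelled on the existing cdnPurge entry mentioned in the chat, not the actual deployed config.

```yaml
# Sketch only -- key names and values are illustrative; check the real
# high_traffic_jobs_config section (e.g. the cdnPurge entry) in
# helmfile.d/services/changeprop-jobqueue/values.yaml before copying.
high_traffic_jobs_config:
  parsoidCachePrewarm:
    # start relatively low, comparable to cdnPurge, and raise it once the
    # jobqueue-job Grafana backlog panels linked above show it keeps up
    concurrency: 30
```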
[14:08:06] stupid question> is there an important distinction between "present" and "installed" for package resources in puppet? I think ensure_packages sets the latter
[14:11:45] # Alias the 'present' value.
[14:11:47] aliasvalue(:installed, :present)
[14:11:58] Emperor: ^ no difference
[14:14:44] thanks :) [was flagged as a diff by pcc]
[14:34:36] Emperor: fyi it used to be present but got changed in https://github.com/puppetlabs/puppetlabs-stdlib/pull/1196
[14:34:47] there is a fix but we are not on that version of stdlib yet https://github.com/puppetlabs/puppetlabs-stdlib/pull/1300
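In other words, the two declarations below end up in the same state, which is why the pcc diff is purely cosmetic: newer stdlib's ensure_packages() emits ensure => installed, and Puppet's package type aliases that back to present (the aliasvalue line quoted above). A minimal illustration; the package name is made up:

```puppet
# Equivalent declarations: 'installed' is just an alias for 'present'
# in Puppet's package type, so the pcc diff is cosmetic only.
package { 'rsync':
  ensure => present,
}

# ensure_packages() from puppetlabs-stdlib now emits ensure => installed
# (changed in stdlib PR #1196 linked above), which resolves to the same
# state as the explicit resource above.
ensure_packages(['rsync'])
```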
[14:36:08] effie: with the reorg changes starting July 1st, I'm wondering whether it makes sense to change some of the doc tree on wikitech. The current "MediaWiki" navigation is mostly a proxy for (part of) ServiceOps, which makes it less obvious how much "MediaWiki Engineering" stuff to put under there (from Perf, and from elsewhere).
[14:36:56] Krinkle: happy to discuss and make changes. Would you like to create a task and we can pick it up next Q?
[14:37:09] curious if maybe you have thoughts on what you'd like to see for your team. I could split it into "Service Ops" and "MW Eng" but maybe that's too wide for ServiceOps to have a single nav only? Alternatively could do something like "MW eng" and "MW ops" (the latter would e.g. have Memcached, Envoy, Citoid etc)
[14:37:31] <_joe_> It's very hard to split things evenly tbh
[14:37:43] we can meet halfway
[14:39:41] Note that there is also https://wikitech.wikimedia.org/wiki/Category:SRE_Service_Operations, so that will remain unchanged either way
[14:40:23] let's have a meeting to discuss it as I am in the middle of something else
[14:40:34] our next meeting?
[14:42:19] sure
[14:44:15] <3
[14:44:52] hahahah you beat me to the doc
[14:44:53] :p
[14:44:59] cheers timo
[15:35:18] we're seeing some check_puppetrun crashes on cp nodes while switching port 80 from varnish to haproxy, filed https://phabricator.wikimedia.org/T337951
[15:48:28] vgutierrez: next time you see it, if you could grab a copy of /var/lib/puppet/state/last_run_report.yaml and attach it to the task, that would be useful
[15:48:45] jbond: will do
[15:48:50] cheers
[15:48:57] fabfur: ^^
[15:49:03] just in case you spot it first :)
[15:49:28] ok
[15:50:24] vgutierrez: fabfur: I'd put the content in a WMF-NDA paste, it shouldn't have any sensitive information in it but it might, and it's too big to check manually
[15:50:55] will do, thanks jbond
[15:51:07] great, cheers. i also added a similar comment to the task
[21:22:01] godog: I'm trying out the XFF/remoteip approach as suggested, but it seems to not result in effective access control. https://gerrit.wikimedia.org/r/c/operations/puppet/+/919419/12
[21:22:13] > {"timestamp": "2023-06-01T21:16:50", "RequestTime": "6813", "Client-IP": "172.16.0.113", "Handle/Status": "application/x-httpd-php/200", "ResponseSize": "571", "Method": "POST", "Url": "http://performance.wikimedia.beta.wmflabs.org/excimer/speedscope/", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "137.220.80.57, 172.16.0.113", "User-Agent": "curl/7.87.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-",
[21:22:13] "UserHeader": "-", "Connect-IP": "172.16.0.113", "X-Request-Id": "-", "X-Client-IP": "137.220.80.57"}
[21:22:31] I'm guessing this should use X-Client-IP and not X-Forwarded-For like Grafana?
[21:22:43] given XFF includes the internally trusted ip
[21:23:31] it does seem to work for grafana indeed
[21:25:03] ah... https://gerrit.wikimedia.org/r/c/operations/puppet/+/835623/
[21:32:09] > {"timestamp": "2023-06-01T21:30:05", "RequestTime": "3233", "Client-IP": "137.220.80.57", "Handle/Status": "application/x-httpd-php/200", "ResponseSize": "571", "Method": "POST", "Url": "http://performance.wikimedia.beta.wmflabs.org/excimer/speedscope/", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "137.220.80.57, 172.16.0.113", "User-Agent": "curl/7.87.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-",
[21:32:09] "UserHeader": "-", "Connect-IP": "172.16.0.113", "X-Request-Id": "-", "X-Client-IP": "-"}
[21:32:29] That looks better. It's now taking my external IP as the interpreted "Client-IP" and apparently unsetting X-Client-IP as a side-effect.
[21:32:34] However it is still not restricting access.
[21:33:48] Hacking it up locally with curl from localhost confirms that it allows any arbitrary IP.
[21:33:49] $ curl -vvi -d 'x' -X POST -H 'X-Client-IP: 200.1.1.1' https://deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud/excimer/speedscope/
[21:38:18] left details at https://gerrit.wikimedia.org/r/c/operations/puppet/+/919419 for now
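For context on the approach being discussed: the general mod_remoteip pattern (the Grafana-style setup linked at 21:25) tends to look something like the sketch below. This is not the content of either Gerrit change; the header name, address ranges and Location path are placeholders, and the intent of the pattern is that the forwarded header is only honoured when the request comes from the trusted internal proxy, so a client-supplied X-Client-IP like the curl test above should not satisfy the Require check.

```apache
# Illustrative sketch only (assumes mod_remoteip is enabled); ranges and
# the protected path are placeholders, not the puppet-managed values.
RemoteIPHeader X-Client-IP
# Only honour X-Client-IP when the TCP peer is the trusted internal proxy;
# the header is ignored when sent directly by an arbitrary client.
RemoteIPInternalProxy 172.16.0.0/21

<Location "/excimer/speedscope/">
    # With mod_remoteip active, Require ip matches the rewritten client
    # address rather than the proxy's address.
    Require ip 198.51.100.0/24
</Location>
```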