[05:30:45] duesen: This should be it https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=x2&var-role=All&from=now-2d&to=now
[05:58:32] <_joe_> duesen: thanks to Amir1's work, we can happily move more of parsoid to use the warmup job without too much worry
[07:05:08] Amir1: thank you!
[07:05:49] <3
[07:30:04] Amir1: is there a way to get disk utilization too?
[07:39:41] duesen: yeah find the x2 master (in eqiad.json) then go to "host overview" graph. Sorry on phone
[07:40:09] Amir1: no rush! thank you!
[08:10:17] <_joe_> duesen: can we talk here please? so we won't interfere with deployments
[08:11:03] <_joe_> so, from https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=parsoid-php&viewPanel=4&from=now-90d&to=now I would expect around 500 jobs/s at most
[08:11:26] <_joe_> I would expect the number of jobs to be similar to the number of calls from changeprop to parsoid nowadays
[08:12:12] <_joe_> err from restbase I meant
[08:12:57] <_joe_> so yeah, we definitely will need a separate changeprop config
[08:13:23] <_joe_> and this is also why I told you we would need to move over servers :)
[08:15:32] <_joe_> rn we have about 800 idle workers on the jobrunners; about 400 busy on the parsoid servers
[08:16:32] <_joe_> so we should be ok to move most midsized wikis, actually anything but enwiki probably?
[08:16:43] <_joe_> enwiki/dewiki maybe, given its edit traffic
[08:23:40] _joe_: small and medium wikis already have it. we just added frwikis
[08:24:01] <_joe_> which by edit count I guess is a large wiki?
[08:24:14] i am going by dblist
[08:24:17] <_joe_> I have a better grasp on the numbers of visits
[08:24:21] <_joe_> yeah it makes sense
[08:24:25] frwiki is in large.dblist
[08:27:00] _joe_: note we are excluding Commons and wikidata regardless
[08:27:17] So the biggest firehoses should be out
[08:27:23] <_joe_> Amir1: I was aware of wikidata yeah
[08:27:31] <_joe_> that was the case with restbase too btw IIRC
[08:28:12] <_joe_> duesen: out of curiosity, when did you enable frwiki?
[08:28:17] <_joe_> I want to check one thing
[08:28:34] _joe_: one hour ago, this deployment window
[08:29:22] scap finished 7:55 UTC
[08:30:56] _joe_: you can see the bump here: https://grafana.wikimedia.org/goto/HTJJwjwVz?orgId=1
[08:31:04] <_joe_> yeah it didn't have that big of an effect on the latency of the parsoid cluster
[08:31:10] <_joe_> which is slightly surprising
[08:53:39] fyi frwiki averages 27k to 30k new revisions each day
[08:53:57] so whatever that works out to in terms of edit speed
[09:32:12] arturo: FYI I don't think that DHCP requests from cloudcontrol2004-dev are making it to the install server
[09:32:17] for your reimage
[09:32:42] volans: yeah, we saw that. Cathal suspects missing firmware updates or other cabling issues
[09:33:00] k
[10:45:16] <_joe_> duesen: sorry, I'm missing one piece - why are we also generating jobs from codfw?
[10:46:23] _joe_: we generate jobs on page view when we detect that the cache is stale. We do the same for the old parser's output
[10:46:27] <_joe_> I'm sure you already told me and it slipped my mind
[10:46:35] <_joe_> yeah ok, that was it :)
[10:47:05] _joe_: can we work on the changeprop config together?
[10:47:06] <_joe_> and we do the check independently, correct?
[10:48:05] we are doing the check for the traditional parser output. we just assume that, if that is stale, we also need to re-parse with parsoid. which is going to be true in 99% of cases, I think
[10:48:07] <_joe_> I'll explain what worries me slightly - right now, the parsoid parsercache is probably empty for a lot of objects?
[10:48:20] <_joe_> ah ok, good
[10:48:39] no, it should be already filled for most pages, at least the ones that are actively read and edited
[10:48:58] it's being populated when restbase asks for updated copies of these pages
[10:48:59] <_joe_> so if the cache is stale, that triggers a sync re-parse with the traditional parser, and an async one with parsoid
[10:49:01] <_joe_> ok
[10:49:18] the job is only needed to ensure that the cache stays up to date when we turn off parsoid updates via restbase
[10:49:26] <_joe_> yep
[10:50:38] <_joe_> so the changeprop change - as a quick pointer - you need to edit operations/deployment-charts:helmfile.d/services/changeprop-jobqueue/values.yaml
[10:51:00] <_joe_> add a configuration for this job to high_traffic_jobs_config
[10:51:45] <_joe_> I would suggest we start relatively low with the concurrency, something similar to cdnPurge, maybe
[10:52:01] <_joe_> have you checked the backlog for the job right now?
[10:52:27] I wouldn't know how
[10:52:52] <_joe_> let me take a look
[10:54:29] <_joe_> so the backlog is typically ok it seems
[10:54:32] <_joe_> and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=74
[10:54:42] <_joe_> tells me the concurrency is low enough right now
[10:55:34] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5 this is the mean backlog time
[11:03:48] _joe_: so we don't need to do anything with changeprop before we enable this on enwiki?
[11:04:03] <_joe_> duesen: I would not expect it
[11:04:18] ok cool, thank you!
[11:04:22] <_joe_> duesen: I'm curious what will happen when we turn off pregeneration via restbase on frwiki
[11:04:33] <_joe_> it's possible more of these jobs are actual parses
[11:04:49] the number of jobs will stay the same, but the load will go up, because more of these jobs will end up doing actual work
[11:04:55] this will also drive up concurrency
[11:05:13] * duesen has to go and pick up his daughter from school
[11:26:06] <_joe_> duesen: yes exactly, that's why I said "let's see what happens in that case"
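As a rough illustration of _joe_'s pointer above (10:50-10:51): adding the job to high_traffic_jobs_config in helmfile.d/services/changeprop-jobqueue/values.yaml might look roughly like the sketch below. The surrounding keys and the concurrency number are assumptions modelled on the existing cdnPurge entry mentioned in the chat, not the actual deployed config.

```yaml
# Sketch only -- key names and values are illustrative; check the real
# high_traffic_jobs_config section (e.g. the cdnPurge entry) in
# helmfile.d/services/changeprop-jobqueue/values.yaml before copying.
high_traffic_jobs_config:
  parsoidCachePrewarm:
    # start relatively low, comparable to cdnPurge, and raise it once the
    # jobqueue-job Grafana backlog panels linked above show it keeps up
    concurrency: 30
```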
[14:08:06] stupid question> is there an important distinction between "present" and "installed" for package resources in puppet? I think ensure_packages sets the latter
[14:11:45] # Alias the 'present' value.
[14:11:47] aliasvalue(:installed, :present)
[14:11:58] Emperor: ^ no difference
[14:14:44] thanks :) [was flagged as a diff by pcc]
[14:34:36] Emperor: fyi it used to be present but got changed in https://github.com/puppetlabs/puppetlabs-stdlib/pull/1196
[14:34:47] there is a fix but we are not on that version of stdlib yet https://github.com/puppetlabs/puppetlabs-stdlib/pull/1300
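In other words, the two declarations below end up in the same state, which is why the pcc diff is purely cosmetic: newer stdlib's ensure_packages() emits ensure => installed, and Puppet's package type aliases that back to present (the aliasvalue line quoted above). A minimal illustration; the package name is made up:

```puppet
# Equivalent declarations: 'installed' is just an alias for 'present'
# in Puppet's package type, so the pcc diff is cosmetic only.
package { 'rsync':
  ensure => present,
}

# ensure_packages() from puppetlabs-stdlib now emits ensure => installed
# (changed in stdlib PR #1196 linked above), which resolves to the same
# state as the explicit resource above.
ensure_packages(['rsync'])
```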
[14:36:08] effie: with the reorg changes starting July 1st, I'm wondering whether it makes sense to change some of the doc tree on wikitech. The current "MediaWiki" navigation is mostly a proxy for (part of) ServiceOps, which makes it less obvious how much "MediaWiki Engineering" stuff to put under there (from Perf, and from elsewhere).
[14:36:56] Krinkle: happy to discuss and make changes. Would you like to create a task and we can pick it up next Q?
[14:37:09] curious if maybe you have thoughts on what you'd like to see for your team. I could split it into "Service Ops" and "MW Eng" but maybe that's too wide for ServiceOps to have a single nav only? Alternatively could do something like "MW eng" and "MW ops" (the latter would e.g. have Memcached, Envoy, Citoid etc)
[14:37:31] <_joe_> It's very hard to split things evenly tbh
[14:37:43] we can meet halfway
[14:39:41] Note that there is also https://wikitech.wikimedia.org/wiki/Category:SRE_Service_Operations, so that will remain unchanged either way
[14:40:23] let's have a meeting to discuss it as I am in the middle of something else
[14:40:34] our next meeting?
[14:42:19] sure
[14:44:15] <3
[14:44:52] hahahah you beat me to the doc
[14:44:53] :p
[14:44:59] cheers timo
[15:35:18] we're seeing some check_puppetrun crashes on cp nodes while switching port 80 from varnish to haproxy, filed https://phabricator.wikimedia.org/T337951
[15:48:28] vgutierrez: next time you see it, if you could grab a copy of /var/lib/puppet/state/last_run_report.yaml and attach it to the task, that would be useful
[15:48:45] jbond: will do
[15:48:50] cheers
[15:48:57] fabfur: ^^
[15:49:03] just in case you spot it first :)
[15:49:28] ok
[15:50:24] vgutierrez: fabfur: I'd put the content in a WMF-NDA paste, it shouldn't have any sensitive information in it but it might, and it's too big to check manually
[15:50:55] will do, thanks jbond
[15:51:07] great, cheers. i also added a similar comment to the task
[21:22:01] godog: I'm trying out the XFF/remoteip approach as suggested, but it seems to not result in effective access control. https://gerrit.wikimedia.org/r/c/operations/puppet/+/919419/12
[21:22:13] > {"timestamp": "2023-06-01T21:16:50", "RequestTime": "6813", "Client-IP": "172.16.0.113", "Handle/Status": "application/x-httpd-php/200", "ResponseSize": "571", "Method": "POST", "Url": "http://performance.wikimedia.beta.wmflabs.org/excimer/speedscope/", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "137.220.80.57, 172.16.0.113", "User-Agent": "curl/7.87.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-",
[21:22:13] "UserHeader": "-", "Connect-IP": "172.16.0.113", "X-Request-Id": "-", "X-Client-IP": "137.220.80.57"}
[21:22:31] I'm guessing this should use X-Client-IP and not X-Forwarded-For like Grafana?
[21:22:43] given XFF includes the internally trusted ip
[21:23:31] it does seem to work for grafana indeed
[21:25:03] ah... https://gerrit.wikimedia.org/r/c/operations/puppet/+/835623/
[21:32:09] > {"timestamp": "2023-06-01T21:30:05", "RequestTime": "3233", "Client-IP": "137.220.80.57", "Handle/Status": "application/x-httpd-php/200", "ResponseSize": "571", "Method": "POST", "Url": "http://performance.wikimedia.beta.wmflabs.org/excimer/speedscope/", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "137.220.80.57, 172.16.0.113", "User-Agent": "curl/7.87.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-",
[21:32:09] "UserHeader": "-", "Connect-IP": "172.16.0.113", "X-Request-Id": "-", "X-Client-IP": "-"}
[21:32:29] That looks better. It's now taking my external IP as the interpreted "Client-IP" and apparently unsetting X-Client-IP as a side-effect.
[21:32:34] However it is still not restricting access.
[21:33:48] Hacking it up locally with curl from localhost confirms that it allows any arbitrary IP.
[21:33:49] $ curl -vvi -d 'x' -X POST -H 'X-Client-IP: 200.1.1.1' https://deployment-webperf21.deployment-prep.eqiad1.wikimedia.cloud/excimer/speedscope/
[21:38:18] left details at https://gerrit.wikimedia.org/r/c/operations/puppet/+/919419 for now
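For context on the approach being discussed: the general mod_remoteip pattern (the Grafana-style setup linked at 21:25) tends to look something like the sketch below. This is not the content of either Gerrit change; the header name, address ranges and Location path are placeholders, and the intent of the pattern is that the forwarded header is only honoured when the request comes from the trusted internal proxy, so a client-supplied X-Client-IP like the curl test above should not satisfy the Require check.

```apache
# Illustrative sketch only (assumes mod_remoteip is enabled); ranges and
# the protected path are placeholders, not the puppet-managed values.
RemoteIPHeader X-Client-IP
# Only honour X-Client-IP when the TCP peer is the trusted internal proxy;
# the header is ignored when sent directly by an arbitrary client.
RemoteIPInternalProxy 172.16.0.0/21

<Location "/excimer/speedscope/">
    # With mod_remoteip active, Require ip matches the rewritten client
    # address rather than the proxy's address.
    Require ip 198.51.100.0/24
</Location>
```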