[00:00:51] TimStarling: k, rebased to see if tests pass..
[00:02:16] I'm looking at the configuration cache breakage meanwhile
[00:02:30] Following revalidation=0
[00:02:46] https://phabricator.wikimedia.org/T311788
[00:05:45] The only way I can think of to fix this in the current model is to either get the mtime of IS.php as it was when opcache compiled it, or otherwise know its serialised contents in memory. But it seems easier at this point to just get rid of it. From my measurements, the time spent in getGlobals is less than the time to read and parse JSON. The only added cost would be the db list files for specifying the relevant wikiTags, which I intend to optimise away by making the compiled output of db expressions and YAML stuff .php instead of txt like now.
[00:06:40] Or perhaps apcu will do, either by mtime or unconditionally for a minute
[00:12:13] ok, get rid of it
[00:29:19] any update on the mcrouter-primary-dc deployment? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809326/
[00:32:46] TimStarling: nope, haven't gotten a reply from j.oe yet.
[00:33:01] I did look at some graphs earlier to see if the shift from redis had any noticeable effect one way or the other.
[00:33:33] https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-job=redis_sessions&var-instance=All&from=now-7d&to=now
[00:33:33] https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&from=now-7d&to=now
[00:33:38] https://grafana.wikimedia.org/d/000000383/login-timing?orgId=1
[00:35:32] https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?orgId=1&refresh=5m&var-metric=&var-module=centralauthtoken&from=now-7d&to=now
[00:36:19] anyway, nothing out of the ordinary. except the mediawiki api backend breakdown seems to have been going up for the past two days at: https://grafana.wikimedia.org/d/000000002/api-backend-summary?orgId=1&refresh=5m&from=now-30d&to=now
[00:36:25] not related presumably, but it did stand out
[00:40:26] why does that load graph have a y-axis measured in time units? fixing...
[00:47:32] the bottom row is cumulative walltime
[00:47:38] second to last is req/s
[00:48:00] I guess for the pie chart the values aren't helpful, could probably be a hidden label
[00:48:33] the one labelled "load breakdown (top 10)", I added that panel and wrote the description
[00:48:42] a couple of years ago probably
[00:50:02] I fixed it and saved the dashboard
[00:51:40] ack, fixed the pie chart as well
[00:52:22] I wish all services had two simple metrics: latency and load (i.e. connections, concurrency)
[00:54:52] not median or p99 latency, mean latency like I put in that API dashboard
[00:56:08] if this were my job I'd have a set of dashboards tracking those two metrics through the whole stack
[01:07:15] as part of the SLO work, SRE is doing something a bit like that, though standardising not on mean latency but on histogram buckets, so that one can e.g. answer what % of requests in the last N time were within X ms and still get an accurate answer per minute, hour, and e.g. month/quarter for routine reporting on whether we're making the goals (which is not currently something we can do, given each per-minute mean or p99 is unweighted when aggregated by Graphite)
[01:08:13] though you can certainly get the mean from that, as some dashboards do
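
[Editor's note] A minimal sketch of the APCu idea floated at 00:06:40, illustrating the "unconditionally for a minute" variant: recompute the settings at most once per minute per server rather than validating against the mtime of IS.php (which is exactly the check that revalidation=0 breaks in T311788). Function and key names are invented for illustration; this is not the actual wmf-config code.

```php
<?php
// Illustrative sketch only (names made up): cache the expensive
// getGlobals()-style computation in APCu with a short unconditional TTL.
// The other option mentioned is keying on the file's mtime, though that
// re-introduces the question of which version of IS.php opcache compiled.

function getCachedGlobals( callable $compute ): array {
	$key = 'config-globals:v1';

	$cached = apcu_fetch( $key, $success );
	if ( $success ) {
		return $cached;
	}

	$globals = $compute();
	// "unconditionally for a minute": no mtime validation, just a 60s TTL.
	apcu_store( $key, $globals, 60 );
	return $globals;
}
```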
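
[Editor's note] A rough illustration of the histogram-bucket approach described at 01:07:15: cumulative bucket counts can be summed over any window, so "what % of requests were within X ms" stays accurate whether you look at a minute or a quarter, and the mean still falls out of the total sum and count. Per-minute means or p99s, by contrast, cannot be re-aggregated without weighting. All bucket boundaries and numbers below are invented for the example.

```php
<?php
// Invented example data: Prometheus-style cumulative latency histograms
// ("le" = less-than-or-equal upper bound, in milliseconds) for two
// one-minute windows, plus the running sum of observed latencies.
$minutes = [
	[ 'buckets' => [ 50 => 800, 100 => 950, 250 => 990, 'inf' => 1000 ], 'sumMs' => 60000 ],
	[ 'buckets' => [ 50 => 300, 100 => 700, 250 => 950, 'inf' => 1000 ], 'sumMs' => 110000 ],
];

// Aggregating over a longer window is just summing bucket counts and sums,
// which is what makes the per-minute data safely re-aggregatable.
$agg = [ 'buckets' => [ 50 => 0, 100 => 0, 250 => 0, 'inf' => 0 ], 'sumMs' => 0 ];
foreach ( $minutes as $m ) {
	foreach ( $m['buckets'] as $le => $count ) {
		$agg['buckets'][$le] += $count;
	}
	$agg['sumMs'] += $m['sumMs'];
}

$total = $agg['buckets']['inf'];
$within100ms = $agg['buckets'][100] / $total;  // 1650 / 2000 = 0.825 → 82.5% within 100ms
$meanMs = $agg['sumMs'] / $total;              // 170000 / 2000 = 85ms mean latency

printf( "%.1f%% of requests within 100ms; mean %.1fms\n", $within100ms * 100, $meanMs );
```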