[00:10:57] (TrafficServerRestarted) firing: ATS backend server restarted on cp5024:9122 - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server - https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin&var-instance=cp5024&var-layer=backend - https://alerts.wikimedia.org/?q=alertname%3DTrafficServerRestarted
[04:10:57] (TrafficServerRestarted) firing: ATS backend server restarted on cp5024:9122 - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server - https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin&var-instance=cp5024&var-layer=backend - https://alerts.wikimedia.org/?q=alertname%3DTrafficServerRestarted
[06:48:25] netops, Infrastructure-Foundations, SRE: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (ayounsi) On the resiliency side, this protects us from a double failure: the cr1-cr2 link to fail as well as a transport link. Low risk but still a risk. I agree...
[08:10:57] (TrafficServerRestarted) firing: ATS backend server restarted on cp5024:9122 - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server - https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin&var-instance=cp5024&var-layer=backend - https://alerts.wikimedia.org/?q=alertname%3DTrafficServerRestarted
[08:22:19] Traffic, Infrastructure-Foundations: NIC autonegotiation takes 4s in esams - https://phabricator.wikimedia.org/T344604 (Fabfur) Open→Resolved a:Fabfur This has been fixed by: - https://gerrit.wikimedia.org/r/c/operations/puppet/+/951152 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/...
[08:41:49] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (akosiaris) >>! In T340036#9062519, @vadim-kovalenko wrote: > Hi there! I'm responsible for Kiwix migration to another API,...
[08:42:55] Traffic: ATS automatically restarted due to receiving SIGUSR2 - https://phabricator.wikimedia.org/T344674 (Vgutierrez)
[08:44:36] Traffic: ATS automatically restarted due to receiving SIGUSR2 on cp5024 - https://phabricator.wikimedia.org/T344674 (Vgutierrez)
[08:45:11] Traffic: ATS automatically restarted due to receiving SIGUSR2 on cp5024 - https://phabricator.wikimedia.org/T344674 (Vgutierrez) p:Triage→Medium
[08:45:57] (TrafficServerRestarted) resolved: ATS backend server restarted on cp5024:9122 - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server - https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin&var-instance=cp5024&var-layer=backend - https://alerts.wikimedia.org/?q=alertname%3DTrafficServerRestarted
[08:46:09] Traffic, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (vadim-kovalenko) @akosiaris , I've updated regexp, and now it works, thank you!
[09:08:17] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Direct 5% of all traffic to mw-on-k8s - https://phabricator.wikimedia.org/T341780 (Clement_Goubert) Pending more hardware, we will move on to 2% first.
[12:03:35] netops, Infrastructure-Foundations, Patch-For-Review: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) @cmooney mentioned me that the previous syntax didn't work, this is because the `as-path-calc-length` term ignore...
[14:07:50] Traffic, IP Info, SRE, SRE-OnFire, IP-Blocking-Impacts: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (AlexisJazz)
[14:12:40] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (taavi)
[14:18:39] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (AlexisJazz) I'm no longer blocked as on https://en.wikipedia.org/wiki/Special:Contributions/ST47ProxyBot this message can be read: 14:13, 22 August 2023 Yamla talk contribs bl...
[14:31:37] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (AlexisJazz)
[14:31:41] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (Johannnes89) Note: Multiple dewiki users were reporting a similar problem regarding the IP 10.80.1.7 which also doesn't belong to those users (same /28-range as 10.80.1.11) https...
[14:33:20] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (ssingh) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/951508 @taavi has rolled this out so this should be resolving shortly. Thanks for filing the task.
[14:33:33] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (taavi) a:ssingh
[14:37:07] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (Yamla) Feel free to lift my block on https://en.wikipedia.org/w/index.php?title=Special:Log/block&page=User%3AST47ProxyBot once this fix is deployed and working. No need to consu...
[14:40:42] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (taavi) Open→Resolved
[14:49:09] Traffic, SRE, Patch-For-Review: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (AlexisJazz) https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&callback=&format=json&formatversion=2 now returns my actual IP. Thanks for the qui...
[15:33:25] Traffic, SRE: Blocked on English Wikipedia / Wikimedia thinks my IP is 10.80.1.11 - https://phabricator.wikimedia.org/T344704 (AlexisJazz) >>! In T344704#9109942, @Johannnes89 wrote: > Note: Multiple dewiki users were reporting a similar problem regarding the IP 10.80.1.7 which also doesn't belong to tho...
[18:16:35] <_joe_> I was telling suk.he but it's worth repeating it: there is a surge of approx 1k rps in eqiad since repooling esams; this reflects almost 1:1 the drop in the total "cache_text fresh backend" in esams+drmrs since re-introducing esams
[18:16:45] <_joe_> from 2.1k/s to ~ 1k/s
[18:17:03] <_joe_> because while drmrs is still around 900/s, esams is only 100/s
[18:17:16] <_joe_> which I think is the cause of the higher request rate to the backends
[18:17:56] I got lost above, "surge" vs 2.1k->1k/s, and esams being smaller than drmrs?
[18:18:17] we've been discussing the still-elevated rate of appserver reqs in eqiad though
[18:19:04] <_joe_> bblack: drmrs was doing 2.1k cache hits per second in ats before repooling
[18:19:10] <_joe_> now it's doing 900
[18:19:12] part of it's the single-backend model that esams switched to while moving + swapping hardware. that has at least somewhat elevated the reqrates everywhere it's been done before, especially with text on still having the one-disk config (being fixed, probably in Q2)
[18:19:25] ah ok
[18:19:32] <_joe_> bblack: yeah that explains what we're seeing
[18:19:55] <_joe_> in esams right now cache text is doing 100 hits vs 4000 misses if I read grafana correctly
[18:20:34] <_joe_> I do see a slow growing trend though
[18:20:41] we already decided we don't want to stick with this "smaller disk size in text cluster" cost optimization after seeing it in ulsfo+eqsin. there's budget in Q2 to fix those two sites. eqiad's the next site due for this hw refresh, and it will have expanded text disks from the start
[18:20:57] esams got missed in the shuffle and still has the smaller-text-disks config
[18:21:48] <_joe_> bblack: ack
[18:22:13] <_joe_> operatively, we can live with the additional queries for now, let's see if things improve somewhat between now and tomorrow
[18:22:24] anyways, sukhe is gonna work that out with willy for the medium-long term (get esams text disk expansion in q2)
[18:22:29] <_joe_> maybe it will settle to about 500 hits :)
[18:22:35] <_joe_> glad to hear
[18:22:48] <_joe_> our "savings" on hardware I think cost the org a lot of money in the end
[18:22:58] yeah
[18:23:37] well, we were aiming for a certain budget threshold or whatever when we did the first orders of this config, and that was the cost compromise: hope that half-sizing the text disks would be ok :)
[18:26:20] yeah