[01:59:43] 10Traffic, 10SRE, 10vm-requests: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (10ssingh) Thanks for all the help and sorry it took a while! I think 10G should be fine for now given the current usage on the other Wikidough hosts.
[10:49:55] ema: is something going on with the ulsfo CP servers? they're pulling a ton of data from codfw, possibly saturating the ulsfo-codfw link
[10:50:51] see https://librenms.wikimedia.org/device/device=91/tab=port/port=8039/ for example
[10:52:29] vgutierrez: ^ ?
[10:55:41] bblack: ^ ?
[10:58:27] hmm
[11:01:01] looks like it's going down now
[11:02:57] that's the overall usage fyi: https://librenms.wikimedia.org/graphs/to=1622718000/id=16787/type=port_bits/from=1622696400/ topping out at 8Gbps
[11:03:48] could the caches suddenly have become cold?
[11:05:41] Relatively large usage increase starting from approx 04:00 UTC today. There was a bit of a dip just prior to that, starting about 03:20.
[11:06:32] Then another jump up in usage starting about 07:30, climbing until very recently.
[11:08:08] So a jump in usage about 7 hours ago, then a further jump 3.5 hours back.
[11:09:05] topranks: you can see all ulsfo external traffic in https://librenms.wikimedia.org/bill/bill_id=10/ btw
[11:09:16] and hover over individual links
[11:10:01] Thanks, that's a nice single place to get it :)
[11:10:14] XioNoX: Apple and Bing crawling can give that impression
[11:11:34] vgutierrez: I'd expect in that case that the traffic we're pulling from codfw would be reflected in what we're sending to Apple/Bing?
[11:23:17] Levels seem to be on the way up again. I could be imagining it but it seems there is a "ramp up, drop off a bit, start ramping up again" pattern.
[11:25:36] https://usercontent.irccloud-cdn.com/file/VYQ1N8H8/image.png
[11:30:06] yeah so there's definitely a visible increase in bytes received on all cache upload hosts in ulsfo -- that's cp402[1-6]
[11:30:17] see for example https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=66&orgId=1&from=now-2d&to=now&var-site=ulsfo%20prometheus%2Fops&var-instance=cp4021
[11:33:35] raw request rate is more than 2x compared to last week, but still, in ulsfo that does not mean very much at all (1.8K rps right now)
[11:34:01] see https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?orgId=1&var-cluster=upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1622678779113&to=1622719905513
[11:36:05] ema: why doesn't the TX match the RX?
[11:37:55] XioNoX: you'd expect them to kind of match when looking at a reverse proxy, right? Whatever we receive from the origins we send to the clients. However, we do cache :)
[11:39:04] is RX here from the user or from codfw?
[11:39:05] so yeah, in the case of, for example, a significant increase in cache hits for large objects, you can easily imagine that tx would go up significantly while rx does not (HTTP requests are much smaller than responses)
[11:39:29] XioNoX: rx and tx are from the point of view of the host network card
[11:39:47] https://usercontent.irccloud-cdn.com/file/yl41u2kh/image.png
[11:39:51] ah ok
[11:40:02] where do they come from, where do they go, we cannot tell from that graph
[11:40:05] ^^ extremely rough visual comparison of the traffic on the transport link versus that graph.
[11:40:05] so yeah, RX should always be < TX on a cache?
[11:40:12] Doesn't seem to correlate.
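(Editor's note: the rx/tx asymmetry discussed around 11:36-11:40 falls out of simple arithmetic. A minimal sketch of that reasoning follows; only the ~1.8K rps figure comes from the chat, while the mean object size and hitrates are made-up illustrative assumptions.)

```python
# Back-of-the-envelope model of NIC traffic on a caching reverse proxy.
# Hypothetical numbers: only the ~1.8K rps figure is from the chat above;
# the mean object size and hitrates are illustrative assumptions.

def cache_host_traffic(req_rate_rps: float, mean_resp_bytes: float, hitrate: float):
    """Return (tx_to_clients, rx_from_origin) in bits per second."""
    tx = req_rate_rps * mean_resp_bytes * 8                  # every response is sent out to a client
    rx = req_rate_rps * mean_resp_bytes * 8 * (1 - hitrate)  # only misses are fetched from the origin
    return tx, rx

for hitrate in (0.95, 0.70):
    tx, rx = cache_host_traffic(1800, 800_000, hitrate)
    print(f"hitrate={hitrate:.0%}: tx={tx / 1e9:.2f} Gbps, rx={rx / 1e9:.2f} Gbps")
```

With the same client-facing tx, dropping the hitrate from 95% to 70% multiplies origin-facing rx by six, which is why RX < TX normally holds on a cache and why the gap narrows when caches go cold.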
[11:41:31] topranks: cache-host-drilldown does seem to correlate though, doesn't it
[11:41:35] https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=66&orgId=1&from=now-24h&to=now&var-site=ulsfo%20prometheus%2Fops&var-instance=cp4021
[11:45:14] Yeah absolutely, traffic is coming and going to those CP boxes. Lines up exactly.
[11:46:48] right, so we know for sure that it's the upload cluster and not text
[11:48:43] I notice the "local backend cache hitrate" in ulsfo (from the first dashboard you posted, ema) is way down.
[11:49:56] topranks: it is, but it seems to be compensated almost precisely by "requests served by frontends"
[11:50:31] ok, you'll need to excuse my lack of knowledge a little here :)
[11:50:54] no, that was a great observation
[11:51:53] so the way our CDN works is that we have a cache frontend layer (varnish), a cache backend layer (ats), and misses at the backend layer go to the origins
[11:52:21] cool thanks.. makes sense.
[11:54:51] now if there's an increase on the ulsfo<->codfw link you'd expect to see a matching increase in the amount of bytes received by the ats backend layer
[11:55:14] but I don't see that:
[11:55:16] https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?viewPanel=6&orgId=1&var-datasource=ulsfo%20prometheus%2Fops&var-layer=backend&var-cluster=upload
[11:57:39] It's odd, there is a drop there starting at 03:20 which lines up with the drop I mentioned on the links between sites, before the first big jump in traffic at 04:00.
[11:58:36] But as you say, the overall level on that graph is lower, which should mean less traffic on the link
[11:59:52] is it possible that a request to commons (upload.wo) causes a cache miss on the ulsfo side, which then goes to fetch the item in codfw, but then doesn't send it to the user? (like the connection has been closed in between)
[12:04:31] XioNoX: everything is possible :) However, that is unlikely to happen often enough to be visible like what we're seeing now, plus you'd expect the ATS backend to account for the bytes received anyway
[12:08:57] if it's an anomaly over previous days (and/or this day of the week a week ago, etc), it's likely the "user" patterns have shifted.
[12:09:25] about that last page https://librenms.wikimedia.org/device/device=162/tab=port/port=15261/
[12:09:28] basically, the user-facing req volume may be similar to this time period, but more of them are unique
[12:09:58] not sure which hosts are the cause yet
[12:10:30] (e.g. there's a new client showing up since 04:00 which doesn't add much to the total RPS on the frontend side, but they're scanning all URLs and thus have a low backend hitrate and cause significantly more transport backhaul traffic than "natural" users who see more cache hits)
[12:11:51] it can be hard to make sense of that scenario on graphs, because it exposes the hidden hitrate variable in correlating the front and back traffic, which has an outsized effect for very small changes.
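(Editor's note: to make the "hidden hitrate variable" point at 12:10-12:11 concrete, here is a minimal sketch of the two-layer model described at 11:51 — varnish frontend, ATS backend, misses to the origin. All of the numbers are assumptions chosen only to illustrate the effect, not measured values.)

```python
# Two-layer cache model: frontend (varnish) -> backend (ATS) -> origin.
# Transport backhaul is driven by requests that miss BOTH layers, so a small
# change in backend hitrate has an outsized effect on inter-site traffic.
# All numbers below are illustrative assumptions, not measurements.

def backhaul_rps(req_rate_rps: float, fe_hitrate: float, be_hitrate: float) -> float:
    """Requests per second that reach the origin over the transport link."""
    return req_rate_rps * (1 - fe_hitrate) * (1 - be_hitrate)

baseline = backhaul_rps(1800, fe_hitrate=0.85, be_hitrate=0.90)  # "natural" traffic mix
scanner = backhaul_rps(1800, fe_hitrate=0.85, be_hitrate=0.60)   # URL-scanning client drags the be hitrate down
print(f"origin fetches: {baseline:.0f} rps -> {scanner:.0f} rps "
      f"({scanner / baseline:.1f}x) with no change in frontend request rate")
```

In this toy example the frontend graphs barely move while the origin-facing fetch rate quadruples, which is roughly the shape of the pattern being chased in the chat.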
[12:12:18] what's puzzling is that I don't see an increase in response bytes at either the varnish or the ats-be level
[12:13:33] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&from=now-2d&to=now&var-cluster=cache_upload&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[12:13:50] text is a little "interesting" too, but upload's deviance seems more pronounced
[12:14:08] but yeah, something is abnormal with the user requests
[12:14:56] vs the day/week before there's a spiky higher volume of reqs, and the dispositions (hitrate, int errors, etc) zig and zag with the reqrate too
[12:15:21] they seem to cause a lot of "int" (which is varnish-sent errors or redirects)
[12:16:04] but it seems like they really started much earlier (08:00 the previous day) and it's just notably worse since ~04:00 today.
[12:21:25] https://w.wiki/3Rpa
[12:21:44] ^ shows the int spikes on upload@ulsfo the past week. it's a pattern going back a while, just worse now.
[12:22:24] what's "int" ?
[12:22:34] ah, you said it above
[12:22:38] yeah
[12:23:07] so if "int" is correlating with transport usage and not much outbound, it could mean the novel requests are mostly clients giving up on big transfers in a way that wastes them
[12:23:47] they request large file foo, varnish starts a big transfer from codfw swift, the client goes away and never gets the bytes (getting the int error instead), but the transport still happens on the back side for the whole file
[12:24:26] bblack: you'd see those at the ats-be layer accounted as bytes received though
[12:24:45] yeah, maybe
[12:25:25] unless ATS accounting is only done when the whole response body is slurped by "the client"
[12:25:42] yeah, it could be falling into some crack in the stats like that
[12:25:54] unfortunately we don't have per-cgroup network data
[12:26:00] getting 502s from turnilo now :P
[12:26:14] it would be nice to distinguish between bytes rx by varnish vs ats that way
[12:26:24] but we only do cpu/memory accounting using cgroups, not net
[12:26:45] switching chans...
[13:35:41] 10Traffic, 10netops, 10SRE, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10cmooney) I fear we could be quite disappointed about "corporate workstations" being on IPv6 if we went to look ;) Either way I assume we want this li...
[14:54:20] 10netops, 10Analytics, 10SRE: Audit analytics firewall filters - https://phabricator.wikimedia.org/T279429 (10Ottomata) Ok, for the kafka term, we no longer need any logstash hosts. kafka logging cluster used to be colocated on a few logstash hosts, but no longer, they are all on kafka-loggingXXXX. This [[ ht...
[15:01:42] 10Traffic, 10netops, 10SRE, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) Peerings to doh3001 and doh3002 added on cr1-esams and cr2-esams now. Anycast range is being announced and from here in Ireland I'm hitting doh3...
[15:52:30] 10Traffic, 10netops, 10SRE, 10Patch-For-Review: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) There was an issue with peering to doh3002 due to a problem that occurred with Netbox automation, triggered by the VM creation running twice I be...
[15:55:46] \o/, thanks topranks and XioNoX!
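(Editor's note: a rough way to sanity-check the abandoned-transfer theory from 12:23-12:26 is to estimate how much backhaul a modest rate of given-up large fetches would generate. This is purely a what-if calculation with assumed numbers, not data from the incident.)

```python
# What-if: clients request large files, receive an "int" error after giving up,
# but the full object is still pulled from codfw swift over the transport link.
# The abandoned-request rate and object size below are assumptions for illustration.

def wasted_backhaul_gbps(abandoned_rps: float, object_bytes: float) -> float:
    """Transport bandwidth consumed by transfers the client never receives."""
    return abandoned_rps * object_bytes * 8 / 1e9

# e.g. 50 abandoned requests/s for ~10 MB media originals
print(f"{wasted_backhaul_gbps(50, 10_000_000):.1f} Gbps of backhaul with no matching client-facing tx")
```

A few tens of such requests per second would be almost invisible in the client-facing request-rate and tx graphs, yet could account for gigabits of codfw->ulsfo traffic; whether it shows up in ATS byte accounting depends on the caveat raised at 12:25 about when the response body is counted.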
[17:37:36] 10Traffic, 10SRE, 10ops-eqiad: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Cmjohnson) 05Open→03Resolved @ema Replaced the DIMM A6, powered on and replacement recognized. Message PR1: Replaced part detected for device: DDR4 DIMM(Socket A6). Booted to the OS Cleared th...
[18:09:38] 10Traffic, 10SRE, 10vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10ssingh)
[18:15:16] 10Traffic, 10SRE, 10vm-requests: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10ssingh)
[18:34:18] 10Traffic, 10SRE, 10ops-eqiad: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) 05Resolved→03Open a:05Cmjohnson→03ema https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1087
[18:38:43] 10Traffic, 10SRE, 10ops-eqiad: cp1087 down with hardware issues - https://phabricator.wikimedia.org/T278729 (10Dzahn) END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729
[19:22:11] 10Traffic, 10SRE, 10Patch-For-Review, 10User-ArielGlenn, and 2 others: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10Dzahn) checked on install* that nginx-full is gone, nginx-light is there and restarted nginx to be sure; this did not remove other nginx-* module packages though
[19:45:44] 10Traffic, 10SRE, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) doh5001 has been created but doh5002 hit resource limits here as well, even though we just used 10G disk, it is maybe another resource: ` dzahn@cu...
[19:49:23] 10Traffic, 10SRE, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) @ssingh @BBlack Our issue over here is lack of the resource of .. public IPs, it looks: ` 13729 File "/usr/lib/python3/dist-packages/spicerack/_...
[19:53:28] 10Traffic, 10SRE, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) {F34479670}
[19:54:19] bblack: eqsin public network is full .. https://phabricator.wikimedia.org/F34479670
[19:54:28] just need one more IP though
[19:55:22] https://netbox.wikimedia.org/ipam/prefixes/28/ip-addresses/
[20:13:32] huh?
[20:13:38] looking...
[20:15:34] mutante: yeah that's... not great.
there's plenty of IP space in the /24 in general, but yeah, maybe we should've given more than a /28 to public1-a
[20:15:44] err, public1-eqsin
[20:16:56] esams public1 is a /25 for comparison, whereas ulsfo is a /28 like eqsin
[20:17:38] XioNoX: ^ when you have time to take a peek, curious about your thoughts
[20:19:11] in the ulsfo case, there are empty /28s right after public1 that we could expand into with some pain, but sandbox1 is adjacent to public1 in eqsin
[20:20:10] in both cases we might have to take some site downtime and renumber to fix it properly
[20:21:30] (obviously we could make a public2 without downtime, but then we've got two vlans relatively pointlessly)
[20:22:49] ACK, so the makevm cookbook failed again due to lack of resources, but unlike the other day it did not talk about "disk" at all, so it wasn't as obvious why; it took some log digging to find that the resource in question was actually IP addresses
[20:42:36] 10Traffic, 10Okapi [Wikimedia Enterprise], 10SRE: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10RBrounley_WMF) Checking in here @Eugene.chernov, any blockers?
[20:50:25] 10Traffic, 10Okapi [Wikimedia Enterprise], 10SRE: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10BBlack) @RBrounley_WMF I think he's waiting on me, sorry! Will sync up with him
[20:53:35] mutante: just saw, thanks
[20:54:00] so let's just go with one for now; I think that's fine for testing
[20:54:44] I will update the patches
[20:59:41] sukhe: ACK, ok. I'll install an OS on that after https://gerrit.wikimedia.org/r/c/operations/puppet/+/698047
[21:04:17] thank you!
[22:22:02] 10Traffic, 10Okapi [Wikimedia Enterprise], 10SRE: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10RBrounley_WMF) No problem, thanks @BBlack!
[22:42:16] 10Traffic, 10SRE, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10colewhite) p:05Triage→03Medium
[23:07:13] 10Traffic, 10SRE, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqsin - https://phabricator.wikimedia.org/T284246 (10Dzahn) a:03Dzahn
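(Editor's note: for reference, the prefix arithmetic behind the earlier "eqsin public network is full" failure and the /28 vs /25 comparison at 20:15-20:17 can be checked with Python's standard-library ipaddress module. The prefixes below are documentation placeholders, not the real public1-eqsin/public1-esams subnets.)

```python
# Quick check of how many addresses a /28 vs a /25 public vlan can hold.
# 192.0.2.0/28 and 192.0.2.0/25 are placeholder documentation prefixes,
# not the actual public1-eqsin / public1-esams allocations.
import ipaddress

for prefix in ("192.0.2.0/28", "192.0.2.0/25"):
    net = ipaddress.ip_network(prefix)
    usable = net.num_addresses - 2  # minus the network and broadcast addresses
    print(f"{prefix}: {net.num_addresses} addresses, {usable} usable before gateway/shared-IP overhead")
```

A /28 leaves only around a dozen assignable host addresses per site once the gateway and any shared addresses are carved out, which is consistent with eqsin running out while an esams-style /25 (126 usable) would not.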