[00:02:57] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:07:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[02:48:56] (HAProxyEdgeTrafficDrop) firing: 40% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:03:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:05:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:15:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:11:34] there are alerts for ecdsa/rsa certs on haproxy expiring in a few days, known?
[09:09:56] indeed
[09:10:03] we got the new digicert-2022 in place already
[09:10:22] it's getting aged a little bit before putting it into production
[09:10:28] to avoid clock skewing issues
[09:10:40] it should be ready by the end of the day :)
[09:11:56] *nod* thank you that makes sense, when known I recommend acking the alerts for alert-fatigue reasons
[12:16:59] Traffic, Infrastructure-Foundations, SRE: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (BBlack) Open→Resolved a:BBlack Should've been resolved a while back!
[12:20:32] Traffic, SRE, Patch-For-Review: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880 (BBlack) Bump - we should revisit this, but perhaps after finishing the cache role name cleanup (text vs text_envoy vs text_haproxy...).
[12:43:58] Would one of y'all have some time to help me navigate https://phabricator.wikimedia.org/T316296 and the order of what I should do, as well as if we're going about this the right way from your PoV? No urgency, we can schedule something up for later.
[13:38:56] claime: I think the patches there would do what you expect, if you start with the trafficserver part. What's going to happen when that merges is the search.wikimedia.org traffic will go down the default path towards MediaWiki, where I assume it's an unconfigured domain and would try to show them the same output as https://www.wikimedia.org/ (or a 301 to HTTPS first, most likely)
[13:39:45] I think there's probably further cleanup patches needed at other layers (maybe?), but starting with that will get over the hurdle of "see if anyone screams" before doing the rest.
[13:51:00] bblack: Yes, there is the rest of the service removal from the LB and in the catalog, plus the kubernetes cleanup for which I'm currently experimenting/writing the necessary underlying puppet code. By "starting with that" you mean only the trafficserver patch right? The dns removal and lvs_setup is for when I actually start removing the service?
[13:51:42] claime: correct
[13:52:12] the trafficserver patch is enough to break any "real" usage of the service. start with that, and then we can launch into all the other decom later after whatever grace period
[13:52:44] Are there further actions to take after merging that first patch? cookbook to run, or services to restart, or is it all handled by puppet?
[13:52:55] it should all be handled by puppet
[13:53:14] Fantastic, thanks for the info
[13:59:59] claime: the one thing to be aware of there (not necessarily a concern) is that ofc puppet runs will be gradual over 30 minutes ... so the result will be inconsistent between different IP addresses for that duration
[14:01:17] "different IP addresses" being both client and server ones, depending on who's looking from where
[14:01:25] but it will all work out within ~30 minutes or so
[14:01:51] cdanis: Yes, I figured as much. In that particular case it's not a concern, the service isn't supposed to be used, so a bit of inconsistency over 30 minutes is completely acceptable
[14:05:51] HTTPS, Traffic, SRE, serviceops, and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Vgutierrez)
[14:37:22] Can confirm that https://search.wikimedia.org/ now outputs the same as https://www.wikimedia.org/, 301 to https for http://search.wikimedia.org
[14:37:25] Thanks y'all
[14:57:46] np!
[16:54:38] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 32 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:59:38] (LVSHighCPU) firing: (2) The host lvs1020:9100 has at least its CPU 24 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[16:59:48] uh?
[17:00:01] isn't lvs1020 the secondary one?
[17:00:26] yeah?
[17:01:03] according to cumin's A:lvs-secondary-eqiad yes
[17:01:46] so what's pybal doing to eat 1 core? sigh
[17:04:19] spinning? :)
[17:05:00] the backups are the ones with the most services to manage and most healthchecks to run, since they include everything from all the others
[17:05:16] in the past, we've had problems where healthchecks were running behind on the wallclock because of it
[17:09:38] (LVSHighCPU) resolved: (4) The host lvs1018:9100 has at least its CPU 36 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:11:23] perf top says python's spending a lot of time in eval()
[17:13:29] jobrunner healthchecks have some recurrent failures lately
[17:13:38] mw1337 + mw1338
[17:14:06] none of that's probably super-relevant
[17:19:53] (LVSHighCPU) firing: (6) The host lvs1018:9100 has at least its CPU 16 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:20:08] (LVSHighCPU) resolved: (6) The host lvs1018:9100 has at least its CPU 16 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:24:53] (LVSHighCPU) firing: (7) The host lvs1018:9100 has at least its CPU 16 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:29:53] (LVSHighCPU) resolved: (6) The host lvs1018:9100 has at least its CPU 16 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:31:38] Domains, Analytics-Radar, SRE, Traffic-Icebox, and 3 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (BCornwall) Open→Resolved ` [~]$ curl -s -I https://en.wikipedia.org/ | grep Las...
[17:34:53] (LVSHighCPU) firing: (6) The host lvs1018:9100 has at least its CPU 28 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:39:53] (LVSHighCPU) resolved: (4) The host lvs1018:9100 has at least its CPU 28 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[17:56:38] (LVSHighCPU) firing: The host lvs1018:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[18:01:38] (LVSHighCPU) resolved: The host lvs1018:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[18:12:38] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 38 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[18:17:38] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 38 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[21:30:45] Traffic, Discovery-Search, Observability-Alerting: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (bking)
[21:31:08] Traffic, Discovery-Search, Observability-Alerting: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (bking)
[21:38:07] Traffic, Discovery-Search, Observability-Alerting: Use DNS name instead of IP in PyBal alerts - https://phabricator.wikimedia.org/T322377 (RKemper)