[00:34:40] Traffic, SRE: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (nshahquinn-wmf) a:nshahquinn-wmf→Fabfur Thanks for the reminder! The list is at P52488. A few wikis listed seem not to have a mobile site, either lacking the footer link entirely or giving a D...
[08:42:30] XioNoX: do you wanna move forward with https://gerrit.wikimedia.org/r/c/operations/dns/+/931992 :?
[08:43:24] vgutierrez: yeah, I don't see any blockers for it
[08:43:38] vgutierrez: can you do a quick review?
[08:46:19] XioNoX: data comes from https://phabricator.wikimedia.org/T337318#8953809, right?
[08:46:34] vgutierrez: yeah
[08:46:59] doesn't need to be a full thorough review, but just that there is no obvious mistake
[08:47:10] I already checked but a 2nd pair of eyes is welcome
[08:48:11] tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
[08:48:25] funny :)
[08:48:41] I wonder if some of the choices were made to make the eqiad/codfw split more even
[08:48:51] at the small cost of very few ms
[08:49:04] hmmm not really
[08:49:23] probably the opposite
[08:49:26] asking just because basically all the changes are moving eqiad first instead of codfw
[08:49:50] vgutierrez: I mean the existing ones
[08:50:08] eqiad has way more traffic than codfw
[08:50:16] of course it could be a problem of population distribution
[08:50:18] exactly and we're adding more
[08:54:09] ulsfo gets a Canadian state.. but yeah
[08:54:27] basically those that change the main DC are adding more traffic to eqiad
[08:54:27] https://en.wikipedia.org/wiki/File:U.S._states_and_territories_by_population_density.svg
[08:54:30] :D
[09:32:03] I've found some efforts to get less traffic to ulsfo like https://github.com/wikimedia/operations-dns/commit/6322a1b094189c79c5d5b5d384ccbe5efdf43fdf
[09:32:30] (reverted afterwards)
[09:32:50] we got some in our puppet history for unloading eqiad a little bit while refreshing esams 4 years ago
[09:34:05] I guess we could use some context from bblack
[09:34:44] vgutierrez: ok! no blockers from I/F or Netops, if you think it's good for traffic feel free to deploy it
[11:20:00] Could we try routing traffic to wikifeeds again today? It was (mostly) successful last time so I'm not worried about the rollout as far as this change is concerned https://gerrit.wikimedia.org/r/c/operations/puppet/+/956895
[11:20:13] Main issues last time were in the service and the mobile app, which we've since addressed in both
[11:46:31] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Volans) I found a bit of time to play with some of the above mentioned solutions and those are my findings. ####...
[12:15:09] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (cmooney) After speaking to @ayounsi I have a better idea of how we intend to use the "routed mode" ganeti. In many ways it's similar to what I propose above: * Both ha...
[12:59:34] netops, Ganeti, Infrastructure-Foundations, SRE: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (akosiaris)
[13:48:22] Traffic, SRE, Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) Thanks @nshahquinn-wmf , I've started working on this, obviously we will add rules for mobile domain redirect only for domains that have a real mobile counterpart...
[14:30:54] Any possibility of doing the above today please? <3 shouldn't even need a puppet-stopping given how the last rollout went
[14:33:07] I believe there's a spurious "pybal backend checks" alert firing, I'd like to restart pybal on lvs2014 and lvs1020 (standby)
[14:37:23] hmmm
[14:37:27] https://www.irccloud.com/pastebin/lFWr5QB7/
[14:37:37] godog: could you provide a valid service config for thanos-web_443?
[14:38:18] hnowlan: sure
[14:42:28] vgutierrez: to make sure I understand, the alert says "marked down but pooled" and that output says that the server is up, that looks like a mismatch?
[14:42:55] godog: sure, I'm just suggesting that the mismatch could be triggered by the impossible depool_threshold
[14:43:08] re: depool threshold, this service went from 4 servers to 6, before that change we always kept one server pooled and pybal was fine with it
[14:44:38] we can definitely tune it too, that seems like a good idea regardless
[14:45:26] also I'll be removing the thanos-fe hosts from that service soon, with https://gerrit.wikimedia.org/r/c/operations/puppet/+/956888
[14:45:39] which then should be fine wrt depool threshold I think
[14:49:07] vgutierrez: thanks! on second thought I'll do the usual dance with cp2037 if that's ok
[14:49:20] hnowlan: sure
[15:00:33] change looks good, enabling puppet
[15:04:58] hnowlan: hmm no cache-control header?
[15:06:50] https://www.irccloud.com/pastebin/bOMpDppp/
[15:07:09] hnowlan: maybe I'm missing something obvious..
[15:07:33] vgutierrez: looking - I thought this had been addressed for wikifeeds :/
[15:09:15] :_)
[15:11:46] this was added to the service but might be disabled somehow, following up with them
[15:12:00] `curl https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/feed/onthisday/all/02/06 -v > /dev/null` no header here
[15:14:08] https://github.com/wikimedia/mediawiki-services-wikifeeds/commit/91f5c801bd2b2e3e708fee49bfc05dc2a526aca1
[15:17:29] `curl -v -o /dev/null https://wikifeeds.svc.codfw.wmnet:4101/en.wikipedia.org/v1/aggregated/onthisday/all/01/01` does - seems the non-aggregated endpoint doesn't have it
[15:26:17] netops, Infrastructure-Foundations, SRE, Traffic-Icebox, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (cmooney) The problem getting them by ASN is that there may be "collateral damage" sometimes. i.e. If you pull th...
[15:32:49] vgutierrez: fixed - there was a rule ordering issue that applied the non-all rule to the all rule first
[15:34:15] fix isn't live in eqiad, right?
[15:34:21] sorry, codfw
[15:34:39] no, deploying now though
[15:34:54] ack
[15:35:01] done
[15:35:10] nice
[15:35:17] (already double checked)
[15:38:02] hello folks, I have a quick question about the caching config for stream.wikimedia.org - I see that we have caching "pipe" for /v2/stream, I am trying to find traces of new clients in webrequest and I don't see much.. I guess because of the pipe config?
[15:41:30] elukey: correct. "pipe" more or less means "straight tcp proxy", at which point we get no deeper analytics on the requests inside the pipe.
[15:44:10] bblack: ack! So basically eventstreams in the backend gets to see the HTTP request after having the TCP conn proxied by Varnish (so any HTTP header etc.. that I am interested in should be found, hopefully, logging it in the backend code)
[15:44:41] (I am basically trying to track down consumers of streams since I need to deprecate one)
[15:48:17] yeah only eventstreams itself will know
[15:49:03] technically varnish does parse the initial request, just to be able to see that eventstreams was the destination, before it flips to raw proxy mode. I don't think that gives us any analytics on that first request though (because of how our analytics are hooked into the response side of things)
[15:50:21] yeah I don't see anything in webrequest logs, it is unfortunate since we don't have a strong set of logs to review for ES.. I'll try to come up with something on the backend side. Thanks a lot!
[16:22:36] XioNoX, the US/CA change has been merged
[16:31:27] sweet!
[16:41:38] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) p:Triage→Low
[16:41:47] Traffic, WMF-Legal, Patch-For-Review, Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) a:BCornwall
[16:43:38] Traffic, SRE, Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (Fabfur) We started adding the `*.wikimedia.org` domains to the Varnish configuration, some notes: Currently we have these domains without a mobile (m..wikimedia.org) coun...
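(editor's note) The manual `curl -v` check for a missing cache-control header, from the 15:04 to 15:35 exchange above, could be scripted roughly as follows. This is only a minimal sketch: `has_cache_control` and `fetch_headers` are hypothetical helper names that do not appear in the log, the sample header values are invented for illustration, and nothing here reflects how wikifeeds or the Traffic tooling actually performs the check.

```python
from typing import Dict, Mapping
from urllib.request import urlopen


def has_cache_control(headers: Mapping[str, str]) -> bool:
    """Return True if a non-empty Cache-Control header is present.

    HTTP header names are case-insensitive, so names are lowercased
    before comparison.
    """
    return any(
        name.lower() == "cache-control" and bool(value.strip())
        for name, value in headers.items()
    )


def fetch_headers(url: str, timeout: float = 5.0) -> Dict[str, str]:
    """Hypothetical helper: GET a URL and return its response headers."""
    with urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())


if __name__ == "__main__":
    # Invented sample data standing in for two endpoint responses.
    without_header = {"Content-Type": "application/json"}
    with_header = {"content-type": "application/json",
                   "cache-control": "s-maxage=300"}
    print(has_cache_control(without_header))  # False
    print(has_cache_control(with_header))     # True
```

Against a reachable endpoint, `has_cache_control(fetch_headers(url))` would give the same yes/no answer as reading the `curl -v` output by eye; the internal URLs quoted in the log are only resolvable from inside the production network.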
[17:28:56] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm
[18:49:21] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors: - cp4052 (**FAIL**) - Downtimed on Ic...
[19:15:04] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm
[21:46:45] (HAProxyRestarted) firing: HAProxy server restarted on cp4052:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4052&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[21:47:42] (SystemdUnitFailed) firing: haproxy.service Failed on cp4052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:44] (VarnishHighThreadCount) firing: (6) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:53:27] (PurgedHighEventLag) firing: (6) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[22:53:44] (VarnishHighThreadCount) firing: (28) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[22:57:54] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors: - cp4052 (**FAIL**) - Removed from Pu...
[22:58:26] (PurgedHighEventLag) resolved: (9) High event process lag with purged on cp5017:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[22:58:44] (VarnishHighThreadCount) firing: (29) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:01:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp4052:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=ulsfo%20prometheus/ops&var-instance=cp4052&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[23:02:42] (SystemdUnitFailed) resolved: haproxy.service Failed on cp4052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:08:45] (VarnishHighThreadCount) firing: (30) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:18:45] (VarnishHighThreadCount) firing: (29) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:23:44] (VarnishHighThreadCount) firing: (30) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:28:44] (VarnishHighThreadCount) firing: (27) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:33:44] (VarnishHighThreadCount) firing: (31) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:38:45] (VarnishHighThreadCount) firing: (46) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:43:45] (VarnishHighThreadCount) firing: (45) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:53:45] (VarnishHighThreadCount) firing: (38) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[23:58:45] (VarnishHighThreadCount) resolved: (22) Varnish's thread count is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount