[00:10:11] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732069 (Quiddity) Thanks for the drafts, both! I will add this to Tech News tomorrow, **pending your confirmation** on the wording-tweaks I've made,...
[01:49:04] * DemiMarie FYI HAProxy can be configured to delete all headers starting with X-, which would allow not doing so in VCL.
[06:49:39] Traffic, conftool, Hiddenparma: Requestctl needs to be able to check if a header is set, not just not set. - https://phabricator.wikimedia.org/T391368#10732419 (Joe) a:Vgutierrez→Joe
[07:59:04] Traffic, Liberica: Alert on control plane <-> etcd mismatches - https://phabricator.wikimedia.org/T391659 (Vgutierrez) NEW
[07:59:14] Traffic, Liberica: Alert on control plane <-> etcd mismatches - https://phabricator.wikimedia.org/T391659#10732515 (Vgutierrez) p:Triage→Medium
[07:59:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:04:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:14:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:24:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:34:40] RESOLVED: [5x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:42:00] Traffic, conftool, Hiddenparma: Rendered requestctl rules for varnish and haproxy use \r\n (Windows line endings) - https://phabricator.wikimedia.org/T391662 (Vgutierrez) NEW
[09:22:04] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732673 (Ladsgroup) >>! In T355914#10732069, @Quiddity wrote: > Thanks for the drafts, both! I will add this to Tech News tomorrow, **pending your con...
[09:26:53] thumb steps is 85%, I see the hit ratio is not bad but I wonder if the extra network is okay, 85% of the images served on wikis are slightly large now.
[09:28:19] someone said extra network ?
[09:28:47] Amir1: what's up ? do you have more info ? maybe a graph ?
[09:29:03] xD
[09:29:14] The context is this: https://phabricator.wikimedia.org/T360589
[09:29:23] plus https://phabricator.wikimedia.org/T355914
[09:30:31] TLDR: Default image size has been bumped from 220px to 250px and the rest of images are being served slightly larger to be in discrete sizes (improve cache). I assume traffic to backends should get better
[09:30:47] but from CDN to users, it should slightly go up in upload cluster
[09:31:31] I'm not seeing any noticeable bump in here for example https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1&from=now-90d&to=now&viewPanel=30
[09:32:30] noted, thx!
[09:34:28] if anything, everything looks better which is probably because we banned scrapers https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1&from=now-90d&to=now&viewPanel=31
[09:35:15] Amir1: upload@esams as a reference: https://grafana.wikimedia.org/goto/lCQouBANg?orgId=1
[09:36:49] doesn't look bad
[09:36:56] I was really worried
[09:37:26] Thanks!
[09:37:47] Amir1: and ATS upload@esams traffic: https://grafana.wikimedia.org/goto/-4bfXB0Ng?orgId=1
[09:37:56] it looks like it's getting less traffic from swift
[09:38:26] Amir1: have you measured the impact on swift servers?
[09:39:02] That should be expected, since if the cache hit ratio improves, more will be served from CDN -> less traffic to backend
[09:39:21] that is sorta main point of discrete steps
[09:39:35] sure
[09:39:43] the hit ratio hasn't improved drastically though https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=1d&var-cluster=cache_upload&var-site=All&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-90d&to=now&viewPanel=8
[09:40:21] (It'll take time, a lot will be regenerated, etc. I think it'll be more visible in a week or two)
[09:40:22] Amir1: you're checking varnish hitrate
[09:40:29] all thumbnails fit on varnish?
[09:41:38] ah, my bad. my brain hasn't moved past the varnish time
[09:42:01] where is a better place to check?
[09:44:23] hmm it seems to be the right place
[09:44:30] I was checking the queries
[09:44:41] it's rendering X-Cache data gathered by varnish
[09:44:50] but that includes ATS metrics as well
[09:44:59] misleading dashboard name :)
[09:45:49] I was double wrong => being right!
[09:46:42] I continue monitoring that and will report back
[09:46:49] Amir1: it's "interesting" how slow cache misses can be even on core DCs
[09:47:44] Amir1: see https://grafana.wikimedia.org/goto/ZD7S9BAHg?orgId=1
[09:47:58] thumb generation can be quite slow especially on large images, I think mediawiki holds an intermediary size (1280px) to produce the smaller ones out of that instead of the original, I don't know if thumbor does it too.
[09:49:02] wow
[09:49:24] One reason is that I'm deleting all thumbnails as one-off :D https://phabricator.wikimedia.org/T379942
[09:49:35] Deleting around 1TB every day
[09:49:50] (from swift)
[09:50:12] but also they are being served from HDD if they are stored in swift
[09:57:59] Amir1: 1s TTFB from the applayer IMHO isn't admissible
[09:58:43] https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices#Ballpark_numbers
[09:58:57] When performing write actions, respond within 500ms at the p99
[09:59:10] (we are seeing 1s at p95 at the moment)
[09:59:28] so yeah.. all the work designed to reduce that gap is welcomed :D
[10:03:19] The first step would be to make sure thumbor has an actual owner
[10:03:49] We could ask our multimedia team to take care of it but they are busy not existing :(
[10:06:20] lol :[
[10:24:47] Traffic: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670 (Fabfur) NEW
[10:24:59] Traffic: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670#10732856 (Fabfur) Open→In progress p:Triage→Low
[11:55:22] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733064 (Silvan_WMDE) I believe this must have been an infrastructure issue which hasn't occured any mor...
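The trade-off discussed between 09:26:53 and 09:39:21 (snapping requested thumbnail widths up to discrete steps so that fewer distinct objects need caching, at the cost of slightly larger responses) can be illustrated with a small sketch. The step list and the sample request widths below are hypothetical; the log does not give the real steps, only that the default moved from 220px to 250px. The function name `snap_width` is likewise made up for illustration.

```python
from bisect import bisect_left

# Hypothetical step widths in pixels (assumption: the real list is not in this log).
STEPS = [120, 250, 330, 500, 960, 1280]


def snap_width(requested: int, steps=STEPS) -> int:
    """Round a requested width up to the nearest step, capped at the largest step."""
    i = bisect_left(steps, requested)
    return steps[min(i, len(steps) - 1)]


# Hypothetical sample of requested widths.
requested = [200, 220, 240, 250, 300, 320]
served = [snap_width(w) for w in requested]

# Fewer distinct sizes means fewer cache objects per image (better hit ratio).
distinct_before = len(set(requested))
distinct_after = len(set(served))

# Each response is at least as wide as requested, so bytes to users can go up.
extra_px = sum(s - r for s, r in zip(served, requested))

print(served, distinct_before, distinct_after, extra_px)
```

In this toy sample, six distinct requested widths collapse into two cached sizes, which is the "improve cache" side of the argument, while every served image is equal to or wider than what was requested, which is the "extra network" concern.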
[12:14:08] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:19:08] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:13:23] Traffic, Data-Platform-SRE (2025-04-12 - 2025-05-02): Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10733260 (Gehel)
[13:16:09] Traffic, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: TLS cert for search.svc.eqiad.wmnet expired on elastic1068 - https://phabricator.wikimedia.org/T390599#10733349 (Gehel)
[13:49:43] Traffic, Experimentation Lab, Patch-For-Review, Voice & Tone: Replace ableist usage of `sane` and `insane` in libvmod-wmfuniq codebase - https://phabricator.wikimedia.org/T391633#10733647 (BBlack) @bd808 - Can you review the proposed fixups in the MR above? Thank you!
[14:19:59] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733737 (Bugreporter) >last 10 newly created Wikidata Properties Note the issue are only reported in ite...
[14:21:08] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10733743 (Bugreporter)
[15:47:09] Traffic, Liberica: Alert on depool threshold being enforced - https://phabricator.wikimedia.org/T391697 (Vgutierrez) NEW
[15:47:18] Traffic, Liberica: Alert on depool threshold being enforced - https://phabricator.wikimedia.org/T391697#10733955 (Vgutierrez) p:Triage→High
[16:05:01] Traffic: liberica fails to refresh liberica_cp_unhealthy_pooled_realservers_total metric - https://phabricator.wikimedia.org/T387880#10734003 (Vgutierrez) Open→Resolved
[16:21:51] Traffic: liberica control plane hangs if it fails to get an etcd endpoint - https://phabricator.wikimedia.org/T387278#10734051 (Vgutierrez) Open→Resolved a:Vgutierrez
[18:23:16] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734558 (RobH) The two new optics arrived for this, one spare and one to swap in. >>! In T390766#10730347, @RobH wrote: > @cmooney: So I've figur...
[18:28:09] Traffic, Data-Platform-SRE (2025-04-12 - 2025-05-02): Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10734583 (CDobbins) Thanks for looking into this. I don't know what happened, but apparently it resolved itself. Sorry about the delay in...
[18:28:25] Traffic, Data-Platform-SRE (2025-04-12 - 2025-05-02): Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10734584 (CDobbins) Open→Resolved
[19:05:26] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734723 (cmooney) >>! In T390731#10734558, @RobH wrote: > How is best to proceed? Since this is a redundant link can I just enter a remote h...
[19:54:12] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10734945 (RobH)
[22:15:59] Traffic, Liberica, Patch-For-Review: Replace current L4LB with with Katran-based alternative - https://phabricator.wikimedia.org/T332027#10735303 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half yea...
[22:16:44] Traffic, SRE: Add version flag to purged - https://phabricator.wikimedia.org/T347839#10735334 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half years (see `T380300`).
[22:29:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:34:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[23:10:52] Traffic, Infrastructure Security, Privacy Engineering, Wikipedia-Android-App-Backlog, and 2 others: Integrate In-App Internet censorship circumvention by domain fronting - https://phabricator.wikimedia.org/T327286#10735577 (violets_hide_shyly_in_cool_shaded_corners_adding_subtle_beauty) ##### The...
[23:41:41] Are Varnish ⇒ backend connections encrypted, perhaps by HAProxy?