[00:08:42] paladox: haproxy is the outer most edge of the Wikimedia CDN, so yes it sees the "real" visitor IP.
[00:08:59] ah, thanks!
[00:13:01] had to find a work around so I can get the real-ip in http-request as src would use cf ip
[00:25:32] paladox: you can set the `src` to a custom header in haproxy (usually X-Forwarded-For)
[00:26:12] check https://docs.haproxy.org/2.8/configuration.html#4.2-http-request%20set-src for example
[00:26:57] yeh took me a bit to figure out the syntax. I'm doing `http-request set-header X-Real-IP %[req.hdr(CF-Connecting-IP)] if is_cloudflare_ip cf_ip_hdr`, `http-request set-header X-Real-IP %[src] if !is_cloudflare_ip` and `http-request set-var(txn.real_ip) req.hdr(X-Real-IP)` and can then use it in http-request using var(txn.real_ip)
[00:28:30] oh yeh I saw set-src but I couldn't use src in http-request after I think similar issue to set-header hence the var (real_ip)
[00:28:47] although I guess I should set-src even if it's post all of it
[00:30:26] if you need to define acls or stick tables or such, you can use any header the same way you would use `src`
[00:30:51] oh yeh I'm using the variable in that seems to work with set-header.
[00:33:37] would set-src getting the header work with set-header? X-Real-IP is set our side using set-header and I'm not sure if it has a similar issue where you can't then fetch it without a var. fabfur
[00:36:03] I suppose `http-request set-src hdr(CF-Connecting-IP) if is_cloudflare_ip` would work?
[00:38:13] actually... in nginx we only allow setting the real-ip based on certain ips (which would be our cache proxies). So we don't want to override src.
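Editor's sketch of the decision logic discussed above: trust the `CF-Connecting-IP` header only when the connecting peer is a known proxy, otherwise fall back to the socket address. This is a Python model of that rule, not haproxy or nginx configuration. The function name `real_ip`, the `TRUSTED_PROXIES` list, and the documentation-range prefixes in it are hypothetical and stand in for whatever ACL (`is_cloudflare_ip` in the chat) is actually in use.

```python
import ipaddress

# Hypothetical trusted ranges, for illustration only. These are
# documentation prefixes, not real Cloudflare or cache-proxy addresses.
TRUSTED_PROXIES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def real_ip(src, headers, trusted=TRUSTED_PROXIES, header="CF-Connecting-IP"):
    """Return the address to treat as the client IP.

    src     -- address of the connecting peer (what haproxy calls `src`)
    headers -- dict of request headers (assumed exact-case keys)
    """
    peer = ipaddress.ip_address(src)

    # Untrusted peer: ignore any client-supplied header entirely.
    if not any(peer in net for net in trusted):
        return str(peer)

    value = headers.get(header)
    if value is None:
        # Trusted peer but no header: fall back to the peer address.
        return str(peer)

    try:
        # Only accept the header if it parses as a single IP address.
        return str(ipaddress.ip_address(value.strip()))
    except ValueError:
        return str(peer)
```

For example, `real_ip("192.0.2.10", {"CF-Connecting-IP": "198.51.100.7"})` uses the header because the peer is in a trusted range, while the same header arriving from `203.0.113.5` is ignored. Keeping the result in a separate value rather than overwriting the peer address mirrors the `txn.real_ip` variable approach from the chat, and leaves the original `src` available for the trust check itself.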
[04:57:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[05:02:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[07:32:00] Traffic, Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10720342 (Fabfur)
[07:32:21] Traffic, Patch-For-Review: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10720344 (Fabfur) In progress→Resolved
[09:11:40] FIRING: VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp3066&datasource=esams%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[09:17:20] fabfur: ^^
[09:17:32] fabfur: any ongoing operation in cp3066?
[09:17:36] nope
[09:41:40] RESOLVED: VarnishChildRestarted: varnish-text restarted on cp3066 - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&viewPanel=66&var-server=cp3066&datasource=esams%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DVarnishChildRestarted
[09:44:19] Traffic: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 (Vgutierrez) NEW
[09:44:40] Traffic: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334#10720703 (Vgutierrez) p:Triage→High
[09:46:29] netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10720712 (cmooney) @ayounsi I've noticed a few gaps starting to appear in the gnmic graphs in Grafana since the newer devices...
[10:05:39] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, SRE: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10720747 (cmooney) It seems the work yesterday has not stopped the carrier transitions reported, although the number has decreased: {F59013584 wid...
[10:12:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[10:17:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[11:05:15] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10720903 (cmooney) >>! In T387145#10713076, @Vgutierrez wrote: > reimaging them is fine by me Ok cool.
So what we should do is run the 'decom' workflow against the existing servers, b...
[11:08:39] Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338 (Fabfur) NEW
[11:10:31] Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338#10720926 (Vgutierrez) > We should allow client specifying the target symlink basepath or let him override the symlink destination completely as discussed on IRC back in the day, this is not possible...
[11:16:33] Traffic, Liberica: liberica etcd watcher obsessed over an outdated index - https://phabricator.wikimedia.org/T391340 (Vgutierrez) NEW
[11:16:35] Traffic, Liberica: liberica etcd watcher obsessed over an outdated index - https://phabricator.wikimedia.org/T391340#10720958 (Vgutierrez) p:Triage→High
[11:17:11] thumbnail steps are at 75% now
[11:17:51] Amir1: nice, thanks <3
[11:22:39] topranks: I'll get back to you on lvs1016/lvs1017 later today
[11:23:00] I'm trying to stay out of it lol :)
[11:23:11] topranks: but it should be OK given nowadays we don't need L2 adjacency
[11:23:36] yeah I did wonder that, could simplify things significantly
[11:32:13] o/ I'm removing two services from lvs (https://gerrit.wikimedia.org/r/1135008) and need to do a restart. Is that safe/okay to do some time today?
[11:33:01] and is the process in https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service the best one to follow still?
ie using cumin instead of sre.loadbalancer.restart-pybal
[11:36:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:39:09] FIRING: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:41:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:44:09] RESOLVED: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:31:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:42:04] hnowlan: happy to do in ~45 mins or so?
[12:51:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:37:40] sukhe: thanks! whenever suits
[13:41:51] Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338#10721523 (Volans) @Vgutierrez one thing that we could try is to have acme-chief return the `relative_path` and set `destination=null` in the response (I see relative path is set to None right now). T...
[13:42:42] hnowlan: ok looking now
[13:43:57] Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338#10721535 (Vgutierrez) >>! In T391338#10721523, @Volans wrote: > @Vgutierrez one thing that we could try is to have acme-chief return the `relative_path` and set `destination=null` in the response (I...
[13:45:54] hnowlan: you can do either; the cookbook will need you to ACK some alerts for it to finish, or, you can simply restart manually and log
[13:45:59] no preferences from our side
[13:46:06] (as long as we log the restart action)
[13:46:19] change looks good, I saw you already removed the DNS records
[13:48:35] great, thank you! I will get started in a few minutes
[13:57:36] Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338#10721601 (Volans) I'm not suggesting to set destination as a relative path, but to populate the `relative_path` field and set the `destination` one to null, like in this example in that page for a se...
[13:58:35] Acme-chief, Traffic: remove hardcoded certificate path from acme_chief server - https://phabricator.wikimedia.org/T391338#10721605 (taavi)
[14:12:17] about to start the restarts
[14:13:23] 👍
[14:14:02] oh heh I assume lvs1013 is liberica
[14:14:11] 1020 and 1019
[14:14:13] `Unit pybal.service not found`
[14:14:14] first the 1020
[14:14:23] ack
[14:16:26] done
[14:17:37] looks okay to me
[14:18:01] just a sec
[14:19:18] ok for me
[14:21:05] thanks - I'll continue with lvs1019 if that's okay?
[14:21:52] ack!
[14:22:45] done
[14:24:17] hnowlan: looks ok to me
[14:25:26] codfw primary is lvs2013 and secondary is lvs2014, correct?
[14:26:07] yep
[14:26:16] for reference you can also find out from the cumin aliases:
[14:26:20] A:lvs-low-traffic-codfw
[14:26:24] A:lvs-secondary-codfw
[14:26:25] etc
[14:26:51] they are automatically updated and so refer to the latest versoin
[14:26:54] *version
[14:27:04] as in, the correct relevant hosts
[14:30:06] yeah, I started off using the aliases but the eqiad secondaries list had a liberica host in it :D
[14:30:17] much easier than reading puppet either way!
[14:30:39] I'll do codfw secondary now
[14:31:00] ooo, good catch! didn't realize that it was returning the test host as well
[14:31:04] will filter it out
[14:31:16] thanks!
[14:34:40] done, secondary looks okay if someone would like to check please?
[14:35:08] sure
[14:36:02] ok for me
[14:36:16] thanks! I'll continue with primary
[14:37:58] done
[14:38:34] checking
[14:39:04] 👍 for me
[14:39:54] cool, thank you! I will do the ipvsadm cleanup on the service after this meeting
[14:40:11] ack
[14:49:11] Traffic, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: TLS cert for search.svc.eqiad.wmnet expired on elastic1068 - https://phabricator.wikimedia.org/T390599#10721999 (bking) We're waiting on a review from Traffic, will ping again in their IRC channel.
[14:49:53] Hello Traffic, does anyone have the cycles to review ^^? It's touching some modules y'all own (tlsproxy.pp) but I think Elastic hosts are the only ones actually still using nginx for TLS termination, so the stakes are pretty low
[14:51:30] inflatador: we don't own tlsproxy I think
[14:53:54] inflatador: it's interesting that you need to do that BTW
[14:54:52] actually... # the certificate renewal does not trigger any of the File
[14:54:53] # resources to get refreshed, so ensure we pick up the new
[14:54:53] # certs whenever the chain gets updated
[14:55:00] how's that possible?
[14:57:43] inflatador: oh...
I've seen your comment in the CR, puppet being down for a long time is an issue on its own and you should alert based on that
[14:58:08] if puppet doesn't run, the TLS material doesn't get refreshed so your script won't detect a difference
[14:58:24] and you'll have the same problem AFAIK
[15:02:51] netops, fundraising-tech-ops, Infrastructure-Foundations: Test prototype fundraising pybal replacement based on haproxy + anycast-healthchecker. - https://phabricator.wikimedia.org/T373942#10722080 (Jgreen) Open→Resolved Closing this as completed because we know it works. There is still m...
[15:16:07] vgutierrez thanks for taking a look. to be perfectly transparent, I think this was an edge case and I don't know that we need to do too much...especially if (as you said) it won't actually make a difference
[15:16:39] inflatador: I'd add cert monitoring in your service
[15:16:58] so you get an alert if you're using a close-to-expire-date/expired certificate
[15:17:03] we already have alerts for Puppet being down, and for SSL certs...so I think the main problem is SREs (read: me) not responding quickly enough
[15:18:37] IMHO an SSL cert expire alert shouldn't be ignored
[15:20:48] 100% agree on that ;) . We're working on better visibility for alerts
[15:22:19] Traffic, conftool, Hiddenparma: Requestctl needs to be able to check if a header is set, not just not set. - https://phabricator.wikimedia.org/T391368 (Joe) NEW
[16:12:08] I'm going to start deleting the jobrunner ipvs tcp service on lvs in a few minutes if that's okay. I'll be going in order of this morning (eqiad secondary, eqiad primary, codfw secondary, codfw primary)
[16:13:31] ok
[16:13:46] Traffic, conftool, Hiddenparma: Requestctl needs to be able to check if a header is set, not just not set. - https://phabricator.wikimedia.org/T391368#10722559 (Volans) Wild suggestion, what if we merge both proposals?
Make `header_value` to accept either a bool or a string with the following meaning...
[16:25:27] all done, thank you!
[16:28:34] you did the work :)
[17:05:15] Traffic, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: TLS cert for search.svc.eqiad.wmnet expired on elastic1068 - https://phabricator.wikimedia.org/T390599#10722733 (bking) Per IRC conversation with @Vgutierrez , there are some concerns about whether or not this would actually work...
[17:05:18] Traffic, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: TLS cert for search.svc.eqiad.wmnet expired on elastic1068 - https://phabricator.wikimedia.org/T390599#10722734 (bking) a:bking→None
[17:06:23] Traffic: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334#10722739 (Vgutierrez) the thread worker watchdog has been introduced in Varnish 6.2, from their changelog: > We have added a “watchdog” for thread pools that will panic the worker process, causing it to restart, if scheduling tasks onto...
[17:47:55] Traffic, Data-Platform-SRE: Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10722911 (Ahoelzl)
[17:48:05] Traffic, Data-Platform-SRE: Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10722914 (Ahoelzl) @Gehel can you investigate?
[21:03:35] Traffic, Data-Persistence, SRE, SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10723774 (Jdforrester-WMF) >>! In T355914#10717142, @Ladsgroup wrote: > It'd be nice to add this to next week's tech news. Worth mentioning this has bee...
[22:02:59] Traffic: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 (BBlack) NEW p:Triage→High
[22:04:00] Traffic: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10724054 (BBlack)
[22:19:30] Traffic, Experimentation Lab: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10724066 (VirginiaPoundstone)
[22:29:24] Traffic, Experimentation Lab: Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411#10724102 (BBlack)