[00:19:23] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9924471 (Papaul)
[06:31:28] netops, Infrastructure-Foundations: magru ipv6 issues - https://phabricator.wikimedia.org/T368499 (ayounsi) NEW
[08:03:10] hello folks
[08:03:33] I see "Puppet CA certificate puppetmaster1003.eqiad.wmnet is about to expire" on alerts, I'd just run the cookbook to upgrade its host certificate but I'd like to be sure since it is a puppetmaster
[08:27:40] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Update CAS to 6.6.15.2 - https://phabricator.wikimedia.org/T368503#9925169 (Peachey88)
[08:28:43] elukey: I'm 99% sure these only refer to the host key, so that should be fine. given how old these servers are it's not surprising that we see their host certs expire
[08:29:29] moritzm: I am 99% sure as well, I'll recheck the cookbook just in case and I'll run it
[08:39:31] Did anything change on idp recently? I see the following from docker-report
[08:39:34] Jun 26 08:19:20 build2001 docker-report-base[1587692]: requests.exceptions.RetryError: HTTPSConnectionPool(host='docker-registry.wikimedia.org', port=443): Max retries exceeded with url: /v2/_catalog?last=wikimedia%2Foperations-software-bitu&n=100 (Caused by ResponseError('too many 504 error responses'))
[08:40:26] (I am trying to work out why bitu is mentioned in there though)
[08:41:05] netops, Infrastructure-Foundations, SRE: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513 (cmooney) NEW p:Triage→Medium
[08:45:25] the docker registry doesn't use the IDPs, must be something different
[08:46:41] ack thanks for confirming
[09:01:25] CAS-SSO, Infrastructure-Foundations: Update CAS to 6.6.15.2 - https://phabricator.wikimedia.org/T368503#9925370 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1002.wikimedia.org with OS bookworm completed: - idp-test1002 (**PASS**) - Downtimed...
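For context on that traceback: a minimal sketch (not the actual docker-report code; the helper name and retry budget are assumptions) of how a client pages through the registry catalog with urllib3 retries. Once the retry budget is exhausted on 504 responses, requests surfaces exactly this RetryError.

    # Hypothetical sketch of docker-report-style catalog pagination with retries.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    REGISTRY = "https://docker-registry.wikimedia.org"

    def list_repositories(page_size=100):
        session = requests.Session()
        # Retry GETs on 5xx responses; when the budget runs out, requests raises
        # RetryError("too many 504 error responses"), as in the log above.
        retries = Retry(total=5, status_forcelist=[502, 503, 504], backoff_factor=1)
        session.mount("https://", HTTPAdapter(max_retries=retries))

        repos, last = [], None
        while True:
            params = {"n": page_size}
            if last is not None:
                params["last"] = last  # e.g. "wikimedia/operations-software-bitu"
            resp = session.get(f"{REGISTRY}/v2/_catalog", params=params, timeout=60)
            resp.raise_for_status()
            batch = resp.json().get("repositories", [])
            if not batch:
                return repos
            repos.extend(batch)
            last = batch[-1]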
[09:19:58] elukey: it's erroring out calling http://docker-registry.wikimedia.org/v2/_catalog?last=wikimedia%2Foperations-software-bitu&n=100 I think
[09:22:38] claime: yeah I have retried and it fails in a different step, trying to see from the registry logs if anything is not working
[09:28:56] it is interesting that I get a 504 in the browser, but on the registry's nginx I see an HTTP 499 logged
[09:29:58] we should maybe look at lengthening the nginx timeouts
[09:34:12] also one thing that I noticed is that we don't add the time spent serving the request to nginx's log format
[09:34:42] by default it uses the "combined" format, and from https://nginx.org/en/docs/http/ngx_http_log_module.html I don't see any info about time spent
[09:34:54] it could be useful to understand what the bottleneck is
[09:36:00] this one
[09:36:00] $request_time
[09:36:01] request processing time in seconds with a milliseconds resolution; time elapsed between the first bytes were read from the client and the log write after the last bytes were sent to the client
[09:36:35] but I guess that we'd need to puppetize the nginx config
[09:37:20] It's puppetized
[09:37:30] \o/
[09:37:35] checking the puppet code
[09:38:14] puppet/modules/docker_registry_ha/files/nginx.conf
[09:38:20] yep yep
[09:43:23] I think we should maybe enable the docker registry metrics and have prometheus scrape them because we're encountering a bunch of perf issues anyways
[09:43:54] ah they are
[09:48:22] claime: I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049876, I think having more info in the access log would also be useful
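For illustration, a minimal sketch of what such an access-log change could look like (the format name and chosen fields are assumptions, not the contents of the Gerrit patch): nginx's built-in "combined" format carries no timing information, so a custom log_format adding $request_time (and $upstream_response_time for the proxied leg) makes the slow side visible.

    # Hypothetical example only: "combined" plus timing fields (http context).
    log_format combined_timing '$remote_addr - $remote_user [$time_local] '
                               '"$request" $status $body_bytes_sent '
                               '"$http_referer" "$http_user_agent" '
                               'rt=$request_time urt=$upstream_response_time';
    access_log /var/log/nginx/access.log combined_timing;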
[09:54:03] elukey: sorry.. I just noticed that the TLS config is outdated
[09:54:14] netops, Infrastructure-Foundations, SRE, Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9925639 (cmooney) >>! In T367439#9921613, @ayounsi wrote: > Your proposal seems good to me. > > Adding the anycast AS makes sense, I think I in...
[09:54:18] elukey: TLSv1 and 1.1 need to be removed and 1.3 added O:)
[09:58:22] vgutierrez: I thought that Riccardo was the master at setting up highlights, but you are able to inspect patches too.. Next level :D
[09:58:42] jokes aside, I can change those as well :)
[09:59:26] I'll add you to the patch
[10:01:07] TLS 1.3 assuming those instances are running >=bullseye
[10:01:15] but I guess that's a safe assumption nowadays
[10:01:35] ah no those are busters :(
[10:02:09] yeah.. so then just drop 1.0 and 1.1 :)
[10:03:17] vgutierrez: I am a little hesitant to do it now, I am 99% sure that nothing will break but everything k8s-based pulls from the registry, maybe we do it as a follow-up?
[10:03:25] to check with serviceops etc..
[10:04:34] sure
[10:05:07] elukey: if you are using ssl_ciphersuite() to configure the vhost it should be a NOOP though
[10:05:47] vgutierrez: what do you mean?
[10:06:50] ssl_settings => ssl_ciphersuite('nginx', 'mid'),
[10:07:04] that already sets TLS versions
[10:07:45] for 'mid' it already excludes TLS1.0 and 1.1
[10:07:55] ah you mean if we were using that
[10:08:01] yeah, and you're using it
[10:08:13] modules/profile/manifests/docker_registry_ha/registry.pp:97
[10:08:31] but it should say ssl_ciphersuite('nginx', 'strong') nowadays
[10:09:35] vgutierrez: ah ok I just noticed.. but afaics the ssl_settings var is not used in the nginx template
[10:09:44] see line 118
[10:09:56] we have the following hardcoded
[10:09:57] ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
[10:11:17] elukey: registry-nginx.conf.erb:84: <%= @ssl_settings.join("\n ") %>
[10:12:12] nginx allows overriding TLS versions per server stanza
[10:13:04] ok sorry I am confused, I see 'puppet:///modules/docker_registry_ha/nginx.conf' for /etc/nginx/nginx.conf
[10:13:08] this is why I was saying that
[10:13:37] but it gets overridden by the sites-enabled config
[10:14:01] ok now we are on the same page, sorry :)
[10:15:02] yep... some insanity in that config :)
[10:16:16] okok so it is just a matter of bumping the ciphers
[10:17:26] yep
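To make the layering above concrete, a hedged sketch of how the two pieces interact (paths and the rendered values are illustrative assumptions, not the real puppetized output): the http-level ssl_protocols hardcoded in nginx.conf is superseded, for the registry vhost, by whatever ssl_ciphersuite() emits into the server block via @ssl_settings, which is why tightening only the hardcoded line would be a NOOP for that vhost.

    # /etc/nginx/nginx.conf (http context) - the old hardcoded default:
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;  # Dropping SSLv3, ref: POODLE

    # sites-enabled vhost rendered from registry-nginx.conf.erb - the per-server
    # directives emitted by ssl_ciphersuite() via @ssl_settings take precedence
    # for this server block, so the http-level line above is effectively unused.
    server {
        listen 443 ssl;
        # certificate/cipher directives omitted; illustrative rendering only,
        # the real values come from the puppet function ('mid' already drops
        # TLSv1/1.1, and TLSv1.3 would need >= bullseye)
        ssl_protocols TLSv1.2;
    }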
[12:23:49] https://github.com/ipxe/ipxe/issues/1141#issuecomment-2191547252
[14:24:34] I wrote some docs on how the d-i image gets updated in case Debian issues a new point release: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_Foundations/Debian-installer#Updating_the_netboot_image
[14:45:52] thanks
[14:46:01] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544 (Vgutierrez) NEW
[14:46:21] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9926761 (Vgutierrez) p:Triage→Medium
[14:51:00] netops, Infrastructure-Foundations, serviceops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545 (Vgutierrez) NEW
[14:51:13] netops, Infrastructure-Foundations, serviceops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9926781 (Vgutierrez) p:Triage→Medium
[15:12:34] netops, Infrastructure-Foundations, serviceops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9926889 (cmooney) > IPIP encapsulation has a 20 bytes overhead that needs to be accounted somehow, in high-traffic[12] services we chose...
[15:19:12] Puppet, Data-Persistence, database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9926900 (jcrespo) p:Triage→Low This seems to not be reproducible, maybe it was related to cold caches after reboot? Lowerin...
[15:19:46] XioNoX: great to see that progress on the /32 ipxe stuff :)
[16:57:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850#9927454 (cmooney)
[16:59:14] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:14] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:16] topranks:
[18:18:21] Are you sure you want to delete IP address 10.3.0.2/32?
[18:18:30] this will actually delete this /32 and not the entire /8 right? right?
[18:18:33] :P
[18:18:48] in Netbox?
[18:18:52] yes
[18:19:00] or should I just remove the DNS name?
[18:19:02] and not delete the IP
[18:19:30] no delete the IP
[18:19:31] https://netbox.wikimedia.org/ipam/ip-addresses/7051/
[18:19:38] ok was double checking
[18:19:45] yep it can go wrong
[18:19:45] I mean it says /32 there but :)
[18:20:12] took a screenshot as well lol
[18:20:46] where it can go wrong is if you open up the /24 prefix it is part of
[18:20:51] then go to the "ip address" tab on that
[18:20:54] https://usercontent.irccloud-cdn.com/file/XwDCeG81/image.png
[18:20:56] haha
[18:21:02] lol
[18:21:05] ^^ so in the above I have ticked the one to delete
[18:21:14] worse, somehow I am in 10.0.0.0/8
[18:21:17] clicking the bottom delete button will delete the IP, as it's ticked
[18:21:31] but the top-right delete button still refers to the entire 10.3.0.0/24 prefix :)
[18:21:37] :)
[18:21:44] this is how we have all deleted devices when trying to remove one interface etc.
[18:21:54] yeah, I deleted an entire /24 once I think. the only time, but still
[18:22:34] Arzhel has been doing great work on Netbox 4, it has a few improvements to make this problem go away
[18:22:49] yeah I heard that. thanks! the IP is gone now
[18:23:03] running the DNS cookbook to remove the DNS records
[18:23:10] perfect
[18:43:30] Also deleting a prefix doesn't delete the IPs it contains
[19:17:56] that's no fun
[19:18:12] I'll stick to deleting cr2-codfw for my kicks in that case
[21:37:34] sflow interpretation question: naively, I'd expect that querying sflow data in turnilo with measure: Bytes and split: Time (Minute) would give me bytes per minute. spot checking against what I have from host metrics, these instead look like bps :)
[21:37:34] is there some magic that corrects this to, effectively, average bps over each minute bucket?
[21:40:37] good question, cdanis might know
[22:02:04] Mail, Infrastructure-Foundations, SRE: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9928764 (jhathaway)
[23:16:20] swfrench-wmf: it should show bytes I believe
[23:16:40] the sflow protocol exports the number of frames and their size in bytes, I don't believe we extract that out to bits/sec
[23:16:51] I can ask Arzhel tomorrow, he originally set it up
[23:17:57] also bear in mind we sample 1 of 1000 packets (and then multiply that by 1000 for the stats in Turnilo), so that reduces the accuracy quite a bit
[23:19:43] also remember we sample packets *sent* by hosts only. that should capture all internal traffic though (as some internal server has to send it for it to be received elsewhere)
[23:19:55] thanks, topranks! yeah, it seems like the raw data from sflow is bytes, which was another reason I expected "bytes per minute" naively. I guess this might come down to what the turnilo dashboard is doing exactly.
[23:20:10] but not all our switches export sflow, so packets sent from hosts connected to devices that don't are effectively lost :(
[23:20:11] also thanks for the reminder about the sampling (and where the measurement actually happens)
[23:20:32] yeah it's tricky. what exactly are you looking to get?
[23:20:55] is there a host with multiple IPs and you want to see traffic levels to one of them as opposed to the aggregate interface rate?
[23:23:02] no, it's not urgent or anything, so feel free to leave it for tomorrow :)
[23:26:28] context: I'm just trying to understand the data we have from the dumps-related incident earlier this week.
specifically, to correlate what we see in sflow as aggregate volume toward mwlog1002 vs. what we can infer from host-side metrics (which show us which mw deployments were contributing the most)
[23:27:21] (we already know the answer to that last part - mw-jobrunner)
[23:27:24] ok yeah, it definitely is useful for that kind of analysis
[23:28:12] I think the biggest gap is that half of the codfw rows and 2/3 of eqiad hosts are on switches that don't export sflow
[23:28:52] this is changing as we upgrade switches, soon all of codfw will be done, next FY we'll get another 1/3 of eqiad upgraded
[23:29:20] that's very good to know re: the coverage in eqiad, thank you
[23:30:38] it definitely makes it less useful right now :(
[23:32:41] that's fair, particularly for smaller things that may be localized behind one of those switches, but for something "big" like this where it basically means 1/3 is missing, still pretty useful
[23:33:00] in any case, thank you again and go enjoy your evening :)
[23:34:33] all you'll see in it in terms of traffic to mwlog1002 will be from hosts in eqiad rows E/F (and codfw rows A/B if there was codfw traffic destined to it)
[23:36:38] effectively these hosts:
[23:36:41] https://netbox.wikimedia.org/dcim/devices/?q=&site_id=6&location_id=49&location_id=50&status=active&role_id=1&serial=&asset_tag=&mac_address=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfaces=&pass_through_ports=&has_primary_ip=&virtual_chassis_member=&local_context_data=&cf_bgp=&cf_purchase_date=&cf_ticket=
[23:39:05] that's really helpful, thank you! (also I misread your comment before, it's ~ 2/3 of hosts that will be missing)
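As a rough illustration of the arithmetic discussed above (a sketch under the stated assumptions: 1-in-1000 sampling already multiplied back up for Turnilo, one-minute buckets): converting a per-minute byte sum into an estimated average bit rate is just scaling, and whether the dashboard shows "bytes per bucket" or "average bps" depends on how the measure is defined, which is the open question here.

    # Minimal sketch of the unit conversion; the numbers are illustrative
    # assumptions, not values taken from Turnilo or the sflow pipeline.
    SAMPLING_RATE = 1000      # 1-in-1000 packet sampling on the switches
    BUCKET_SECONDS = 60       # Turnilo split: Time (Minute)

    def estimated_avg_bps(sampled_bytes_in_bucket: float,
                          already_scaled: bool = True) -> float:
        """Estimate average bits/sec for one minute bucket of sflow data.

        sampled_bytes_in_bucket: sum of the Bytes measure for the bucket.
        already_scaled: True if the pipeline already multiplied sampled frame
        sizes by the sampling rate (as described above for Turnilo).
        """
        total_bytes = sampled_bytes_in_bucket
        if not already_scaled:
            total_bytes *= SAMPLING_RATE
        return total_bytes * 8 / BUCKET_SECONDS

    # e.g. 7.5e9 "Bytes" in a one-minute bucket ~= 1 Gbps average over that minute
    print(f"{estimated_avg_bps(7.5e9):.3e} bps")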