[12:01:33] <_joe_> uh https://grafana.wikimedia.org/d/-K8NgsUnz/home?orgId=1&viewPanel=8
[12:05:01] which cluster?
[12:05:46] <_joe_> upload
[12:05:53] <_joe_> Emperor: I think it's swift eqiad
[12:06:12] Emperor: could be related to the issue you were just debugging, then
[12:10:18] I also see a reduction of 4xx at 21h, around a deployment
[12:14:11] jynus: YM T327681 ? I just spent quite some time log-diving to no utility whatsoever :(
[12:14:11] T327681: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T327681
[12:14:34] what I mean is there seems to be a trend, not just 1 error
[12:14:49] <_joe_> the issue started yesterday evening
[12:14:52] see _joe_'s link
[12:15:05] <_joe_> all from esams/eqiad, AFAICT
[12:15:36] swift dashboard doesn't show unusual error rates
[12:15:39] <_joe_> also, time_firstbyte is always 7 seconds
[12:15:56] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=codfw&var-prometheus=codfw+prometheus%2Fops
[12:15:57] <_joe_> see https://logstash.wikimedia.org/goto/b488ba13395f2522756ff21251acda74
[12:16:04] then could it be the cache <-> swift link?
[12:16:07] <_joe_> eqiad, not codfw
[12:16:43] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=eqiad&var-prometheus=eqiad+prometheus%2Fops&from=1673957796450&to=1674562536450 likewise unremarkable
[12:17:12] similarly the other swift dashboard https://grafana.wikimedia.org/d/000000584/swift-4gs?orgId=1&var-DC=eqiad&var-prometheus=thanos&from=1674476222151&to=1674562562151
[12:18:07] <_joe_> vgutierrez: how can I see the error rate by backend for trafficserver?
[12:18:29] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1
[12:19:24] hmmm swift on eqiad doesn't seem happy since yesterday ~17:00
[12:19:26] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=upload&var-origin=swift.discovery.wmnet
[12:19:40] https://grafana.wikimedia.org/goto/sVA8afo4z?orgId=1
[12:19:42] 16:38 actually
[12:19:59] <_joe_> vgutierrez: that's what I was saying, yes
[12:20:13] <_joe_> Emperor: I think it needs the usual roll restart tbh
[12:20:19] 502 errors
[12:20:45] <_joe_> yes it's 7s timeouts
[12:21:00] sigh, why are the swift dashboards lying to me again? :'(
[12:21:04] <_joe_> it might be thumbor, for all I know, timing out
[12:21:11] Emperor: my suggestion is to convert the specific task into a general task about the issue; users usually appreciate that
[12:21:42] is thumbor also under swift.discovery?
[12:22:23] jynus: AIUI, yes, thumbs all go via swift (I think the thinking was that thumbor was less able to handle the load, but it does make everything more confusing)
[12:22:28] I am guessing yes, because the only other option there is kartotherian
[12:23:12] and probably that causes a gap in the graphs, seen upstream but not on the swift proxies (?)
[12:23:16] the thumb request goes to swift, which serves it if available (essentially a slow cache); otherwise swift has a custom 404 handler (ugh) that sends the request to thumbor
[12:24:04] let me know if I can help in any way, Emperor
[12:24:27] e.g. I can even handle the ticket communication while you research more
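A quick way to see the ~7s time_firstbyte described above is to time a thumbnail fetch end to end. A minimal sketch, assuming the public upload.wikimedia.org URL below is the right reconstruction of the thumb path that appears in the proxy-server log further down; note it measures the full edge path, not swift directly:

    # time a single thumb request; a ttfb around 7s would match what the edge is seeing
    curl -s -o /dev/null \
      -w 'status=%{http_code} ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
      'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/S-train_service_C.svg/30px-S-train_service_C.svg.png'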
[12:29:04] and these logs make no sense anyway. I see Jan 24 06:42:32 ms-fe1009 proxy-server: ERROR with Object server 10.64.32.64:6010/sdc1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.a8/a/a8/S-train_service_C.svg/30px-S-train_service_C.svg.png: Timeout (10.0s) (txn: tx3420e82339e349b3b1a55-0063cf7dce) (client_ip: 212.43.126.33)
[12:29:14] but then on that backend, Jan 24 06:42:37 ms-be1062 object-server: 10.64.0.166 - - [24/Jan/2023:06:42:37 +0000] "GET /sdc1/2826/AUTH_mw/wikipedia-commons-local-thumb.a8/a/a8/S-train_service_C.svg/30px-S-train_service_C.svg.png" 304 - "GET http://127.0.0.1/v1/AUTH_mw/wikipedia-commons-local-thumb.a8/a/a8/S-train_service_C.svg/30px-S-train_service_C.svg.png" "tx3420e82339e349b3b1a55-0063cf7dce" "proxy-server 245178" 0.1186 "-" 1785 0
[12:29:29] I should probably just roll-restart the eqiad frontends
[12:29:39] +1
[12:34:13] now, how to spell that cumin query
[12:36:58] -b BATCH_SIZE -s SECONDS_OF_SLEEP_BETWEEN_BATCH
[12:37:18] no, not that bit
[12:37:24] ah!
[12:38:07] cumin "O:swift::proxy AND *.eqiad.wmnet" works but cookbook sre.swift.roll-restart-reboot-proxies --query "O:swift::proxy AND *.eqiad.wmnet" doesn't
[12:40:28] ah, got there
[12:41:00] Emperor: Does "P{O:swift::proxy} AND A:eqiad" work?
[12:41:34] claime: no, but I did try that :) the winning answer is --query "A:eqiad", because the cookbook already ANDs the query with A:swift-fe or A:thanos-fe
[12:41:41] lol
[12:42:27] I'm going to reboot the ulsfo switches shortly; the site has been depooled for a while now, so I just need to downtime services
[12:44:16] oh, sadness, the cookbook has exploded :(
[12:45:30] Oh, because it tries to restart nginx as well as swift-proxy, and the thanos frontends don't run nginx any more :(
[12:53:12] * Emperor opens T327783 for that
[12:53:39] still, the 5xx rate is looking healthier now, so lunch.
[12:55:17] oh, I had converted T327681 already
[12:55:18] T327681: High rate of upload 502 errors/timeouts (was FileBackendError: Iterator page I/O error.) - https://phabricator.wikimedia.org/T327681
[12:55:30] or you mean for the cookbook?
[12:55:51] I see
[12:56:39] feel free to resolve the other
[13:14:34] ulsfo switches status update: only one of the two came back up after a basic reboot
[13:15:48] I forced a power cycle with the PDUs, it's live on the console and I *think* it's checking the disks
[13:19:05] yup, finished its boot and looks healthy
[13:19:11] now to retry the upgrade
[13:21:08] <_joe_> !incidents
[13:44:23] alright, the new OS has been installed on the switches, now a reboot to boot into that new version
[13:56:57] woot, they're fully upgraded :)
[13:57:15] \o/
[13:58:10] I'll finish my cleanup, monitor it, let the downtime expire and repool if all good
[13:58:47] robh was standing by in case he needed to go to the DC, but he won't be needed
[14:00:29] that one had an uptime of 4 years 120 days 18 hours 27 minutes
[14:02:27] Wow. Pretty good going.
[16:59:17] let's say someone used "disable-puppet 'foo'" to disable puppet and you want to re-enable it but don't know the message. what would be the right way to force re-enabling it regardless? just not use the wrapper and go back to "puppet agent --enable"? delete the lockfile directly?
[16:59:41] mutante: good question, I always do the former :)
[16:59:47] just agent --enable
[16:59:47] or alternatively.. where is the message stored?
[17:00:08] ACK, thanks sukhe. So I want to create a timer::job that enables it if it's been disabled for too long
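For context on what such a timer job would have to inspect: puppet records the disable reason in its agent_disabled.lock state file, whose path comes up below (the wrapper scripts expose it as PUPPET_DISABLEDLOCK). A minimal sketch of a read-only check, assuming the default state path and a 24h threshold picked purely for illustration; it only reports, since, as discussed just below, re-enabling puppet automatically is risky:

    #!/bin/bash
    # report how long puppet has been disabled on this host, and why; never re-enables
    LOCK=/var/lib/puppet/state/agent_disabled.lock   # default path; see the discussion below
    if [ -f "$LOCK" ]; then
        age=$(( $(date +%s) - $(stat -c %Y "$LOCK") ))   # seconds since the lock was written
        echo "puppet disabled for ${age}s, reason: $(cat "$LOCK")"
        [ "$age" -gt 86400 ] && echo "disabled for more than 24h, someone probably forgot to re-enable"
    fi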
[17:00:26] because now we allow deployers to do that on mwdebug.. and they should not forget to re-enable
[17:00:34] meeting now though
[17:00:36] oh so just for the mwdebug hosts?
[17:00:37] ok
[17:00:45] yea, would put it in the role for that
[17:01:01] mutante: enable-puppet has --force
[17:01:10] but I would not use it and would not set up a timer that way
[17:01:26] you can't know what maintenance is in progress and what the effects of re-enabling puppet automatically would be
[17:02:12] at most it can alert, but we already have alerts for that; we can lower the threshold for the mwdebug ones
[17:02:54] sukhe: you too, as you were interested: don't use 'puppet agent' directly, always the wrappers
[17:03:37] it is stored here mutante: /var/lib/puppet/state/agent_disabled.lock
[17:03:43] cat /var/lib/puppet/state/agent_disabled.lock
[17:03:46] lol
[17:04:02] volans: I haven't looked at the wrapper in detail or at all, but is there a reason for it to be preferred?
[17:04:06] (noted the --force)
[17:04:57] the wrappers are designed to overcome puppet's limitations, such as ensuring that a puppet run is actually executed, etc.
[17:05:26] * sukhe reads
[17:05:39] o, inside the lock file, that makes too much sense :) tx
[17:05:39] volans: except for --test --noop right?
[17:05:58] volans: thanks! ack
[17:06:38] I don't know how to change the alert if that's alertmanager now, but I can follow up on that
[17:07:11] the lock file location depends on puppet's configuration, that's why it is exposed as PUPPET_DISABLEDLOCK when sourcing /usr/local/share/bash/puppet-common.sh
[17:09:46] claime: we could add the noop option to the wrapper if that's useful to people
[17:12:12] volans: I rarely use it, but it can be useful when your change is not covered by PCC output and you'd like to not break stuff. It's just that it's the only use of the native puppet agent command in https://wikitech.wikimedia.org/wiki/Puppet#Noop_test_run_on_a_node
[18:35:42] https://mastodon.neilzone.co.uk/@neil/109745061330779963 <-- in case my dental persistence pun earlier wasn't bad enough ;-)
[18:38:24] nooooooooooo
[18:40:05] sigh, when is kormat back? She appreciates my art^Wpuns
[18:40:10] :)
[18:41:49] [citation needed]
[18:43:10] touché
[19:02:54] "Unixed that joke from someone else" - the response beats the original
[19:15:48] lol
[20:03:47] <_joe_> we really need a "tell me your favourite bad pun" question in interviews
[20:04:39] <_joe_> Emperor: there are other punatics in SRE, the disease has spread. I won't name and shame them
[20:05:06] Old mathematicians never die. They just disintegrate.
[22:39:58] the esams-eqiad circuit has been down for 3h, I opened ticket 25880368 with Lumen