[10:09:10] puppet-merge just had a bunch of errors (but did return zero) - a bunch of puppetservers were unable to run git fetch.
[10:09:46] <_joe_> Emperor: gerrit update ongoing
[10:09:52] <_joe_> and unscheduled I should say
[10:10:03] <_joe_> so this is a problem, probably we have a de-syncing
[10:10:20] Yes, I think the pull worked on some but not others, just making a paste
[10:10:22] <_joe_> I don't know what's the best way to repair that situation with the new setup
[10:11:41] https://phabricator.wikimedia.org/P64817
[10:11:59] I could push a CR with a trivial typo-fix and re-run the puppet merge?
[10:12:05] we can try the easy way first: a commit right after gerrit is back might re-trigger the update
[10:12:22] yeah, just puppet-merging the next commit should bring things back in sync I think
[10:12:25] if it's just git fetch that failed it should recover at the next puppet-merge
[10:14:49] <_joe_> I kinda remember there was a catch-22 there
[10:15:00] <_joe_> but we'll see
[10:19:15] I offer https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042986 as a NFC CR we could use :)
[10:19:41] (what do you mean I need to get out more?)
[10:20:14] the sha1 of the repository state to sync to gets written to config-master and the sync scripts on each puppet server use it to fetch that state, so the next puppet-merge will bring all nodes back in sync
[10:21:09] well, assuming someone's happy to +1 that doc fixup, I'm happy to merge it through once the gerrit update is done, and then we'll be consistent again
[10:22:08] gerrit update should be done, let's see...
[10:23:33] yep, all the puppetservers updated OK (and the ones that failed last time picked up both change-sets OK).
[10:23:48] good
[10:29:27] Pah, I've screwed up my puppet code though :(
[10:30:24] <_joe_> Emperor: L8 issues are hard to self-heal
[10:30:47] T366387 makes it harder for me to test this sort of thing, too
[10:30:51] T366387: PCC throws evaluation error on valid code - https://phabricator.wikimedia.org/T366387
[10:31:11] (PCC just explodes on anything related to the cephadm nodes)
[10:36:29] I think it's my code to do fqdn => mgmt hostname that's faulty
[10:43:31] Not quite sure how, or how to fix it, though
[10:43:33] https://phabricator.wikimedia.org/T366387#9888134
[10:45:32] Oh, got it, I think.
[10:50:43] taavi: good spot
[10:53:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042991 if someone feels like a +1 I'd be grateful (I think this is the cause of the puppet failure on moss-be1001)
[10:54:25] I still think split()'s behaviour is a bit unhelpful
[10:54:42] Ah, split uses REs, no?
[10:55:00] +1'd
[10:55:21] takes a string, or a regex, or a regex type.
[10:55:35] Yeah, now I remember that wonderful automatic beartrap
[10:55:57] allowing either plain or RE is fine, but guessing what the user meant is... not good
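
The sync mechanism described at 10:20 (a target sha1 published on config-master, fetched and applied by a script on each puppet server) could be sketched roughly as below; the URL, repository path and remote name are illustrative assumptions, not the actual wmf tooling. The point is that the desired state is re-read on every run, which is why a failed git fetch self-heals at the next puppet-merge.

#!/usr/bin/env python3
"""Sketch only: sync a local puppet checkout to the sha1 published on config-master.

Assumptions (not the real implementation): the sha1 is served as plain text at
SHA_URL, the checkout lives at REPO_DIR, and the remote is named 'origin'.
"""
import subprocess
import urllib.request

SHA_URL = "https://config-master.wikimedia.org/puppet-sha1.txt"  # hypothetical path
REPO_DIR = "/srv/puppet/operations-puppet"                       # hypothetical path


def git(*args: str) -> str:
    """Run a git command inside the checkout and return its stripped output."""
    return subprocess.run(
        ["git", "-C", REPO_DIR, *args],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def sync() -> None:
    # Read the desired repository state (the sha1 written at puppet-merge time).
    with urllib.request.urlopen(SHA_URL) as resp:
        target_sha = resp.read().decode().strip()

    if git("rev-parse", "HEAD") == target_sha:
        return  # already in sync

    # A failed 'git fetch' here is the error seen in the paste; since the target
    # sha1 is fetched fresh on every run, the next invocation converges anyway.
    git("fetch", "origin")
    git("reset", "--hard", target_sha)


if __name__ == "__main__":
    sync()
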
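On the split() discussion at 10:54-10:55: Puppet's split() accepts a string, a regex, or a Regexp type, and a plain-string delimiter is interpreted as a regular expression, which is the beartrap being referred to. A small Python analogy (re.split has the same failure mode; the hostname is just an example):

import re

hostname = "moss-be1001.eqiad.wmnet"

# Treating the delimiter as a literal string does what the author meant:
print(hostname.split("."))        # ['moss-be1001', 'eqiad', 'wmnet']

# Treating the same string as a regex silently changes the meaning: '.' matches
# every character, so the result is a list of empty strings - the "automatic
# beartrap" when a string delimiter is promoted to a regular expression.
print(re.split(r"\.", hostname))  # ['moss-be1001', 'eqiad', 'wmnet'] - escaped, fine
print(re.split(".", hostname))    # ['', '', '', ..., ''] - unescaped, surprise
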
[11:39:42] mmhh, steady increase of 503s from mw-api-ext-ro since this morning, just now I guess we hit the threshold
[11:40:02] Amir1: ^ FYI
[11:41:20] yeah, about to bump the threshold
[11:42:19] Amir1: cheers, and the 503s are known tho? is that the circuit breaking kicking in?
[11:42:43] oh actually circuit breaking is 500
[11:42:55] I don't know where the 503s are coming from
[11:43:01] kk checking
[11:49:18] I pinged -serviceops, though it doesn't seem like a huge deal / full outage
[11:51:53] going through mw logs in logstash, can't find the smoking gun yet to correlate with the 503 increase
[11:58:00] ok still no joy in finding a smoking gun, will open a task in the meantime
[12:03:48] as a side note I think we should be paging on backend error % / availability as opposed to a static threshold
[12:06:24] task being https://phabricator.wikimedia.org/T367401
[12:06:44] of course it is bound to page again
[12:07:22] did something happen on eqiad around 05:20 AM UTC?
[12:07:37] the 503 increase is happening on eqiad-based DCs but not on codfw ones
[12:09:10] good point yeah
[12:09:50] unclear to me though why only mw-api-ext seems affected
[12:10:20] I don't see a correlation with mw exceptions on https://grafana.wikimedia.org/goto/euiKAF8Ig?orgId=1
[12:12:36] indeed
[12:18:03] hmmm logstash shows quite a few occurrences of JobQueueError: Could not enqueue jobs on mw-api-ext
[12:21:34] interesting, which dashboard/link, vgutierrez?
[12:23:11] godog: https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@47de29a&_a=h@510c69f
[12:23:28] cheers
[12:29:26] re: circuit breaking, it seems to me that when that kicks in as a mw exception the resulting status code is 503 (?) Amir1?
[12:29:52] according to Amir1 the resulting status code should be 500
[12:29:54] nope, that's 500
[12:30:26] ack thank you, the backend errors on mw-api-ext are gone now btw
[12:30:40] you can try it by setting the value to 3 in mwdebug (this value: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1043006)
[12:30:48] ugh
[12:30:56] started at 05:20 and finished at 12:20 UTC
[12:31:04] I think the "could not enqueue job" is a different issue, with kafka
[12:31:11] that looks like some automated job messing with the API for 7h
[12:31:26] oh what is the job?
[12:31:37] no idea
[12:32:09] but the 7h span looks weird
[12:37:13] ok updating the task, still a mystery from my POV :|
[12:37:15] that's the flamegraph for the last hour https://performance.wikimedia.org/arclamp/svgs/hourly/2024-06-13_11.excimer-k8s-wall.RunSingleJob.svgz
[12:38:43] I'm not seeing any major difference from yesterday: https://performance.wikimedia.org/arclamp/svgs/hourly/2024-06-13_11.excimer-k8s-wall.RunSingleJob.svgz
[12:39:03] It can be a low-volume job that's making a lot of API calls, give me a sec
[12:40:51] maybe one of these? https://performance.wikimedia.org/arclamp/svgs/hourly/2024-06-13_11.excimer-k8s-wall.RunSingleJob.reversed.svgz?x=1147.8&y=1381
[12:41:14] could be yeah
[12:46:10] I'm out of ideas atm
[12:48:40] I'll reboot cumin2002 in a few minutes
[12:54:24] Could not enqueue job may be eventgate again
[12:54:41] https://grafana.wikimedia.org/goto/4u2rxKUIg?orgId=1
[12:54:47] Still unsure what exactly causes this tbh
[12:57:28] https://grafana.wikimedia.org/goto/MZ0z-F8IR?orgId=1 wth
[12:59:10] Ah, it's counting everything that's not 201, and it's responding 202
[13:04:21] cumin2002 can be used again
[13:09:52] 202 response is hasty success (fire and forget)
[13:09:52] https://intake-analytics.wikimedia.org/?doc#/default/post_v1_events
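
On the 201/202 confusion at 12:59-13:09: if a panel counts everything that is not 201 as an error, the 202 "hasty success" responses show up as failures. A minimal sketch of client-side classification that treats both as success; the /v1/events path is inferred from the API doc linked above, and the payload handling is simplified rather than the real intake client.

import json
import urllib.error
import urllib.request

INTAKE_URL = "https://intake-analytics.wikimedia.org/v1/events"  # path assumed from the linked API doc


def post_events(events: list[dict]) -> str:
    """POST a batch of events and classify the response, instead of treating
    everything that is not 201 as an error (what the panel above was doing)."""
    req = urllib.request.Request(
        INTAKE_URL,
        data=json.dumps(events).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code

    if status == 201:
        return "accepted"          # events validated and accepted
    if status == 202:
        return "accepted (hasty)"  # fire-and-forget success, not an error
    return f"error ({status})"     # only the remaining codes should count as errors
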
[13:50:08] claime: godog: Some errors seem to be from WikibaseQualityConstraint jobs? https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2024.06.13?id=XTnVEZABDD1VxFBSKtZp https://logstash.wikimedia.org/goto/acde3bf5c1c449a762b837f7093b4bf2
[13:50:08] The service it's calling is this: https://grafana.wikimedia.org/d/wX9rDrIIk/service-resource-usage-breakdown?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s&var-namespace=shellbox-constraints&var-container=All
[13:53:07] <_joe_> maybe we should roll-restart shellbox-constraints
[13:53:09] <_joe_> :)
[13:53:54] Sure
[13:54:53] it's pretty busy in eqiad but not underwater
[14:04:06] looks like it stopped the errors
[15:44:34] cdanis: you mean that was the ganeti master node?
[15:44:40] kamila_: I meant, kube-ctrl3 is still physically where it should be, to the best of your knowledge
[15:44:54] it is either that or the ganeti node just died
[15:44:56] effie: the codfw ganeti master (ganeti2020) took about 10-20 seconds to list its instances
[15:45:04] excellent
[15:45:26] effie: it's not anymore apparently
[15:45:35] ok cool
[15:45:49] seriously, my luck is getting ridiculous XD
[15:46:55] kamila_: have you checked if the sacrifice daemon is configured correctly?
[15:47:52] eqiad: sudo gnt-instance list 0.17s user 0.02s system 46% cpu 0.422 total
[15:47:57] codfw: sudo gnt-instance list 0.20s user 0.02s system 0% cpu 22.301 total
[15:47:58] cdanis: at least they all seem up
[15:48:04] 22 seconds vs 0.4s
[15:48:16] that seems weird and probably explains why things like the netbox sync are also failing
[15:48:53] I would like to help more, but erik will be here any minute
[15:49:20] ganeti2028.codfw.wmnet is the one that seems to be working extra hard
[15:50:15] speaking of luck: I had seven pages this week so far. Highest I've ever seen
[15:50:37] [Thu Jun 13 15:38:41 2024] block drbd5: We did not send a P_BARRIER for 214832ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
[15:50:48] there it is
[15:50:56] this happened recently too
[15:51:12] effie: https://phabricator.wikimedia.org/T348730
[15:51:26] I am also sure I debugged this recently too
[15:51:41] cdanis: so we drain it?
[15:51:46] and reboot it?
[15:53:03] effie: it is unlikely to finish draining
[15:53:08] hold on, I'm digging through IRC logs
[15:53:54] that is true
[15:53:55] I'd be +1 for rebooting ganeti2028, probably via force
[15:54:02] taavi: this is the setup I use (but perhaps it has gone sacrifice-free for too long) https://usercontent.irccloud-cdn.com/file/qrwGYMJk/IMG_20200129_130208_800.jpg
[15:54:53] oh sure, a star topology
[15:56:06] it's missing blood of network engineer
[15:56:11] *network
[15:56:17] *a network
[15:56:31] that's my sign to go touch grass. TTYL
[15:56:33] oh no amir is saying network 3x into a mirror
[15:57:07] xD
[15:57:23] creating BGP loops out of words
[15:57:37] soooo ganeti2028? thoughts on a hard reboot?
[15:57:53] effie: last time I had to force reboot the node to get it back
[15:58:02] cdanis: sounds like a plan
[15:58:08] are you, or shall I?
[15:58:23] I will go ahead
[16:00:49] cdanis: go ahead please, I have not installed gpg yet on my new laptop :facepalm:
[16:12:30] Disk 0 is degraded on target node, aborting failover
[16:12:32] There were errors during the failover:
[16:12:34] 4 error(s) out of 4 instance(s).
[16:12:36] sigh
[16:14:35] moritzm: are you around?
[16:14:53] wtf what ?
[16:15:11] I have to run, I am terribly terribly sorry I can't stay
[16:16:17] cdanis: where did you see that message?
[16:17:24] cdanis: there is the sre.ganeti.drain-node cookbook for this kind of operation
[16:17:43] volans: drbd was deadlocked in the kernel
[16:17:50] does the cookbook work for that state?
[16:18:12] jhathaway: running gnt-node failover -f ganeti2028.codfw.wmnet
[16:18:28] nod, thanks
[16:18:50] it runs node migrate and node evacuate IIRC
[16:19:15] it also handles the hosts without drbd btw
[16:19:18] volans: in this state, those won't ever finish or succeed
[16:19:29] also the Ganeti docs don't mention the cookbooks at all :)
[16:19:30] then there is no way to migrate the hosts
[16:19:58] sudo cookbook -lv sre.ganeti
[16:20:55] possible to ignore the degraded check?
[16:21:06] is ganeti2028.codfw.wmnet coming back up?
[16:21:20] it's not
[16:21:23] I powercycled it on mgmt
[16:21:24] ah up now
[16:21:27] just did
[16:22:02] ok, we're fine to not failover then I think
[16:22:12] ganeti reporting all the instances running now too
[16:22:18] helper command: /bin/true before-resync-target minor-1
[16:22:24] that's a relief
[16:22:39] seems weird that all the helper commands are prefixed with /bin/true \o/
[16:22:45] haha
[16:22:47] should we fail over?
[16:22:52] jhathaway: I think we're fine
[16:22:56] nod
[16:23:05] this is a sporadic kernel issue
[16:23:13] it gets into a bad state and basically deadlocks
[16:23:23] Moritz filed T348730 last year
[16:23:26] T348730: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730
[16:23:34] judging from IRC logs this is happening about once every 2 months
[16:23:36] glad to see some things in DRBD land never change ;)
[16:23:41] god, right
[16:23:49] I'm sure I saw ganeti2028 had a degraded raid as well, but I closed the console already; seems fine after reboot
[16:23:53] ok that was the last service to recover too
[16:23:57] herron: should be in syslog
[16:24:11] I'll write up some notes + list of recent events on that kernel hang bug
[16:24:15] after an appt and lunch
[16:24:21] thanks cdanis
[16:26:59] having a look into drbd failure monitoring as well, maybe there's something we could do there near term
[16:27:32] yeah it would be nice to have an alert on a node being in this state, somehow, even if it was just based on logs
[16:28:32] there is a drbd-reactor tool that does multiple things and I think also exposes prometheus metrics
[16:28:44] but I've never used it first-hand
[16:28:47] yeah that looks pretty nice
[16:29:31] or parsing /proc/drbd too in a pinch
[16:29:34] I had my share of drbd issues in the past, but that was 8+y ago, so I'm not counting anymore
[17:05:10] * bd808 sees "DRBD" in backscroll and wishes for IRC content warnings.
[17:06:20] * bblack raises bd808 a PARTMAN
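
On the drbd monitoring idea at 16:27-16:29 ("even if it was just based on logs", "or parsing /proc/drbd too in a pinch"): a rough sketch of a /proc/drbd check that emits a per-device gauge in Prometheus textfile format. The cs:/ro:/ds: parsing follows the usual /proc/drbd layout, but the metric name and the definition of "healthy" are assumptions, and a device hung inside the kernel (as in this incident) may still report Connected/UpToDate here, so a log-based alert on the "drbd kernel thread blocked?" message would be a useful complement.

import re
import sys

# Anything not Connected / UpToDate is treated as suspect here; the exact set of
# states worth alerting on is a judgement call, not something /proc/drbd defines.
LINE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)")


def check(proc_path: str = "/proc/drbd") -> int:
    """Print one gauge per DRBD device and return the count of unhealthy ones."""
    degraded = 0
    with open(proc_path) as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if not m:
                continue  # skip the version header and the per-device stats lines
            minor, cstate, _roles, dstates = m.groups()
            healthy = cstate == "Connected" and all(
                d == "UpToDate" for d in dstates.split("/")
            )
            if not healthy:
                degraded += 1
                print(f"drbd minor {minor}: cs={cstate} ds={dstates}", file=sys.stderr)
            # Prometheus textfile-style gauge, one sample per DRBD device.
            print(f'drbd_device_healthy{{minor="{minor}"}} {int(healthy)}')
    return degraded


if __name__ == "__main__":
    sys.exit(1 if check() else 0)
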
[17:30:44] does anyone here know how to register an application with CAS (idp.wikimedia.org)?
[17:31:05] we renamed an existing service and now it's not authorized
[17:31:18] but I already edited the idp.yaml with the service name
[17:31:43] mutante: which service?
[17:32:13] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041750/5/hieradata/role/common/idp.yaml
[17:32:26] gitlab-replica-a and gitlab-replica-b
[17:32:29] maybe it's caching
[17:33:07] e.g. the login: https://gitlab-replica-b.wikimedia.org/users/sign_in?redirect_to_referer=yes
[17:35:28] that redirects to https://idp.wikimedia.org/oidc/oidcAuthorize?client_id=gitlab_replica_oidc
[17:35:57] which matches https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/idp.yaml#169 and not the section you modified
[17:36:21] did someone forget to clean up the CAS config after gitlab moved from the CAS protocol to OIDC?
[17:36:42] oh, interesting
[17:36:58] I guess that line didn't match my grepping..
[17:37:18] thanks, let me try to adjust it there
[17:39:29] should also do idp_test.yaml
[17:52:58] yes, that fixed the login
[18:10:16] for such a failure a powercycle is indeed fine, see https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node
[18:10:51] we've installed the new magru ganeti already with bookworm, so 6.1, hopefully that fixes the DRBD kernel bug (given it affects multiple servers)
[18:11:33] the incoming refresh batch will also use bookworm from the start (only need to validate beforehand that live migration between 5.10 and 6.1 works fine)
[18:26:44] so moritzm, this host went offline and took several minutes to reboot -- eventually it came back up, but if it hadn't I would have had to pass --ignore-consistency
[18:31:15] yes (or if e.g. the host for whatever reason doesn't come back after the reboot)
[18:32:23] ah, I see you updated the docs already, thanks for that
[18:32:49] thank you!
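
A postscript on the idp.yaml hunt at 17:35-17:37, where the live registration turned out to sit under a client_id in a section other than the one that was edited: a generic helper like the sketch below (it assumes nothing about the idp.yaml schema beyond it being YAML, and needs PyYAML) reports every path where a given string appears, which also surfaces leftover entries such as a stale CAS registration after a move to OIDC.

import sys
import yaml  # PyYAML


def find_value(node, needle, path=""):
    """Yield the path of every string scalar in a parsed YAML document containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_value(value, needle, f"{path}.{key}")
    elif isinstance(node, list):
        for idx, value in enumerate(node):
            yield from find_value(value, needle, f"{path}[{idx}]")
    elif isinstance(node, str) and needle in node:
        yield f"{path}: {node}"


if __name__ == "__main__":
    # e.g. python3 find_in_yaml.py hieradata/role/common/idp.yaml gitlab_replica_oidc
    filename, needle = sys.argv[1], sys.argv[2]
    with open(filename) as fh:
        doc = yaml.safe_load(fh)
    for hit in find_value(doc, needle):
        print(hit)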