[09:07:09] does anyone know whether there is a puppet command to process and pretty print `/var/lib/puppet/state/last_run_report.yaml` and/or `/var/lib/puppet/state/last_run_summary.yaml`
[09:08:13] I often look at /var/log/puppet.log and felt that maybe I could have a puppet cli command to give me the output of the last run
[09:09:20] As far as CLI goes that's about it
[09:09:53] (AFAIK)
[09:10:00] hashar: That sounds like a nice idea. I don't know of anything, but I tend to use puppetboard.wikimedia.org for referring back to reports.
[09:10:07] You can run `puppet agent -t --noop --summarize`
[09:10:37] But yeah as btullis said, puppetboard is the pretty print for reports, there's not really a CLI solution for this
[09:11:07] OH puppet board!
[09:11:15] I guess that one relies on puppetDB doesn't it?
[09:11:28] it does
[09:11:45] then I can't access it cause well I guess it has way too many secrets ;)
[09:12:06] 👀
[09:12:21] I guess it is SRE only which surely makes sense
[09:12:36] It does, considering that it does contain secrets
[09:12:43] And puppet is bad at hiding secrets
[09:20:59] the log files are good enough for me meanwhile :]
[09:21:58] I have another ruby/Puppet related question. My system has ruby 2.7.4 and when running spec tests I get a bunch of ruby warnings such as: `.gem/gems/puppet-5.5.10/lib/puppet/util.rb:472: warning: URI.escape is obsolete`
[09:22:26] which is due to Puppet 5.5.10 using a Ruby API deprecated in ruby 2.7.4
[09:22:50] thus I am wondering if there is a way to maybe monkey patch away those "useless" warnings in our Rakefile
[09:23:00] or if someone encountered those warnings as well
[09:25:29] ah I found jbond had a monkey patch for URI.escape already https://gerrit.wikimedia.org/r/c/operations/puppet/+/799982 will dig into that
[09:49:12] hashar: yes the ruby patch should make things work but there are still a bunch of warnings, i generally run with RUPYOPT=-W0:q
[09:49:26] RUBYOPT=-W0
[09:49:56] I gave it a try but the monkey patch of URI.escape does not work when running `bundle exec rake spec` from within a module (ex: from `./modules/gerrit`).
[09:50:10] I guess I can craft a bash alias with RUBYOPT, that is a good idea
[09:54:32] hashar: hmm looks like we need to require 'monkey_patch_early' in rake_modules/module_rake_tasks.rb
[09:55:27] yeah I tried to require the monkey patch at different places but that did not quite work ;D
[09:55:39] ack ill take a look
[09:56:01] then eventually I had an error about DEFAULT_PARSER not existing and that has led me here
[09:56:27] an alternative is I could use rvm to use an older ruby version
[09:57:44] hashar: no we should be able to get it working
[09:58:27] hashar: what command are you running?
[09:58:34] cd modules/gerrit
[09:58:37] bundle exec rake spec
[09:58:54] hmm strange WFM :P
[09:59:13] I guess my local install is broken then!
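Since no built-in puppet subcommand for this exists, a small script is about the best available. Below is a minimal Ruby sketch of the kind of helper hashar is describing: it assumes the usual top-level keys of last_run_summary.yaml on a Puppet 5 agent (version, resources, time, events) and only handles the summary file; last_run_report.yaml is serialized Ruby objects and needs Puppet's own classes to deserialize safely, so it is left out here.

```ruby
#!/usr/bin/env ruby
# Sketch only: pretty-print Puppet's last_run_summary.yaml from the CLI.
# Field names follow the summary layout commonly seen on Puppet 5 agents;
# adjust the path and keys for other setups.
require 'yaml'

path    = ARGV.fetch(0, '/var/lib/puppet/state/last_run_summary.yaml')
summary = YAML.safe_load(File.read(path))

last_run = Time.at(summary.dig('time', 'last_run').to_i)
puts format('Last run : %s (%d minutes ago)', last_run, ((Time.now - last_run) / 60).round)
puts format('Duration : %.1fs', summary.dig('time', 'total').to_f)
puts format('Resources: %d total, %d changed, %d failed',
            summary.dig('resources', 'total').to_i,
            summary.dig('resources', 'changed').to_i,
            summary.dig('resources', 'failed').to_i)
puts format('Events   : %d total, %d failures',
            summary.dig('events', 'total').to_i,
            summary.dig('events', 'failure').to_i)
```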
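On the warnings themselves: RUBYOPT=-W0 silences everything, including warnings you might actually want. A more surgical option on Ruby 2.7 is to filter specific messages through the Warning module, e.g. from a spec helper or the Rakefile. This is a hypothetical sketch of that approach, not the patch jbond ended up writing; the two message patterns are the ones quoted in the conversation.

```ruby
# Hypothetical sketch: drop only the known-noisy deprecation warnings
# instead of disabling all warnings with RUBYOPT=-W0. On Ruby >= 2.4 every
# warning goes through Warning.warn, so overriding it and falling back to
# `super` for everything else is enough.
NOISY_WARNINGS = [
  /URI\.escape is obsolete/,
  /Using the last argument as keyword parameters is deprecated/,
].freeze

module Warning
  def self.warn(message)
    return if NOISY_WARNINGS.any? { |pattern| pattern.match?(message) }
    super
  end
end
```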
[09:59:24] can you send me a paste of the full output
[09:59:41] i can test in a clean vm/container in a bit
[10:00:27] https://phabricator.wikimedia.org/P34776
[10:01:16] thanks ill take a look and get back to you :)
[10:02:17] hashar: ahh, the URI monkey patch was actually to fix people using ruby 3 which drops that function
[10:02:31] i have not created a patch to get rid of the deprecation warnings
[10:03:20] fyi i also get a bunch of deprecation warnings for
[10:03:20] file_impl.rb:80: warning: Using the last argument as keyword parameters is deprecated
[10:04:44] oh so that is a default monkey patch fun
[10:04:58] maybe I should upgrade to ruby 3
[10:04:59] :D
[10:05:36] lol ill take a look to see if we can improve the patch to silence the warnings
[10:08:33] awesome thanks!
[10:14:12] I'll be testing reimaging of a sessionstore in 30 minutes, not expecting anything major to happen but it is an in-place reimage keeping data so
[10:18:12] jbond: and I forgot I have ruby 2.7.4p191 from Bullseye
[10:18:37] hashar: yes same here
[11:16:56] almost solved :)
[11:23:07] reimage starting, sessionstore as a service currently seems fine with a node down
[11:39:02] teeny tiny fix review if someone has a sec https://gerrit.wikimedia.org/r/c/operations/puppet/+/832481/
[11:41:54] hnowlan: Looks good to me.
[11:42:26] thanks! (and jynus also) <3
[12:26:47] perhaps somewhat surprisingly the sessionstore reimages went without much issue, I'd like to remove the manual step from reimaging: https://gerrit.wikimedia.org/r/c/operations/puppet/+/832490/
[12:32:03] great to hear! just +1d the patch
[12:32:19] same
[12:32:26] thanks!
[12:38:06] I was wrong, seems there's an issue with the reimaged host
[12:38:32] as a precaution I'm stopping cassandra on the new host
[12:41:05] kask couldn't get quorum for writes; the errors seem to have stopped after stopping cassandra on the reimaged host, but I don't understand what's happened with the new host to cause this
[12:45:51] yep, I definitely just caused this https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13&from=now-3h&to=now
[12:55:31] hnowlan: anything that we can do to help?
[12:56:57] elukey: if you fancied digging into the logs on sessionstore1001 or any of the other sessionstore hosts to see why they couldn't connect it'd be handy :)
[12:57:10] I guess we should start an incident doc for this
[12:57:49] hnowlan: yeah I think so.. I checked on sessionstore1001's cassandra logs and didn't find anything weird, was the instance in a good state? (I mean in nodetool status)
[12:58:11] yeah it was "UN" in nodetool
[12:58:20] but if you check the logs for all other instances they couldn't communicate
[13:02:34] I am checking, there seems to be a successful handshake with the 1001-a instance, that is reported UP and then DOWN
[13:03:17] then I see again UP, and then a lot of hinted handoffs
[13:03:29] and then shutdown (your manual action)
[13:04:20] so the cassandra cluster was UP, all nodes in UN, and kask wasn't able to communicate with 1001
[13:04:31] hnowlan: --^ does it make sense?
[13:05:05] I am wondering if the 1001-a instance suffered right after the bootstrap due to the hinted handoffs from other instances
[13:05:28] (not sure where to find the kask logs, I guess k8s logs?)
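For reference, the kind of shim jbond describes above (re-adding URI.escape for Ruby 3, which removed the method outright) can be built on the RFC2396 parser that still ships with the stdlib. This is a hypothetical reconstruction, not the literal content of the Gerrit change; the "DEFAULT_PARSER not existing" error hashar hit is one common symptom of such a patch being loaded before `require 'uri'`, since reopening `module URI` then defines an empty URI module with no DEFAULT_PARSER constant.

```ruby
# Hypothetical reconstruction of a URI.escape compatibility shim for Ruby 3.
# Requiring 'uri' first matters: without it, `module URI` below would create
# an empty URI module and DEFAULT_PARSER would be missing at call time.
require 'uri'

unless URI.respond_to?(:escape)
  module URI
    # URI::DEFAULT_PARSER is an RFC2396_Parser instance whose #escape and
    # #unescape are broadly compatible with the removed module methods.
    def self.escape(*args)
      DEFAULT_PARSER.escape(*args)
    end

    def self.unescape(*args)
      DEFAULT_PARSER.unescape(*args)
    end
  end
end
```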
[13:05:40] elukey: that sounds reasonable
[13:06:02] kask logs are here https://logstash.wikimedia.org/goto/c63dc23f38a3b7bf93e87fc8e1d447b2
[13:07:45] it seems that the issue was that all other nodes couldn't write to 1001
[13:07:54] hnowlan: how does kask communicate with cassandra?
[13:08:13] is there a lvs endpoint in front, or are the instances all hardcoded in its config?
[13:08:38] because I am wondering if we need to remove 1001-a from kask reachability for some time after the bootstrap
[13:08:42] elukey: directly, they're hardcoded in the config
[13:10:14] https://grafana.wikimedia.org/d/000000497/cassandra-read-repair?orgId=1&var-datasource=eqiad%20prometheus%2Fservices&var-cluster=sessionstore&from=now-3h&to=now
[13:10:26] the "blocking" repairs sound related
[13:11:09] oof, possibly. I'm surprised they would block so hard as to make the node drop communication with other nodes in the cluster though
[13:13:26] hnowlan: IIUC the other nodes of the cluster marked 1001-a as UN, sending hints etc.. but kask tried to use quorum, every now and then, with the newly bootstrapped 1001-a (now reachable) getting timeouts (probably due to 1001-a being very busy in writing down what other nodes were streaming to it)
[13:13:41] in theory if we remove 1001-a from kask and retry to start it, it should work fine
[13:14:11] but we should probably wait for urandom's opinion :)
[13:14:19] (mine is all speculation)
[13:14:32] yeah, I've raised it with him already and he'll be on in a bit
[13:15:06] I see active hinted handoff from sessionstore1002 to 1001-a up to a few seconds before I stopped the node
[13:15:17] here now!
[13:15:41] ah I see now the storm of InetAddress /10.64.0.144 is now UP in 1003-a's logs
[13:15:41] hey :)
[13:15:46] not in 1002-a logs though
[13:15:50] hello urandom :)
[13:17:39] hnowlan: also no sign of distress in jvm metrics for 1001-a afaics
[14:54:39] brett: shall I merge 'BCornwall: varnish/tests: Remove extraneous test checks (047a056ca1)' ?
[14:57:31] * andrewbogott merged it
[16:39:49] moving here due to icinga spam
[16:40:11] looks like the socket timeouts are due to worker saturation, still working on a cause
[16:40:28] cwhite: over here
[16:40:46] o/
[16:40:51] switch down is possible but I'm betting no, given only the appservers seem affectec
[16:40:53] *ed
[16:41:25] whatever it was seems transient, request rates are high now probably due to retries
[16:41:54] latency is still up but I think that's a trailing indicator, active workers are still high but sub-100% and dropping
[16:42:26] claime: a good chance to practice with the logs we were just talking about, let me know if you find anything :)
[16:42:45] cwhite: do you mind taking IC and starting a doc while I dig?
[16:43:35] sure, I found that most errors were raised by maps
[16:45:04] eh, scratch that, fairly small subset. I was looking at the trailing edge
[16:45:19] nod
[16:45:32] maps is definitely unhappy but I'm not sure if that's a cause or an effect
[16:46:27] of note, both eqiad and codfw are equally unhappy so I think we can rule out anything in the DC infrastructure
[16:47:22] I don't see any commonalities in host or even DB section (which makes sense, the DBs seem healthy)
[16:54:01] hm, maybe DB-related after all, https://logstash.wikimedia.org/goto/3b22bafac578efde058995a9343a30da tracks closely
[16:56:54] and also of note, that dashboard shows we're maybe disproportionately looking at db2129, db1173, and db2117? checking to see if that tracks what we're seeing elsewhere
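Looking back at the sessionstore discussion above, hnowlan's surprise is reasonable: with a replication factor of 3 per datacenter (an assumption here, not sessionstore's verified configuration), a quorum write only needs 2 acknowledgements, so a single struggling, freshly bootstrapped replica should not break writes on its own. The sketch below is purely illustrative Ruby arithmetic, not kask's actual code or settings.

```ruby
# Purely illustrative: quorum arithmetic for a Cassandra-style cluster.
# RF=3 is an assumption for illustration only.
def quorum(replication_factor)
  replication_factor / 2 + 1   # integer division: RF=3 -> 2 acks needed
end

rf = 3
acks_needed = quorum(rf)

# One freshly bootstrapped, overloaded replica timing out...
replica_responses = {
  'sessionstore1001-a' => :timeout,
  'sessionstore1002-a' => :ack,
  'sessionstore1003-a' => :ack,
}

acks = replica_responses.values.count(:ack)
puts "need #{acks_needed}, got #{acks}: " \
     "#{acks >= acks_needed ? 'write succeeds' : 'write fails with a quorum error'}"
# => need 2, got 2: write succeeds
```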
[16:57:39] impact-wise it looks like this was indeed transient and nothing ongoing, but still checking on that too
[16:58:36] oh, except possibly maps impact, which may be ongoing
[16:59:14] and not sure if those thanos alerts are related -- offhand I can't see how but they're time-correlated at least
[16:59:20] there's a big error spike on x2
[16:59:47] x1 too but a bit less
[17:00:27] but it happens after the spike in open connections
[17:00:33] claime: oh man, the DB errors graph on the appserver RED dashboard is broken and I didn't realize
[17:00:38] thank you
[17:00:52] I'm on https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=All&var-role=master&from=now-1h&to=now
[17:01:20] yep
[17:01:21] db2117:9104
[17:01:23] 1
[17:01:25] db1189:9104 1
[17:01:27] db1173:9104 1
[17:01:29] Unpollable in the last 30 min
[17:01:54] marostegui: happen to be around?
[17:01:59] Yes
[17:02:27] something databasey broke around 16:34 to 16:40 and I need a grownup :)
[17:03:08] looks like it was likely x2 related, which tracks as both eqiad and codfw were affected, but I don't have a good root cause yet
[17:03:37] So from your logstash link, it looks all related to ruwiki
[17:03:58] So s6
[17:04:23] yeah -- jawiki and frwiki too but I see now those are also s6
[17:04:30] yep
[17:04:56] all three of those hosts are 10.4 as well, fwiw
[17:05:30] Yeah, I was checking that, it just looks like a spike in connections
[17:06:46] So yeah, something definitely hit s6 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1131&var-port=9104&from=1663259211606&to=1663260687188&viewPanel=16
[17:08:11] hm yeah, and that won't just be a symptom of appserver workers being saturated, right?
[17:08:43] the dip in queries I mean
[17:08:57] Yeah, it could be if all the DBs got a bit of lag I would assume
[17:09:07] That's the master, which would have gone to read-only if there's lag everywhere within s6
[17:09:20] MW would have gone RO for a few seconds
[17:09:37] nod
[17:10:34] I need to install a 10.6 in s6 too so we can gather data from there grrrr
[17:11:05] from what I can see everything is fine
[17:11:09] Now, I mean
[17:12:24] maps is still serving a bunch of 503s but I'm not sure what the story is there
[17:12:52] does maps use swift?
[17:13:27] I don't believe so, IIUC it's backed by cassandra and postgres
[17:18:52] okay I do feel confident saying everything not-maps is fully resolved, but maps in codfw at least is definitely broken https://grafana.wikimedia.org/d/AWSBX1rMk/tegola-vector-tiles-app-metrics?orgId=1&refresh=30s&var-dc=codfw%20prometheus%2Fk8s&var-service=tegola-vector-tiles&var-db_dc=codfw%20prometheus%2Fops
[17:19:46] in eqiad it looks healthy, I'm a little tempted to just depool it in codfw and leave it for h.nowlan or someone to poke around in the morning -- but I know it's been underprovisioned for that before, not sure of the current status
[17:19:49] thanos is broken for some reason
[17:20:47] thanos-store can't restart for some upstream issue: `internal server is shutting down" err="bucket store initial sync: sync block: BaseFetcher: iter bucket: 503 Service Unavailable`
[17:21:32] hrm
[17:21:58] it seems to indicate a problem with swift as that's what thanos-store is referencing
[17:22:40] okay no, I think maps v2 *does* depend on swift, I didn't realize they had that dependency in place already but it sure explains the current state
[17:25:25] swift's own dashboards look fine afaict, except for a sharp drop in deletes in eqiad that correlates with the incident and never recovered: https://grafana.wikimedia.org/goto/VHkVuZn4z?orgId=1
[17:27:13] but thanos-swift in both eqiad and codfw is ALL well-there's-your-problem graphs https://grafana.wikimedia.org/goto/vp-DXZn4z?orgId=1
[17:29:59] thanos uses a separate swift cluster though, right? I feel weird that maps would be affected
[17:30:13] afaik, yes: thanos-be
[17:31:33] hm, I was going to say especially given media serving isn't affected, but that's not true: https://grafana.wikimedia.org/goto/RYS1uW74z?orgId=1
[17:32:21] so whatever happened at 16:35ish caused *both* swift clusters to get into a bad state and stay there
[17:42:30] cwhite: I'm not getting anywhere, are you? I don't want to tag in g.odog or e.mperor in their evenings but we might need to
[17:42:44] I think we should. I'm not getting very far either.
[17:42:59] still haven't located the source of the 503
[17:43:08] swift-object appears to be running and happy
[17:43:34] but something between thanos-store -> swift-object is no good
[17:44:22] nod
[17:44:30] e.mperor was out sick so let's escalate to godog if we can
[17:45:48] ~.
[17:48:18] fired a page
[17:49:32] checking
[17:49:37] thank you <3 sorry
[17:49:45] thanks <3, sorry about thing ping
[17:49:46] no worries!
[17:49:49] *the
[17:50:03] reading backlog now
[17:50:29] what's the current theory ?
[17:50:45] thanos-store is down on thanos-fe2002 (unresponsive on the rest) and will not come back because 503 on bucket store initial sync
[17:51:18] I'm guessing it's swift issuing a 503 to thanos-store, but I find no evidence that the request even reaches swift
[17:51:21] we had a spike in DB errors in s6 around 16:35-16:40, and it appears 503s from swift, affecting thanos as well as media storage, started at the same time and haven't recovered
[17:51:43] mmhh ok, so both swift for media storage and swift thanos seem to be affected (?)
[17:51:52] I'm asking because those are two different clusters
[17:51:56] yeah that surprised us too
[17:52:07] but we're seeing errors in maps and varnish upload fetches, as well as in thanos
[17:52:25] so I'm wondering if there's a common dependency somewhere
[17:52:39] ok I'll start with swift media storage
[17:54:00] yeah definitely a drop in DELETEs at 16:35 https://grafana.wikimedia.org/d/000000584/swift-4gs?orgId=1&var-DC=eqiad&var-prometheus=thanos
[17:55:02] and a brutal drop in thanos-swift https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1 maybe I'll focus on that first
[17:56:39] mmhh and e.g. nothing in /var/log/swift/proxy-access.log on thanos-fe1001 since 16:37:03
[17:57:30] cwhite: could that line up with adding ipv6 records for thanos-fe plus dns ttl?
[17:58:14] checking
[17:58:47] dns was applied at 15:39
[17:59:38] it lines up pretty closely
[18:00:25] siiigh ok envoy doesn't listen on ipv6 by default, so pybal can't talk to thanos-fe on ipv6
[18:00:38] I'd say revert the v6 change, seems easiest
[18:00:50] or roll forward with envoy on v6, thoughts ?
[18:01:04] ahhhh
[18:01:05] T255568
[18:01:05] T255568: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568
[18:01:06] let's revert imo
[18:01:30] ok, gonna take some time to update netbox
[18:01:32] although I guess it'll take another hour for the ttl to expire :/
[18:01:38] yeah I was gonna say
[18:01:56] ok let me try the envoy v6 thing
[18:03:09] okay yeah, if you're confident in the config change let's give it a shot, happy to review
[18:03:35] fairly confident yeah
[18:03:40] I hate to roll forward instead of back in an incident, but without having already shortened the TTL we don't have a ton of good choices
[18:04:31] indeed, unless we also chase flushing the resolver cache, which might be worse in terms of blast radius
[18:04:47] haha yeah
[18:04:56] lgtm, let's try on one host first if we can
[18:05:35] there's a cookbook for that
[18:08:57] oh ok, neat I didn't know about the cookbook, but yeah pybal still has troubles fetching the healthcheck from thanos-fe
[18:10:39] ok memcached doesn't listen on v6 and neither does swift-proxy :( I can see the rabbit hole from here
[18:10:48] let's revert the v6 change I'd say
[18:10:52] ok, rolling back ipv6
[18:11:25] sgtm
[18:12:22] ack, opening a task to track v6 on swift in the meantime
[18:16:54] question for later, did this also somehow cause the DB errors/appserver issues we saw at the same time, or was that a huge coincidence
[18:17:18] neither of those seems very likely as far as I can tell??
[18:17:39] agreed it doesn't seem likely to me either
[18:18:22] things seem to be coming back though, pybal is happy enough at least to route requests
[18:21:02] and the deletes came back on swift ms, can't explain why off the top of my head now
[18:21:58] oops, wrong channel: hmm, tegola graphs are recovering but varnish failed upload fetches are still elevated
[18:21:59] ah I get it now, the dashboard considers ms and thanos, making a note to fix that
[18:22:30] https://grafana.wikimedia.org/goto/s2ThqZnVz?orgId=1 and https://grafana.wikimedia.org/goto/Rouh3W7Vz?orgId=1 respectively
[18:23:29] we only rolled back thanos-fe, right, did we make a v6 change for ms-fe as well that needs to be reverted/purged?
[18:23:55] no change to ms-fe as part of that task, but graphite ipv6 records were added
[18:25:03] hmmm
[18:25:10] seems to be failed services on ms-be2035
[18:25:26] still poking at that but I'm inclined to give releng the go-ahead for the train in the meantime, now that thanos is fixed -- any objection?
[18:25:27] but that's old (22d
[18:25:32] I see kartotherian still failing its probes indeed
[18:26:00] maybe roll back graphite change and see if maps recovers?
[18:26:39] hah kartho is recovering too now, maybe waiting a few minutes is good enough
[18:26:44] hm, it is recovering
[18:26:48] haha
[18:27:18] not sure if graphite can be significantly implicated in this case, in the sense that most/all traffic is statsd which is udp so it'd fail silently
[18:27:37] yeah things are coming back
[18:28:03] re: "good enough" above -> https://twitter.com/CSMFHT/status/1570336932894474240
[18:28:25] just so
[18:30:07] we've successfully verified retries for thanos-swift dependencies work
[18:32:07] I'll hang out for another 5/10 minutes just in case
[18:33:19] really appreciate the help, sorry to mess up your evening
[18:34:31] Thank you so much <3
[18:39:20] I'm going to close the incident unless there are objections.
[18:39:20] thanks folks <3 no worries, I'm glad a revert worked as expected
[18:39:35] SGTM cwhite
[18:41:12] 👍
[18:41:47] this'll be a good one to review, some interesting takeaways both for architecture and tooling
[18:42:39] agreed!
[18:47:50] ok going afk, ttyl
[18:49:05] good night!