[01:09:45] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 14 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:09:51] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 35.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:33] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:39] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:31:33] Emperor: looks like thanos-be2001 has its / filesystem filled up
[08:42:59] so it does :(
[08:45:26] 47G of /srv smells to me
[08:45:55] /srv/log even
[08:47:39] server.log and server.log.1 both ~20G
[08:50:03] godog: is this node just very very busy? server.log.1 has 51,552,245 entries, cf 14,451,475 from a randomly-selected eqiad ms backend
[08:50:45] godog: and it's that increase in log volume that's filled the filesystem
[08:52:29] godog: in the mean time, I propose to truncate that log file to 2G if you agree? But server.log is already 21G so we're going to have further problems later
[08:54:03] Emperor: yeah agreed to truncate, I'm looking at this dashboard and can't find any obvious sign of higher activity yet
[08:54:07] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=thanos&var-instance=All&var-datasource=thanos&from=now-7d&to=now
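A minimal shell sketch of the truncation being agreed here, assuming the rotated file lives at /srv/log/server.log.1 (only /srv/log and the file names appear in the discussion, so the exact path is an assumption):

    # Sketch only -- path assumed from the discussion above.
    # The rotated file should no longer be written to, so it can be cut down
    # in place; keeping 2G preserves some recent history for debugging.
    sudo truncate -s 2G /srv/log/server.log.1
    # The live server.log (~21G and still growing) is the next problem, so
    # check how much headroom this actually bought.
    df -h /
    sudo du -sh /srv/log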
[08:57:34] godog: FWIW, an entirely eyeball-sample look at server.log finds a lot of entries like: Feb 15 07:23:21 thanos-be2001 object-server: 10.64.16.176 - - [15/Feb/2023:07:23:21 +0000] "GET /sdm1/56647/AUTH_thanos/thanos/01EF20P7Q807ZY59ZH223KHGZ2/deletion-mark.json" 404 70 "GET http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EF20P7Q807ZY59ZH223KHGZ2/deletion-mark.json" "tx225e890384cf47fb9ff7a-0063ec8869" "proxy-server 3661449" 0.0008 "-" 1903 0
[08:59:05] indeed, looks like there's more logging activity concentrated on fewer hosts now with thanos-be1004 out of the rings
[09:00:08] 31,416,144 of the lines in server.log are 404s which feels like a lot
[09:01:33] heh, we could drop those on the floor, we have precedent / prior art for spammy clients/services
[09:04:38] the other thing I noticed is that the object filesystems are now at about 88%, are you comfortable with that? IIRC you mentioned to expand next fiscal
[09:06:06] godog: IIRC that's about where I expected we'd be once two nodes were drained, with a view to expansion next fiscal year. If it turns out that actually we fill / with logs before we fill the actual swift storage, I can restore the two nodes - keeping thanos working beats enabling MOSS :-/
[09:08:26] we can drop those spammy 404s on the floor if needed, I was thinking of "organic" growth of space used by objects
[09:10:41] though just something I noticed, I don't know how much leeway there is in terms of weeks/months before uncomfortable levels of fs usage
[09:11:03] godog: possibly as well we should consider removing "delaycompress" from the logrotate configuration for swift?
[09:11:34] godog: shall I undrain the two thanos backends, then? It's not like I'm ever going to have time to look at MOSS at this rate
[09:12:24] uncomfortable> when I started at WMF there were quite a lot in the 90s of % full
[09:14:51] heh re: undrain personally I would be uncomfortable with 88% and realistically ~5-6mo before new hardware is available
[09:15:24] 88% and growing, that is
[09:31:36] godog: what do you think about nodelaycompress?
[09:36:02] Emperor: seems fine assuming rsyslogd is doing the right thing and actually closing the old files "soon"
[09:38:04] I think the postrotate script takes care of that by running /usr/lib/rsyslog/rsyslog-rotate
[09:38:43] godog: it used to not work, we had to fix it back in https://phabricator.wikimedia.org/T301657#7718509
[09:39:57] doh of course, I forgot all about that
[09:40:13] seems fine then
[09:51:56] I'll get some CRs in shortly
[09:53:12] godog: DYK if the "Revert" UI in Gerrit Just Does It, or if it makes a new change for review?
[09:54:00] (I think the latter)
[09:55:13] indeed so.
[09:59:14] indeed the latter
[10:21:11] rebooting thanos-be2001, filling / has made all sorts of things sad.
[10:30:35] Hm, rsyslog still unhappy
[10:32:18] it's just SEGVing on startup :(
[10:35:23] Looks to be SEGVing while trying to load old syslog entries in /var/spool/rsyslog, so going to try moving those out of the way
[10:38:23] That seems to have helped
[10:39:58] neat
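A rough sketch of the workaround just described: stop rsyslog, move the queued spool files aside (keeping them for a possible bug report), and start it again. The backup destination below is arbitrary and just illustrative:

    # Sketch of moving rsyslog's old disk-queue files out of its way; the
    # backup location is an arbitrary choice -- anywhere off /var/spool/rsyslog.
    sudo systemctl stop rsyslog
    sudo mkdir -p /root/rsyslog-spool-backup
    sudo sh -c 'mv /var/spool/rsyslog/* /root/rsyslog-spool-backup/'
    sudo systemctl start rsyslog
    # Confirm it stays up this time rather than SEGVing on startup.
    systemctl status rsyslog --no-pager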
[10:45:10] godog: I note in passing that /srv/swift-storage/sda3 is pretty full on thanos-be2001 because it has two very large databases in tmp dating from 9th Feb...
[10:45:55] [dunno if it's plausible that swift has forgotten it ought to be clearing those up]
[10:47:27] could be yeah
[11:13:56] godog: how would you feel about removing one or other?
[11:15:12] (or both)
[11:18:28] [I see the swift replicators should get to it eventually, but we might want to act first]
[11:20:23] eventually - reclaim_age, which I think is a week
[11:20:32] so tomorrow
[11:20:48] yeah I think that's fine to remove the untouched tmp dbs Emperor
[11:21:11] will do
[11:41:36] godog: I kept the contents of /var/spool/rsyslog in ~mvernon/rsyslogjunk on thanos-be2001 ; would they be OK to include in a bug report about rsyslog SEGVing? they're log entries relating to object-server on that host
[12:30:45] I am testing the new packages for mediabackups on ms-backup1001
[12:31:23] next will be doing a production update of testwiki backup there
[12:40:48] Emperor: yeah I think so, should be fine, i.e. no private data
[12:41:04] I'll take a quick look just in case
[12:42:57] yeah totally fine
[15:35:29] Emperor: how to do this now? https://wikitech.wikimedia.org/wiki/Swift/How_To#Individual_Commands_-_interacting_with_Swift
[15:36:31] I'm guessing this will no longer work? https://gerrit.wikimedia.org/r/c/operations/puppet/+/773298/4/hieradata/common/profile/swift.yaml
[15:38:22] jynus: meeting now, sorry
[15:38:47] no worries, going for lunch, we can talk later
[15:56:51] jynus: instructions on wikitech are still current - log onto a stats_reporter_host, . the relevant credential file in /etc/swift
[17:01:57] oh, maybe I was just on the wrong host
[17:02:38] thank you
[17:04:32] it's ms-fe1009 or ms-fe2009 depending on which cluster you want
[17:06:04] gerrit> no, you'll need to update hieradata/common/profile/swift.yaml now
[17:06:54] oh, no, ignore me, that is still correct, it's where the secret bits go that's changed
[17:07:18] https://wikitech.wikimedia.org/wiki/Swift/How_To#Rollover_a_Swift_key should point you at the right places
[17:16:13] as it is late today, I will make sure things are working fine with an existing identity, and will deploy the new account when you are around tomorrow, in case something goes wrong before starting the big backup
[17:18:59] working with swift for me is the easy part (you may disagree); it is the mw logic that is driving me crazy
[17:32:08] fair enough :)
[20:58:53] Amir1: just to confirm, is it ok to delete the mariadb104-test project? if so, could you please file a task for that so I get a proper paper trail for it?
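Rounding off jynus's question from earlier about running individual swift commands: a minimal sketch of the workflow described at 15:56, i.e. on ms-fe1009 (eqiad) or ms-fe2009 (codfw), source a credential file from /etc/swift and then use the standard swift client. The credential file name below is a placeholder, not a real one, and the exact variables it exports are assumed:

    # On a stats_reporter host (ms-fe1009 / ms-fe2009); the file name is a
    # placeholder -- use whichever account file actually exists in /etc/swift.
    . /etc/swift/account_AUTH_example.env
    # The sourced file is expected to export the swift client's auth settings
    # (ST_AUTH / ST_USER / ST_KEY or similar), after which the CLI works directly:
    swift stat      # account / cluster overview
    swift list      # containers visible to this account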