[08:06:45] marostegui: I'm looking at db1247 after the cloning and most metrics look ok, I'm just wondering why on some hosts this metric is often erratic and sometimes flattens: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=%24__all&var-server=db1247&from=now-30d&to=now&timezone=utc&var-port=9104&refresh=1m&viewPanel=panel-14
[08:22:05] <_joe_> federico3: the innodb checkpoint "age" depends on traffic to/from the storage engine, IIRC. So a) when the server is turned off, nothing will be reported, and b) when the server is depooled from traffic, you'd expect it to be flat
[08:22:52] <_joe_> and the sawtooth nature of it in other situations, I guess, depends on the frequency of checkpoints, but I'd check the manual
[08:23:27] _joe_: the server has been depooled since the restart, but it still had high/erratic values and then flattened with no config change. I'm looking around and it seems to happen on other hosts, making the distribution pretty much bimodal
[08:24:11] <_joe_> oh that is indeed curious
[08:24:24] <_joe_> I'd go look at the manual about what exactly is measured there
[08:24:40] <_joe_> but also I guess nice to know but not alarming, at least
[08:25:57] the checkpoint age measures the amount of "dirty writes" not yet written to the main tablespace on disk
[08:26:16] if it was continuously full, it would create a bottleneck
[08:26:36] but if it was continuously empty it would create inefficient writes
[08:26:57] different checkpointing algorithms have different methodologies to handle it
[08:28:01] we are not -generally- write-bound, so as long as it is not full all the time, we are ok
[08:28:28] if it was full all the time (I mean the transaction log) we could increase its size or change the flushing method
[08:29:41] if it was low all the time, either there are no writes and we are wasting disk space, or we can tune the flushing algorithm
[08:30:17] This is a good summary by Fred: https://lefred.be/content/a-graph-a-day-keeps-the-doctor-away-mysql-checkpoint-age/
[08:31:58] https://grafana-rw.wikimedia.org/d/a972e119-a791-4c4f-9de7-c6a6be58e1e2/federico-s-mariadb-status?orgId=1&from=now-24h&to=now&timezone=utc&var-query0&viewPanel=panel-7
[08:32:21] thanks jynus
[08:32:53] Think of the redo log as a "disk buffer" to smooth actual writes; normally the transaction log is written synchronously and the tablespace asynchronously
[08:36:20] there are 3 hosts with warnings due to high memory usage, BTW
[08:36:25] es hosts
[09:16:46] Hi folks, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148280 please, to add apus-be1004 to the eqiad apus cluster? There are notes in the commit text, but it should be straightforward :)
[09:20:19] thanks :)
[09:45:23] I'm going to pool in db1247 - sounds good?
[10:01:08] +1
[10:36:55] Can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148296 please? This is a bit more complicated as it's updating an epp template, but the commit message should explain what it's trying to do, and you can see from https://puppet-compiler.wmflabs.org/output/1148296/3938/moss-be1001.eqiad.wmnet/index.html that it correctly adds the NVMe label to the new host (and otherwise doesn't change the output)
[11:52:19] marostegui: this means I can just wait a few mins for the next run? https://phabricator.wikimedia.org/T393612#10838650
[11:52:32] Yeah
[11:52:37] Or force the runs, up to you
[12:51:30] federico3: you aren't on -operations?
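The checkpoint age discussed above can also be read directly on the server rather than from Grafana. A minimal sketch, assuming a MariaDB build that exposes the Innodb_checkpoint_age / Innodb_checkpoint_max_age status variables (on versions without them the same figures appear in the LOG section of SHOW ENGINE INNODB STATUS) and a local mysql client with credentials already configured:

```bash
#!/bin/bash
# Sketch only: report how full the redo log is on the local instance.
# The variable names are standard MariaDB/InnoDB status counters; the
# socket and credentials are whatever the local client defaults to.

age=$(mysql -BN -e "SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint_age'" | awk '{print $2}')
max=$(mysql -BN -e "SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint_max_age'" | awk '{print $2}')

echo "checkpoint age: ${age} bytes (max ${max})"

# Near 100% all the time: the redo log is a bottleneck (see the tuning
# knobs below). Near 0% on a pooled host taking writes: the log is
# oversized or flushing is more eager than it needs to be.
awk -v a="$age" -v m="$max" 'BEGIN { if (m > 0) printf "redo log fill: %.1f%%\n", 100 * a / m }'
```

A flat line near zero on a depooled replica, as observed on db1247, is consistent with the explanation above: with no incoming writes there are no dirty pages, so there is no checkpoint age to report.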
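For the "increase its size or change the flushing method" remark, these are the knobs usually involved. The names are standard InnoDB system variables; the query is only an illustration of what to inspect, not a tuning recommendation:

```bash
# Illustrative only: list the InnoDB settings that govern redo log size and
# flushing behaviour. Changing innodb_log_file_size needs a restart on older
# versions; the others are dynamic.
mysql -e "
  SHOW GLOBAL VARIABLES WHERE Variable_name IN (
    'innodb_log_file_size',        -- redo log capacity, i.e. the ceiling for checkpoint age
    'innodb_io_capacity',          -- how much background flushing I/O InnoDB may use
    'innodb_max_dirty_pages_pct',  -- dirty-page percentage at which flushing turns aggressive
    'innodb_adaptive_flushing'     -- pace flushing from the redo generation rate
  );"
```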
[12:52:08] federico3: There are some alerts about uncommitted dbctl changes
[12:53:03] oh the IRC client failed auth and could not join back -_-'
[12:53:27] it's probably the pooling in of db1247
[12:57:26] can you check and fix?
[13:09:55] I'm afraid I'm also looking for a +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148330 please? It removes thanos-fe100[1-3] from puppet. The CR text shows how I checked they were depooled and are no longer mentioned in puppet.
[13:10:07] checking the new one
[13:11:10] Thanks!
[13:12:44] one question, wouldn't you want to separate the ring manager change and the removal into different patches to have more granular control? Sorry, I may be showing I know very little about swift
[13:13:51] I know one would be a trivial change, but I'm wondering if it could get into some race condition due to the asynchronous nature of puppet
[13:14:10] even if only for monitoring purposes
[13:15:24] (maybe the master property is a noop, I lack that context)
[13:16:27] The race comes from moving the stats_reporter host, actually, where I'll want to run puppet on the new and old hosts in a short window to avoid alert noise. That will also move the ring manager code across to the new server (and have the puppet servers start to look there for the rings), then there's a manual process to set up the new ring manager (cf https://wikitech.wikimedia.org/wiki/Swift/How_To#Reimage/provision_the_ring_manager_proxy_host ). Ring update and rollout is async anyway.
[13:16:50] ok, so that's the context I lacked
[13:17:09] I +1ed it
[13:17:19] checked there are no leftovers anywhere (at least in puppet)
[13:17:36] Thanks. I have a bunch of meetings coming up, so I think I'll probably actually deploy the thanos changes tomorrow.
[13:17:53] no worries :-D
[14:07:24] FIRING: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:24] RESOLVED: SystemdUnitFailed: swift_dispersion_stats.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
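On the "uncommitted dbctl changes" alert above: a minimal sketch of the usual pool-then-commit sequence, assuming the dbctl pool/diff/commit workflow documented on Wikitech. The instance name comes from this log; the percentage and commit message are placeholders, and flag spellings may differ by dbctl version.

```bash
# Sketch of clearing an "uncommitted dbctl changes" alert after pooling a
# host: review what is pending, then publish it. Placeholder values only.

sudo dbctl instance db1247 pool -p 100                    # mark the instance pooled
sudo dbctl config diff                                    # uncommitted edits sitting here trigger the alert
sudo dbctl config commit -m "Pool db1247 after cloning"   # publish the change and clear the alert
```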