[09:02:22] To update on T357333 - those alerts will now fire every 24h instead of every 4h, which is a start
[09:02:23] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[11:55:01] PROBLEM - MariaDB sustained replica lag on s7 on db2122 is CRITICAL: 10 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[11:56:05] PROBLEM - MariaDB sustained replica lag on s7 on db1227 is CRITICAL: 3.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[11:57:03] RECOVERY - MariaDB sustained replica lag on s7 on db2122 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[11:57:05] RECOVERY - MariaDB sustained replica lag on s7 on db1227 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104
[12:01:15] s7 issues?
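(Editor's note: the alert strings above encode their thresholds inline - "10 ge 2" means the observed lag of 10s is at or above the 2s critical threshold, and recoveries print "(C)2 ge (W)1 ge <value>". A minimal sketch of that comparison, with the 2s/1s thresholds taken from the alert text; the function name is illustrative, not part of the real alerting code.)

```python
# Classify a replica-lag sample the way the alert strings read:
# "ge" is >=, critical is 2s and warning is 1s (values from the alert text).

CRITICAL_S = 2.0
WARNING_S = 1.0

def classify_lag(lag_seconds: float) -> str:
    """Map an observed replica lag (seconds) to an alert state."""
    if lag_seconds >= CRITICAL_S:
        return "CRITICAL"
    if lag_seconds >= WARNING_S:
        return "WARNING"
    return "OK"

print(classify_lag(10))   # the db2122 sample -> CRITICAL
print(classify_lag(0))    # the recovery sample -> OK
```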
[12:01:50] up to 35 seconds of lag
[12:03:26] app errors were relatively small: https://logstash.wikimedia.org/goto/54c375f65c0892c6c48dcbc327395e88
[12:03:41] mostly affecting the jobqueue only
[12:11:53] I don't see anything obvious in the logs, checking graphs
[12:14:27] interesting, it doesn't seem to be writes, but read_rnd_next
[12:15:51] and the long-running queries log shows nothing useful for s7
[12:17:32] we lack more fine-grained observability
[12:22:57] looking at sys, my best guesses would be echo/notification at meta or centralauth
[12:23:26] we have the metadata, but we lack the temporal dimension to be sure
[13:28:16] Emperor: regarding moss (I don't remember the new name), what is the planned underlying technology for that?
[13:28:38] will you keep swift for now?
[13:35:09] jynus: no, the plan is to run Ceph
[13:35:18] nice
[13:38:52] why do you ask?
[13:40:10] I wondered, in case it needs backups, whether I should get ready for that
[13:40:42] plus I am excited about finding out how ceph works for us
[13:41:55] fair enough :)
[13:44:08] I was also checking options for compressed filesystems, and the only thing I found was Btrfs, which I don't trust for backups atm, so maybe ceph could be used (but only if ceph wasn't the canonical place to back up to in the first place)
[13:45:08] so doing a lot of thinking, but nothing you should worry about :-D
[13:46:44] we could back swift up to a ceph cluster and vice versa :)
[13:47:21] we could, but should we?
[13:47:26] XD
[15:25:26] i'll downtime and depool db2121: s7, db2132: m1, db2145: s1, db2104: s2, db2153: s1, db2154: s8, db2175: s2, db2176: s1 (T355864) in 20min for about an hour (at least for the downtime; I'll be monitoring the phabricator task to repool hosts asap otherwise)
[15:25:26] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[15:50:06] arnaudb: thanks, appreciate that :)
[15:50:46] anytime!
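(Editor's note: `read_rnd_next` refers to the MariaDB `Handler_read_rnd_next` status counter, which increments when rows are read by scanning rather than via an index, so a spike in its rate without a matching write spike points at scan-heavy reads. A hedged sketch of how one might compute that rate from two `SHOW GLOBAL STATUS` samples; the counter name is real, but the sample values and the sampling interval are illustrative, not data from this incident.)

```python
# Estimate the Handler_read_rnd_next rate from two status samples taken
# interval_s seconds apart. A sustained high rate usually means full
# table scans. The dicts below stand in for parsed SHOW GLOBAL STATUS
# output; the values are made up for illustration.

def scan_rate(before: dict, after: dict, interval_s: float) -> float:
    """Per-second delta of Handler_read_rnd_next between two samples."""
    delta = after["Handler_read_rnd_next"] - before["Handler_read_rnd_next"]
    return delta / interval_s

before = {"Handler_read_rnd_next": 1_000_000}
after = {"Handler_read_rnd_next": 6_000_000}
print(scan_rate(before, after, interval_s=10))  # 500000.0 rows/s read via scans
```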
[16:04:36] waiting on network maintenance to deploy the backup sources patch
[16:16:07] arnaudb, jynus: network maintenance is done and hosts are reachable again
[16:16:11] thanks for the help!
[16:16:18] nice, so fast!
[16:16:21] thanks topranks!
[16:16:22] good work, topranks
[16:16:42] You can thank JennH for the quick work :)
[16:19:07] topranks: is puppet re-enabled?
[16:19:21] jynus: yes
[16:19:25] thanks
[20:51:16] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[20:52:16] RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[21:20:50] this seems to be a blip: https://grafana.wikimedia.org/goto/J94i0y2Ik?orgId=1 wdyt?
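(Editor's note: the "sustained" part of the alert name is what matters for blip-vs-real-problem calls like the one above - a single over-threshold sample that recovers a minute later is a blip, while several consecutive over-threshold samples are sustained lag. A minimal sketch of that distinction; the three-consecutive-samples rule is an assumption for illustration, not the actual alert definition.)

```python
# Distinguish a lag blip from sustained lag over a series of samples.
# CRITICAL_S comes from the alert text; SUSTAINED_SAMPLES is assumed,
# the real alert's evaluation window may differ.

CRITICAL_S = 2.0
SUSTAINED_SAMPLES = 3

def is_sustained(lag_samples: list) -> bool:
    """True if lag stayed at/above critical for enough consecutive samples."""
    run = 0
    for lag in lag_samples:
        run = run + 1 if lag >= CRITICAL_S else 0
        if run >= SUSTAINED_SAMPLES:
            return True
    return False

print(is_sustained([2.4, 0.2, 0.1]))        # single spike, recovers -> False
print(is_sustained([10.4, 5.4, 5.2, 2.6]))  # stays over threshold -> True
```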
[21:32:10] sure seems transient
[22:54:13] PROBLEM - MariaDB sustained replica lag on s4 on db1246 is CRITICAL: 10.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1246&var-port=13314
[22:54:15] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 5.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[22:54:15] PROBLEM - MariaDB sustained replica lag on s4 on db1244 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=13314
[22:55:15] RECOVERY - MariaDB sustained replica lag on s4 on db1244 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1244&var-port=13314
[22:57:15] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[22:59:18] RECOVERY - MariaDB sustained replica lag on s4 on db1246 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1246&var-port=13314
[22:59:20] RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104
[22:59:48] (PuppetFailure) firing: Puppet has failed on restbase1036:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure