[01:17:57] > Query OK, 1060788593 rows affected (9 hours 8 min 30.883 sec)
[09:22:29] what's the status of db2202 ( T355422 )? it has had puppet disabled for 22 days! Puppet should never be disabled for long periods, and now it's gone from puppetdb/monitoring/everything; it's a ghost host only reported by a Netbox report, and it's also spamming root@ daily due to an expired cert for the debmonitor client
[09:22:30] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[09:28:39] not sure, it's supposed to become a clone of db2102 but is still pending
[09:42:28] hosts should not be online in this status, and ideally it should be reimaged: even after re-enabling puppet there is no guarantee that it has not skipped some removal done with 2 puppet patches (absent resources + code removal)
[09:56:38] ack
[09:58:18] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[09:59:18] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[10:01:13] related patch to avoid any mistake: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014650
[10:08:21] A fatal error was detected on a component at bus 23 device 0 function 0. :-(
[10:08:38] same question for moss-be200[1-2], reported by Netbox as missing in puppetdb, but I cannot access those, so I don't know their status
[10:51:27] they're currently in need of reimaging; but I'm using moss-be100[1-3] for the dev work first
[10:51:53] I was using them for testing the RAID0-JBOD conversion a while back
[10:52:09] but why are they up yet not in puppetdb, and why can't I log in via mgmt?
[11:07:01] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:15] volans: I think because a reimage failed (because the conversion cookbook was bust)
[11:22:10] s7 codfw snapshot wrong_size 49 minutes ago 1.2 TB -11.1 % The previous backup had a size of 1.4 TB, a change larger than 5.0%.
[11:26:12] Emperor: ack, but hosts should not be powered on if not "managed", so I suggest either powering them off or reimaging, I guess
[11:31:07] reimaging is somewhere on the TODO, I promise...
[11:34:35] volans: not sure why mgmt would be unavailable (or, um, how easily to power off if it's awol)
[11:39:39] mgmt is available
[11:39:43] but login via mgmt isn't
[11:39:53] it doesn't give me a prompt for the password
[11:44:14] OIC
[11:46:20] volans: so serveraction powerdown would leave them in a state you'd be happier with?
[11:47:41] the netbox report will still report it :D
[11:48:15] at least if it's in active status; I think we skip some statuses in the report
[11:49:23] is there a better state you'd like them in, assuming I'm not going to reimage them this week?
[11:51:57] do you know in which state they are? as in, is there an OS running or are they stuck in some weird state?
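An aside on the "serveraction powerdown" mentioned at 11:46:20: that is the Dell iDRAC `racadm` sub-command for powering a box off out-of-band via the mgmt interface. A minimal sketch of how that would typically be invoked, assuming the standard remote racadm client; the mgmt hostname and credentials below are placeholders, not the real ones for moss-be200[1-2]:

```
# Assumption: remote racadm against the host's mgmt (iDRAC) interface; hostname and credentials are placeholders.
racadm -r moss-be2001.mgmt.example -u root -p 'PASSWORD' serveraction powerstatus
racadm -r moss-be2001.mgmt.example -u root -p 'PASSWORD' serveraction powerdown   # hard power off
```

The same `racadm serveraction powerdown` can also be run from a shell session on the iDRAC itself, when SSH login to mgmt works.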
[11:57:17] I think (it's been a while) the installer failed because the disks were in the wrong layout
[11:59:08] so if nothing is running the only difference is power consumption, that's up to you; as for the report, putting them in failed status in netbox would prevent them from showing up
[12:00:00] is "failed" correct for "hardware is OK, OS needs work"? I'd naturally read that as "hardware failed"
[12:06:58] it's the only one available :)
[12:25:24] OK then
[12:33:54] volans: {{done}}
[13:17:05] I'll mute alerts for db2100 for a week on karma
[13:18:21] ↑ 6e2d3463-b485-4375-b8c0-8468d8927b52 ↑
[16:46:57] I've created T361133 to track the current replication issue we're having on x1, the situation is a bit weird. jynus, could I trouble you tomorrow to see if you have an idea about what's going on? Both Amir1 and I checked with no idea yet
[16:46:58] T361133: replication failure on db2115 and db2215 - https://phabricator.wikimedia.org/T361133
[18:28:20] I went on vacation 5 hours ago, but I have mitigated to the best of my ability
[18:28:51] the x1 master on codfw requires a restart to fix the soft-lock, but I'm avoiding doing it for now
[18:29:23] I have stopped replication on both db2115 and db2215, do not start it until a reboot or a failover is done
[18:30:50] can someone put https://phabricator.wikimedia.org/T361133 in the topic? I am not an op on this channel
[18:31:26] and lukas is not around, nor manuel, etc.
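A note on the mitigation at 18:29:23: on MariaDB, pausing replication on a replica is done with STOP SLAVE, and it stays stopped until someone runs START SLAVE. A minimal sketch of the kind of commands involved, assuming local socket access on each replica; this is illustrative, not a record of the exact commands that were run:

```
# Assumption: run locally on db2115 and db2215 with root/socket auth; illustrative only.
sudo mysql -e "STOP SLAVE;"
# Verify both replication threads are stopped:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:'
# Do NOT run "START SLAVE;" until the x1 codfw master has been restarted or failed over.
```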