[01:17:57] > Query OK, 1060788593 rows affected (9 hours 8 min 30.883 sec)
[09:22:29] what's the status of db2202 ( T355422 )? it has had puppet disabled for 22 days! Puppet should never be disabled for long periods, and now it's gone from puppetdb/monitoring/everything; it's a ghost host only reported by a Netbox report, and it's also spamming root@ daily due to an expired cert for the debmonitor client
[09:22:30] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[09:28:39] not sure, it's supposed to become a clone of db2102 but is still pending
[09:42:28] hosts should not be online in this status, and ideally it should be reimaged: even after re-enabling puppet there is no guarantee that it has not skipped some removal done with 2 puppet patches (absent resources + code removal)
[09:56:38] ack
[09:58:18] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 5.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[09:59:18] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[10:01:13] related patch to avoid any mistake: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014650
[10:08:21] A fatal error was detected on a component at bus 23 device 0 function 0. :-(
[10:08:38] same question for moss-be200[1-2], reported by Netbox as missing in puppetdb, but I cannot access those, so I don't know their status
[10:51:27] they're currently in need of reimaging; but I'm using moss-be100[1-3] for the dev work first
[10:51:53] I was using them for testing the RAID0-JBOD conversion a while back
[10:52:09] but why are they up yet not in puppetdb, and why can't I log in via mgmt?
[11:07:01] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:15] volans: I think because a reimage failed (because the conversion cookbook was bust)
[11:22:10] s7 codfw snapshot wrong_size 49 minutes ago 1.2 TB -11.1 % The previous backup had a size of 1.4 TB, a change larger than 5.0%.
[11:26:12] Emperor: ack, but hosts should not be powered on if not "managed", so I suggest either powering them off or reimaging, I guess
[11:31:07] reimaging is somewhere on the TODO, I promise...
[11:34:35] volans: not sure why mgmt would be unavailable (or, um, how easily to power off if it's awol)
[11:39:39] mgmt is available
[11:39:43] but login via mgmt isn't
[11:39:53] it doesn't give me a prompt for the password
[11:44:14] OIC
[11:46:20] volans: so serveraction powerdown would leave them in a state you'd be happier with?
[11:47:41] the netbox report will still report it :D
[11:48:15] at least if it's in active status; I think we skip some statuses in the report
[11:49:23] is there a better state you'd like them in, assuming I'm not going to reimage them this week?
[11:51:57] do you know in which state they are? as in, is there an OS running or are they stuck in some weird state?
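An aside on the "serveraction powerdown" mentioned at 11:46:20: that is the Dell iDRAC `racadm` sub-command for powering a box off out-of-band via the mgmt interface. A minimal sketch of how that would typically be invoked, assuming the standard remote racadm client; the mgmt hostname and credentials below are placeholders, not the real ones for moss-be200[1-2]:

```
# Assumption: remote racadm against the host's mgmt (iDRAC) interface; hostname and credentials are placeholders.
racadm -r moss-be2001.mgmt.example -u root -p 'PASSWORD' serveraction powerstatus
racadm -r moss-be2001.mgmt.example -u root -p 'PASSWORD' serveraction powerdown   # hard power off
```

The same `racadm serveraction powerdown` can also be run from a shell session on the iDRAC itself, when SSH login to mgmt works.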
[11:57:17] I think (it's been a while) the installer failed because the disks were in the wrong layout
[11:59:08] so if nothing is running the only difference is power consumption, that's up to you; as for the report, putting them in failed status in netbox would prevent them from showing up
[12:00:00] is "failed" correct for "hardware is OK, OS needs work"? I'd naturally read that as "hardware failed"
[12:06:58] it's the only one available :)
[12:25:24] OK then
[12:33:54] volans: {{done}}
[13:17:05] I'll mute alerts for db2100 for a week on karma
[13:18:21] ↑ 6e2d3463-b485-4375-b8c0-8468d8927b52 ↑
[16:46:57] I've created T361133 to track the current replication issue we're having on x1, the situation is a bit weird. jynus, could I trouble you tomorrow to see if you have an idea about what's going on? Both Amir1 and I checked with no idea yet
[16:46:58] T361133: replication failure on db2115 and db2215 - https://phabricator.wikimedia.org/T361133
[18:28:20] I went on vacation 5 hours ago, but I have mitigated to the best of my ability
[18:28:51] the x1 master on codfw requires a restart to fix the soft-lock, but I'm avoiding doing it for now
[18:29:23] I have stopped replication on both db2115 and db2215, do not start it until a reboot or a failover is done
[18:30:50] can someone put https://phabricator.wikimedia.org/T361133 in the topic? I am not an op on this channel
[18:31:26] and lukas is not around, nor manuel, etc.
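A note on the mitigation at 18:29:23: on MariaDB, pausing replication on a replica is done with STOP SLAVE, and it stays stopped until someone runs START SLAVE. A minimal sketch of the kind of commands involved, assuming local socket access on each replica; this is illustrative, not a record of the exact commands that were run:

```
# Assumption: run locally on db2115 and db2215 with root/socket auth; illustrative only.
sudo mysql -e "STOP SLAVE;"
# Verify both replication threads are stopped:
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running:'
# Do NOT run "START SLAVE;" until the x1 codfw master has been restarted or failed over.
```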