[00:05:32] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 72.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:14:48] FIRING: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (2m 10s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[00:14:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (2m 10s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[00:19:32] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[00:19:48] RESOLVED: MysqlReplicationLag: MySQL instance db1206:9104@s1 has too large replication lag (2m 10s). Its replication source is db1163.eqiad.wmnet. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[00:19:48] RESOLVED: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db1206:9104 has too large replication lag (2m 10s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1206&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[02:20:32] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 14.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[02:22:34] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:34:34] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 15.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:35:34] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[03:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:02:32] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 25.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[04:10:34] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[06:03:34] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 19 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[06:06:34] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[07:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 38.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[08:19:35] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[08:28:55] Emperor: o/ I see that the patch is merged, going to provision ms-be2081 with uefi and then see if I can flip disks to JBOD via the BMC's webui
[08:29:14] then I'll ping dcops to complete the work
[08:29:32] I want to do some tests with ms-be2083, like rebooting etc..
[08:29:39] just to be sure that nothing weird comes up
[08:36:23] rebooted it a couple of times, so far nothing weird (like PXE kicking in etc..)
[08:37:05] * Emperor continues to cross fingers
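As a sanity check on the reboot tests above ("nothing weird, like PXE kicking in"), the armed boot override can also be read straight from the BMC over Redfish. A minimal sketch, assuming a standard DMTF Redfish endpoint; the BMC address, credentials and the /redfish/v1/Systems/1 path are placeholders, and this is not the tooling actually used here:

```python
"""Minimal sketch, not production tooling: read the Redfish Boot block of a
system to confirm no one-time PXE override is still armed after a reboot.
The BMC address, credentials and the '/redfish/v1/Systems/1' path are
placeholders; real BMCs may expose a different system ID."""

import requests

BMC = "https://192.0.2.10"   # placeholder BMC / management address
AUTH = ("root", "********")  # placeholder credentials

resp = requests.get(f"{BMC}/redfish/v1/Systems/1", auth=AUTH, verify=False, timeout=30)
resp.raise_for_status()
boot = resp.json().get("Boot", {})

# Standard DMTF Redfish properties: 'Once' means a one-shot override is armed,
# 'Disabled' means the next boot follows the normal boot order.
print("BootSourceOverrideEnabled:", boot.get("BootSourceOverrideEnabled"))
print("BootSourceOverrideTarget:", boot.get("BootSourceOverrideTarget"))

if boot.get("BootSourceOverrideEnabled") != "Disabled" and boot.get("BootSourceOverrideTarget") == "Pxe":
    print("WARNING: a PXE boot override is still set; the next reboot would PXE again")
```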
[09:03:35] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 18.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:05:35] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[09:09:51] Emperor: it seems to be working, but I have the sensation that sometimes (still not clear why/when) reimage does the PXE boot two times, ending up two times in d-i
[09:10:15] not a huge blocker but not great either, maybe something uefi/supermicro specific
[09:18:50] Mmm
[09:38:28] ah yes it may have happened, since reimage is stuck while trying to generate the puppet cert
[10:16:35] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 25.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[10:16:58] Emperor: we may have a lead, I'll test a path with other nodes
[10:18:27] Cool :)
[10:18:35] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[10:39:49] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal..check' raised: Not all services are recovered: ms-be2081:Dell PowerEdge RAID Controller
[10:39:58] Emperor: lol --^
[10:40:29] le sigh
[10:40:30] we may need to fix that for the new hosts
[10:40:38] but the reimage worked afaics
[10:41:23] btullis: an-redacteddb1001.eqiad.wmnet broke replication because of the schema change, I will revert it and we should be good (it will take around 2 days)
[10:41:24] jolly good
[10:42:21] marostegui: OK, thanks. Will it need reapplying after that?
[10:43:06] btullis: Yes, but we will have to do it through the intermediate master, so it will work. We were basically trying to do it to avoid a delay that can go up to 10 days or so (as there are many layers in between)
[10:43:49] OK, got it. Thanks. As long as we can try to avoid downtime around the 1st/2nd of the month, this should be fine.
[11:01:16] Emperor: ok, I confirm that we cannot manually set JBOD via the BMC's Web UI, so I'll ping dcops
[11:07:02] btullis: yeah, in any case there will not be downtime. If replication happens to be lagging behind, there's no downtime, it is "just" not up-to-date data. I don't know if that makes any difference. But in any case, I think I will be ready to run the schema change on Monday, so even if it takes 10-12 days, we should still be good for the Dec run
[11:22:39] elukey: is that booting BIOS or EFI?
[11:33:41] marostegui: Ack. Many thanks.
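For context on the recurring db1206 lag alerts and the pt-heartbeat-wikimedia.service unit above: the pt-heartbeat approach measures lag by comparing a timestamp row written on the primary against the current time on the replica. A minimal sketch of that calculation, assuming pt-heartbeat's conventional heartbeat.heartbeat table and UTC timestamps; the host, credentials and query are placeholders, and this is only the idea behind the check, not the production MysqlReplicationLagPtHeartbeat implementation:

```python
"""Minimal sketch: estimate replication lag from a pt-heartbeat table.
Host, credentials and the heartbeat schema (ts column) follow pt-heartbeat
conventions but are assumptions here, not the production check."""

from datetime import datetime, timezone

import pymysql

conn = pymysql.connect(host="replica.example", user="monitor",
                       password="********", database="heartbeat")
try:
    with conn.cursor() as cur:
        # pt-heartbeat keeps one row per originating server_id with an
        # ISO-8601 timestamp written on the primary; the newest is enough here.
        cur.execute("SELECT MAX(ts) FROM heartbeat")
        (ts,) = cur.fetchone()
finally:
    conn.close()

# ts looks like '2024-11-01T08:16:35.000510' when pt-heartbeat runs with --utc;
# compare it against the current UTC time to get the lag in seconds.
written = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
lag = (datetime.now(timezone.utc) - written).total_seconds()
print(f"replication lag: {lag:.1f}s")
```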
[11:37:07] Can someone give me a general overview of what is the current status of parsercache spares?
[11:37:22] I am failing to understand the current situation and what's pending
[11:38:15] For instance, pc1013 is nowhere to be found
[11:55:32] Emperor: EFI :)
[11:58:36] elukey: huh, I thought setting all the disks to JBOD was meant to be doable under EFI booting :(
[12:51:41] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 91.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[12:58:41] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104
[13:30:28] arnaudb: https://phabricator.wikimedia.org/T377276#10298957 this task was reopened because there are still notes on a host that is pooled and it can be confusing
[13:31:10] ah ack!
[13:31:15] arnaudb: https://phabricator.wikimedia.org/T374026 this one was reopened because the task mentioned as a blocker at https://phabricator.wikimedia.org/T374026#10157472 is resolved
[13:59:44] Emperor: sorry, my bad, I didn't explain myself correctly - doing it via the BMC Web UI is not possible, not even with EFI, but via "boot" + special-key + utility it is possible (Papaul knows how to do it, I don't :D)
[14:43:28] elukey: for future improvements it would be nice to see if there is a way to do it via Redfish
[14:43:46] now that we're in close contact with the vendor :)
[14:55:22] elukey: Mmm, that's kind of sad, since it presumably means it's not possible to hot-swap a drive if you have to reboot to be able to set it to JBOD
[14:55:56] [FTAOD the thanos capacity is getting bad enough that this isn't a blocker if the thanos-be nodes ever actually show up]
[15:14:41] volans: yes yes definitely, one step at a time!
[15:14:53] sure :)
[15:22:10] Emperor: as Riccardo said, there is the Redfish option to explore, maybe there is a way to do it via an HTTP API that can be added to spicerack etc..
[15:39:56] here's hoping...
[15:40:17] hardware is a bad idea
[15:41:07] so to update you, we still have the issue that with UEFI the reimage does the debian install two times
[15:41:20] because the pxe-boot setting (meant to apply only one time) ends up being executed two times
[15:41:36] apparently it is not cleared when d-i reboots the first time
[15:43:57] OK, that's annoying but not a deal-breaker
[15:44:19] (future me will hate this when I have to reimage for the next distro upgrade 😂)
[15:44:57] well it is, since the reimage doesn't finish correctly: we inject something for puppet during d-i and we don't do it the second time
[15:45:12] so reimage is not able to correctly sign the new puppet client cert etc...
[15:49:21] :sadpanda:
[15:57:35] but also, soon we will be able to remove that bit as all hosts will be puppet7
[15:57:41] until the next puppet migration
[15:58:03] (not a version bump, a real migration of infra, so hopefully far in the future)
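On the Redfish idea from 14:43 and 15:22 (driving the JBOD flip over an HTTP API, for example from spicerack), a first step would be enumerating what the BMC exposes for storage. A minimal sketch using only standard DMTF Redfish resources; the BMC address and credentials are placeholders, and the actual convert-to-JBOD/non-RAID action is vendor-specific and deliberately not shown:

```python
"""Minimal sketch: walk Systems -> Storage -> Drives over Redfish to see what
a BMC reports about its controllers and disks. Standard DMTF resources only;
the vendor-specific JBOD/non-RAID conversion action is not attempted here."""

import requests

BMC = "https://192.0.2.10"   # placeholder BMC / management address
AUTH = ("root", "********")  # placeholder credentials


def rf_get(path: str) -> dict:
    """GET a Redfish resource and return the decoded JSON body."""
    resp = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    return resp.json()


for member in rf_get("/redfish/v1/Systems")["Members"]:
    system = rf_get(member["@odata.id"])
    storage_link = system.get("Storage", {}).get("@odata.id")
    if not storage_link:
        continue
    for ctrl_ref in rf_get(storage_link)["Members"]:
        ctrl = rf_get(ctrl_ref["@odata.id"])
        print(f"controller {ctrl.get('Id')}: {ctrl.get('Name')}")
        for drive_ref in ctrl.get("Drives", []):
            drive = rf_get(drive_ref["@odata.id"])
            print(
                f"  drive {drive.get('Id')}: {drive.get('MediaType')}, "
                f"{drive.get('CapacityBytes')} bytes, health "
                f"{drive.get('Status', {}).get('Health')}"
            )
```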
[16:35:32] volans: OOI, is this puppet-injection something that could be done away with if the default shifted to be puppet 7 not 5? [excuse a probably very stupid question]
[16:40:03] Emperor: yes, once we get rid of puppet 5 we can simplify both the reimage code and the late_command.sh to avoid any injection
[16:41:44] volans: presumably flipping the default to 7 (and needing something injected to get puppet 5) would be too disruptive?
[16:42:46] that's already the case
[16:43:26] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/autoinstall/scripts/late_command.sh#24
[16:43:30] Oh, OK, I misunderstood what elukey said about injecting something for puppet
[16:44:32] I had actually forgotten about that line of code, so yeah, maybe we should re-check the error on the second d-i run
[16:45:39] it does have a delay of 30s times 10 retries (same file, a few lines above)
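The "delay of 30s times 10 retries" behaviour mentioned at 16:45 lives in the linked late_command.sh shell script; purely to illustrate that wait pattern, here is a minimal Python sketch, where puppet_ca_reachable() is a hypothetical stand-in for whatever condition the real script actually polls:

```python
"""Minimal sketch of a '10 attempts, 30s apart' wait loop, for illustration
only; the real logic is shell code in the linked late_command.sh."""

import time


def puppet_ca_reachable() -> bool:
    """Hypothetical stand-in for the condition the real script checks."""
    return False


def wait_with_retries(check, attempts: int = 10, delay: int = 30) -> bool:
    """Run `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            print(f"attempt {attempt}/{attempts} failed, retrying in {delay}s")
            time.sleep(delay)
    return False


if __name__ == "__main__":
    if not wait_with_retries(puppet_ca_reachable):
        raise SystemExit("gave up after 10 attempts")
```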