[01:25:25] FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:50:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be2051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:24:48] FIRING: PuppetFailure: Puppet has failed on ms-be2051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:51:14] PROBLEM - MariaDB sustained replica lag on s4 on db2219 is CRITICAL: 216.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[05:59:14] RECOVERY - MariaDB sustained replica lag on s4 on db2219 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2219&var-port=9104
[07:50:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on ms-be2051:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:51] that's going to be more dead hardware
[08:55:08] I have so many things to do that I created a task yesterday and I don't know why or what for XD
[09:02:49] create a task today to review existing tasks :-P
[09:08:06] and assign it to v.olans ;p
[09:24:48] FIRING: PuppetFailure: Puppet has failed on ms-be2051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:27:38] volans: XDDD
[09:28:00] Seriously I have no idea why I created it
[10:30:34] PROBLEM - MariaDB sustained replica lag on s7 on db2221 is CRITICAL: 58.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2221&var-port=9104
[10:31:34] RECOVERY - MariaDB sustained replica lag on s7 on db2221 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2221&var-port=9104
[10:42:29] I need to change x1 to be SBR for a bit
[10:42:31] Amir1: ^
[10:42:52] Sure@
[10:43:15] I have nothing on x1
[10:43:20] you do
[10:43:22] https://phabricator.wikimedia.org/T385645
[10:43:23] XD
[10:51:11] I am going to have to reclone the broken x1 host, damn RBR
[10:51:54] marostegui: I was on PTO yesterday, I can do 1017 today
[10:52:28] btw I see there is some lag on clouddb1013 that I upgraded on Friday, not sure what caused it
[10:52:41] marostegui: want me to recover it from backups?
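The x1 change mentioned above is a temporary switch from row-based (RBR) to statement-based (SBR) replication, which in MariaDB comes down to the binlog_format variable. A minimal sketch of what that looks like, assuming direct mysql client access; the host name is a placeholder and a real production change would go through the usual tooling:

```bash
# Inspect the current binary log format on the primary (placeholder host).
mysql -h db-x1-primary.example -e "SHOW GLOBAL VARIABLES LIKE 'binlog_format';"

# Switch to statement-based replication (SBR). SET GLOBAL only affects
# connections opened after the change; existing sessions keep their format.
mysql -h db-x1-primary.example -e "SET GLOBAL binlog_format = 'STATEMENT';"

# Switch back to row-based replication (RBR) once done.
mysql -h db-x1-primary.example -e "SET GLOBAL binlog_format = 'ROW';"
```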
[10:53:20] dhinus: Probably schema changes or index rebuilds
[10:53:23] jynus: no, thanks, it is fine
[10:53:26] > s7 eqiad snapshot fresh 5 hours ago 916.9 GB -2.7 % —
[10:53:48] marostegui: I mean, I'm not running it :P
[10:56:05] the s2 wikireplicas are lagging because the sanitarium master is being rebuilt so that's fine, I don't know which daemon of clouddb1013 is lagging
[10:56:29] replag on clouddb1013 is going down so I will ignore it for now
[10:57:03] something is wrong on db1156 (s2) too https://orchestrator.wikimedia.org/web/cluster/alias/s2
[10:58:55] dhinus: That is an index being rebuilt, as Amir1 said
[10:59:06] [11:53:20] dhinus: Probably schema changes or index rebuilds
[10:59:07] And that ^
[10:59:17] We are rebuilding many hosts/indexes
[10:59:33] ok thanks :)
[10:59:52] I will upgrade+reboot clouddb1017, and the other clouddbs after
[11:00:19] great thanks
[11:00:25] s2
[11:00:50] I've wanted to run a schema change on s2 for half a week now but you're just rebuilding everything 😡
[11:01:06] "s2": sorry that was a search string I didn't mean to post :P
[11:01:07] Amir1: I am trying to avoid more crashes
[11:01:36] I know :P I'm just grumpy
[11:02:49] Amir1: "which daemon of clouddb1013" < s1 is lagging there, but getting better
[11:03:11] dhinus: that comes from index rebuilds
[11:03:29] all good
[11:04:07] marostegui: I think you might want to stop T385645
[11:04:07] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[11:04:15] Reedy: It is stopped
[11:05:06] It was only done on one host, which broke, and it's been depooled for a while now
[11:05:37] was still erroring ~20 mins ago
[11:06:07] Ah yes, the stupid script was still running
[11:06:08] My bad
[11:06:11] Depooling and killing it
[11:06:14] Thanks for the heads up
[11:06:21] heh, np :)
[11:06:39] I should really not work on other stuff when I am on clinic duty
[12:50:04] Amir1: https://phabricator.wikimedia.org/T385645#10525207
[12:50:47] ah
[12:51:07] damn. I think it's NULL but explicitly?
[12:51:36] yup
[12:51:49] let me backport the patch to remove that
[12:51:54] Yeah
[12:51:58] I am going to revert this on the host
[12:52:01] aka: reclone
[12:52:03] (again)
[12:52:39] sorry :(
[12:52:50] It is fine, that's why we run them on replicas first
[12:52:53] To catch these things
[13:10:35] Can I get +1s for 2 linked changes, please, to prep the old and now drained ms-be2* nodes for decom? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117535 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117536
[13:18:01] Thanks :)
[13:21:03] Amir1: no need for a full review, just an OK to proceed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1116846/4
[13:22:25] +1'ed!
[13:22:28] thanks
[13:22:43] will deploy the removals on m1 primary
[14:35:29] FYI, I'm running the file migration on s5, basically only dewiki has a lot of files (100K)
[14:35:38] you might see a lot of writes there
[14:53:11] hi folks - any objections / concerns if I release conftool soon to enable the min pooled parsercache sections safety check? (T383324)
[14:53:11] T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[14:54:06] none on my side
[14:54:14] only thank you for picking it up!
[14:54:39] no problem, and thanks for your patience with how long getting it live took :)
[15:50:15] all done. verified with `dbctl config generate` and `dbctl config diff` which, despite being non-mutating operations, exercise the same checks as a `commit` would
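For reference, a rough sketch of that verification flow with dbctl; `config generate` and `config diff` are the commands quoted above, while the instance name and commit message are placeholders and the depool/commit invocations are assumptions based on standard dbctl usage:

```bash
# Stage a depool of a parsercache instance in dbctl's local state
# (placeholder instance name).
dbctl instance pc1011 depool

# Both commands are read-only, but they run the same validation --
# including the new minimum-pooled-parsercache-sections check --
# that a commit would.
dbctl config generate
dbctl config diff

# Only an explicit commit pushes the resulting configuration live.
dbctl config commit -m "Depool pc1011 for maintenance"
```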
[15:50:33] I'll update the wikitech docs about depooling parsercache sections in a bit
[18:42:38] regarding Bacula backups: I seem to remember there was something to mark a backup::set as "keep around forever", as in: archive it and never rotate it away. Is that right? What was it? Or would that be the default behaviour if I have an existing regular backup and then just remove it from puppet?
[18:43:59] what I want is "one last full backup, keep it forever and stop making new ones"
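The question is left open here; purely as a sketch of the generic Bacula side of it, assuming bconsole access — the job name is a placeholder, and how puppet's backup::set maps onto director jobs and pools is not covered:

```bash
# Kick off one last full run of the job by hand (placeholder job name);
# "yes" skips the interactive confirmation.
echo "run job=example-host-production level=Full yes" | bconsole

# Whether that run is ever recycled is decided on the director side by the
# Pool the job writes to (its Volume Retention / recycling settings), so
# keeping it forever means ensuring its volumes sit in a pool that is never
# recycled. Listing volumes shows which pool they ended up in.
echo "list volumes" | bconsole
```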