[00:00:13] T402247
[00:00:13] T402247: rsyslog is segfaulting non-stop on ms-be1071 - https://phabricator.wikimedia.org/T402247
[00:04:55] FIRING: [2x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:48] FIRING: PuppetFailure: Puppet has failed on ms-be1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:05:10] FIRING: [2x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:55] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:30:03] FIRING: PuppetFailure: Puppet has failed on ms-be1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:36:23] Amir1: good to go for codfw DC switchover of db2204?
[09:05:10] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:03] FIRING: PuppetFailure: Puppet has failed on ms-be1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:16:40] federico3: go for it
[10:45:49] flip done, running upgrade
[11:16:06] federico3: for when you have time, do you think you could pick up T401906?
[11:16:06] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906
[11:16:21] ok
[11:16:25] Thanks!
[12:09:55] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:55] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:30:03] FIRING: PuppetFailure: Puppet has failed on ms-be1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:29:17] federico3: hi. Why are you running my schema change? https://phabricator.wikimedia.org/T402010
[15:29:41] I asked about https://phabricator.wikimedia.org/T401906
[15:32:33] Please stop it since I'm running it and it will cause all sorts of issues
[15:34:53] oh sorry, stopping it
[15:43:54] Amir1: if you want to try ./schema_change_helper.py it should be in a basic good shape and provide locking on sections being updated
[15:46:58] can I run a change on s4 eqiad?
[15:51:18] Which one?
[15:53:32] 2025/add_cl_timestamp_id_T399249.py
[15:57:18] Hi all, hope you are all doing well. Just FYI - I need a little more time than I thought to settle down my daughter before I fly back home tomorrow evening. I will be taking PTO tomorrow (Wednesday). I will have to cancel/move my sync meetings for tomorrow. Please feel free to message me on IRC or Slack if there is anything urgent.
[16:02:48] federico3: isn't it already done there? I think it is except the master
[16:03:29] And we can't run it on master live. Needs a switchover
[16:03:46] Which I'm planning to do soon
[16:04:25] according to the check it's not done
[16:05:00] Then definitely
[16:05:22] ok
[16:06:09] kavitha: have fun and hope you have a nice flight
[16:06:52] Amir1: thank you!
[16:08:54] Amir1: I thought you were going afk?
[16:11:23] On phone 😁
[16:11:36] So can't ssh into ms-be1071
[16:11:45] But can annoy you all
[16:15:10] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:33] not annoyed, just curious... and wanting to make sure I'm not an 'extra cook', as I wade into this swift thing
[16:17:07] (which is all confusing enough as it is...)
[16:18:25] the FermMSS alerts seem to have stopped firing (I guess it's expected that those are only logged to #-traffic?)
[16:30:01] I'm guessing that whatever that was, it's probably unrelated to whatever is going on with ms-be1071
[16:49:55] FIRING: [3x] SystemdUnitFailed: rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:57:46] A.mir1: Ok, rsyslog is running again, Puppet is happy (I expect the alerts will clear momentarily), so the only thing remaining is that sick disk (`sdg`). I'll open a separate ticket for that momentarily.
[17:09:49] RESOLVED: PuppetFailure: Puppet has failed on ms-be1071:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:04:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:01:12] I created T402346 so that hopefully we can start the process of determining whether or not we have a spare drive, and getting one ordered if not.
[20:01:13] T402346: hw troubleshooting: disk (sdg) errors on ms-be1071 - https://phabricator.wikimedia.org/T402346