[01:29:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db2209:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[01:29:18] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2209:9104 has too large replication lag (12h 12m 9s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[05:29:05] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db2209:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:29:18] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2209:9104 has too large replication lag (16h 12m 9s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[10:38:01] marostegui: hi, let me know once you're done with T385084. I need to optimize the text table there
[10:38:01] T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084
[10:38:29] I will!
[10:38:44] Thanks <3
[10:56:52] dhinus: Any chance you can upgrade clouddb1017 today? thanks!
[11:03:41] PROBLEM - MariaDB sustained replica lag on es7 on es2040 is CRITICAL: 188.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2040&var-port=9104
[11:04:25] FIRING: SystemdUnitFailed: ferm.service on es2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:41] RECOVERY - MariaDB sustained replica lag on es7 on es2040 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2040&var-port=9104
[11:09:25] RESOLVED: SystemdUnitFailed: ferm.service on es2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
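The db2209 and es2040 alerts above both link the MariaDB troubleshooting runbook. A minimal sketch of a first-pass check on the lagged replica, assuming local socket access and the default pt-heartbeat schema (heartbeat.heartbeat); this is illustrative only and not necessarily the query the exporter itself uses:

    # Replication thread state and last errors on db2209 (host name taken from the alert).
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_.*Error'
    # Lag as seen through pt-heartbeat (assumes the standard heartbeat.heartbeat table).
    sudo mysql -e "SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds FROM heartbeat.heartbeat;"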
[11:29:11] Emperor: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114015, I'll disable puppet on ms-fe nodes and proceed on ms-fe1009 manually to see that everything goes as expected, is it a good time now? :)
[11:30:29] 👀
[11:31:17] vgutierrez: do go ahead, but could you pick a different first-victim node, please? ms-fe1009 is the stats-reporter-host so a little "special"
[11:34:07] Emperor: of course, ms-fe1014 sounds good?
[11:35:22] yes, that should be fine
[11:35:56] thanks :D
[11:41:09] Emperor: puppet finished as expected on ms-fe1014, pybal healthchecks were happy during the whole process
[11:41:50] puppet re-enabled on ms-fe*
[11:42:17] Cool, thanks.
[12:00:26] who do you contact for analytics dbs, marostegui?
[12:00:58] I would like to flag T385565 to someone on D.E.
[12:00:59] T385565: Some analytics_meta databases are not being backed up - https://phabricator.wikimedia.org/T385565
[12:06:20] jynus: normally btullis
[12:07:51] Looking now. Thanks for the ping.
[12:07:57] not urgent
[12:08:08] it was more of an FYI for when you have the time
[12:08:30] important but not urgent
[12:13:19] I will be applying the grant changes on es hosts next, last host to review
[12:18:18] A:db-section-es1 to A:db-section-es7 doesn't seem to work
[12:18:33] it selects 0 hosts
[12:22:02] I used 'A:db-core and P{es*}' instead
[12:26:10] That's so weird :-/
[12:26:41] it is not important for me, but do you want me to file a ticket?
[12:27:20] yeah please
[12:28:07] I am wondering if that ever worked though
[12:28:27] I don't know, but I would guess so; puppet just changed at some point
[12:28:52] what's weird is that usually it alerts if an alias has 0 hosts
[12:29:12] I am not seeing a definition for that in regex.yaml
[12:29:14] That's why I am asking
[12:29:31] we have extstorage_eqiad
[12:29:51] so I grepped /etc/cumin/aliases.yaml
[12:29:59] some of those are generated programmatically
[12:30:31] and if it didn't exist, it would complain, so it's not a typo
[12:30:38] yeah, but there's not one there either
[12:31:30] Anyway, please create a task and we can check!
[12:31:52] the db-section-esN aliases work for me
[12:32:39] I literally get "No hosts found that matches the query"
[12:33:04] I see, I was missing the A:
[12:33:13] so my mistake
[12:34:00] :)
[12:38:06] I will drop es4 and es5 dump grants and update es6 and es7 ones
[13:12:23] I think I got disconnected
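Closing the loop on the alias discussion above: the empty result came from running the query without the A: alias prefix. A minimal sketch of both selection styles, assuming standard Cumin CLI syntax and that the db-section-esN aliases are defined in /etc/cumin/aliases.yaml; the 'uptime' command is just a placeholder:

    # Alias-backed selection (returns hosts once the A: prefix is included).
    sudo cumin 'A:db-section-es7' 'uptime'
    # Equivalent PuppetDB-backed selection, as used above while the alias appeared broken.
    sudo cumin 'A:db-core and P{es*}' 'uptime'
    # Check whether the alias is actually defined.
    grep 'db-section-es' /etc/cumin/aliases.yaml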
[17:30:51] [non-urgent] hello data persistence - any objections / concerns if I release conftool on Wednesday, some time between 15:00 and 16:00 UTC? this would enable the pooled parsercache sections safety check for T383324.
[17:30:52] T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[19:37:25] FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:25] RESOLVED: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:40] FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:40] RESOLVED: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:25] FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:29] https://grafana.wikimedia.org/d/000000378/ladsgroup-test?from=now-90d&orgId=1&to=now&viewPanel=26
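For the flapping systemd-journal-flush.service alerts on ms-be2075 above, a minimal sketch of a first look using standard systemd tooling, in line with the linked check_systemd_state runbook; the host name comes from the alert:

    # Inspect the failed unit and its recent log entries on ms-be2075.
    sudo systemctl status systemd-journal-flush.service
    sudo journalctl -u systemd-journal-flush.service --since "2 hours ago"
    # After the underlying issue is fixed, clear the failed state so the check stops firing.
    sudo systemctl reset-failed systemd-journal-flush.service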