[01:29:05] <jinxer-wm>	 FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db2209:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[01:29:18] <jinxer-wm>	 FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2209:9104 has too large replication lag (12h 12m 9s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[05:29:05] <jinxer-wm>	 FIRING: MysqlReplicationThreadCountTooLow: MySQL instance db2209:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:29:18] <jinxer-wm>	 FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2209:9104 has too large replication lag (16h 12m 9s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2209&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[10:38:01] <Amir1>	 marostegui: hi, let me know once you're done with T385084. I need to optimize text table there
[10:38:01] <stashbot>	 T385084: Upgrade and rebuild s2 - https://phabricator.wikimedia.org/T385084
[10:38:29] <marostegui>	 I will!
[10:38:44] <Amir1>	 Thanks <3
[10:56:52] <marostegui>	 dhinus: Any chances you can upgrade clouddb1017 today? thanks!
[11:03:41] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on es7 on es2040 is CRITICAL: 188.5 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2040&var-port=9104
[11:04:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on es2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:41] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on es7 on es2040 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es2040&var-port=9104
[11:09:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ferm.service on es2040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:11] <vgutierrez>	 Emperor: I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114015, I'll disable puppet on ms-fe nodes and proceed on ms-fe1009 manually to see that everything goes as expected, is it a good time now? :)
[11:30:29] <Emperor>	 👀
[11:31:17] <Emperor>	 vgutierrez: do go ahead, but could you pick a different first-victim node, please? ms-fe1009 is the stats-reporter-host so a little "special"
[11:34:07] <vgutierrez>	 Emperor: of course, ms-fe1014 sounds good?
[11:35:22] <Emperor>	 yes, that should be fine
[11:35:56] <vgutierrez>	 thanks :D
[11:41:09] <vgutierrez>	 Emperor: puppet finished as expected on ms-fe1014, pybal healthchecks were happy during the whole process
[11:41:50] <vgutierrez>	 puppet re-enabled on ms-fe*
[11:42:17] <Emperor>	 Cool, thanks.
[12:00:26] <jynus>	 who do you contact for analytics dbs, marostegui ?
[12:00:58] <jynus>	 I would like to flag T385565 to someone on D.E.
[12:00:59] <stashbot>	 T385565: Some analytics_meta databases are not being backed up - https://phabricator.wikimedia.org/T385565
[12:06:20] <marostegui>	  jynus normally btullis 
[12:07:51] <btullis>	 Looking now. Thanks for the ping. 
[12:07:57] <jynus>	 not urgent
[12:08:08] <jynus>	 it was more of a FYI when you had the time
[12:08:30] <jynus>	 important but not urgent
[12:13:19] <jynus>	 I will be applying the grant changes on es hosts next, last host to review
[12:18:18] <jynus>	 A:db-section-es1 to A:db-section-es7 doesn't seem to work
[12:18:33] <jynus>	 it selects 0 hosts
[12:22:02] <jynus>	 I used 'A:db-core and P{es*}' instead
[12:26:10] <marostegui>	  That's so weird :-/
[12:26:41] <jynus>	 it is not important for me, but do you want me to file a ticket?
[12:27:20] <marostegui>	 yeah please
[12:28:07] <marostegui>	 I am wondering if that ever worked though
[12:28:27] <jynus>	 I don't know, but I would guess so, just puppet changed at some point
[12:28:52] <jynus>	 what's weird is that usually it alerts if an alias has 0 hosts
[12:29:12] <marostegui>	 I am not seeing a definitiion for that in regex.yaml
[12:29:14] <marostegui>	 That's why I am asking
[12:29:31] <marostegui>	 we have extstorage_eqiad
[12:29:51] <jynus>	 so I grepped /etc/cumin/aliases.yaml
[12:29:59] <jynus>	 some of those are generated programatically
[12:30:31] <jynus>	 and if it would not exist, it would complain, so not a typo
[12:30:38] <marostegui>	 yeah, but there not one there either
[12:31:30] <marostegui>	 Anyway, please create a task and we can check!
[12:31:52] <volans>	 the db-section-esN aliases works for me
[12:32:39] <jynus>	 I literally get "No hosts found that matches the query"
[12:33:04] <jynus>	 I see, I was missing the A:
[12:33:13] <jynus>	 so my mistake
[12:34:00] <volans>	 :)
[12:38:06] <jynus>	 I will drop es4 and es5 dump grants and update es6 and es7 ones
[13:12:23] <jynus>	 I think I got disconnected
[17:30:51] <swfrench-wmf>	 [non-urgent] hello data persistence - any objections / concerns if I release conftool on Wednesday, some time between 15:00 and 16:00 UTC? this would enable the pooled parsercache sections safety check for T383324.
[17:30:52] <stashbot>	 T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[19:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:17:40] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: systemd-journal-flush.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:29] <Amir1>	 https://grafana.wikimedia.org/d/000000378/ladsgroup-test?from=now-90d&orgId=1&to=now&viewPanel=26