[01:13:28] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:23:46] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:16] <marostegui>	 I am switching pc1 master now
[06:47:05] <jynus>	 I think specially logging when/if doing pc2 and pc3 will be useful, as I think it is when cache misses increase temporarilly
[06:47:27] <marostegui>	 yeah, but I am planniung to move move the spare to pc2 and pc3 for a few days
[06:47:34] <marostegui>	 So the misses are less impacting :)
[06:47:43] <jynus>	 last time I saw weirdness on graphs until I realized it was maintenance
[07:14:22] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:00:24] <moritzm>	 would the Wednesday in two days be a good time to reboot cumin1001 wrt DBA/backup activity?
[08:01:31] <marostegui>	 works for me, Amir1 jynus ^
[08:01:54] <jynus>	 wednesday but shouldn't be early
[08:02:09] <jynus>	 let me give you an approximate hour
[08:06:33] <moritzm>	 ack, just let me know what works best for you
[08:06:47] <jynus>	 sorry, got distracted
[08:06:55] <jynus>	 while looking at current schedule
[08:08:34] <jynus>	 moritzm: would starting from 7hUTC + be ok? backups should be done by then
[08:08:46] <jynus>	 that's 9am CEST
[08:09:54] <jynus>	 normally backups finish around 6am UTC, to be ready for marostegui :-) (+ buffer, just in case)
[08:10:05] <marostegui>	 XDD
[08:10:36] <moritzm>	 sounds good to me, I'll throw if another half hour of grace period and send a mail for 7:30 UTC to ops@, then
[08:10:41] <moritzm>	 "throw in"
[08:13:02] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:16:21] <jynus>	 yeah, no issue from then until 19UTC, when the next backup batch would start
[08:16:38] <jynus>	 and any operation needed could use cumin2002
[08:18:18] <moritzm>	 ack!
[08:23:13] <marostegui>	 I am going to truncate pc1014 and move it to pc2
[09:01:09] <Amir1>	 moritzm: I keep it in mind, I have some stuff running 
[09:01:37] <Amir1>	 the revision alter in s1 is the most complicated one that I'm not sure will be done b then
[09:03:44] <moritzm>	 do you have an idea when the revision alter for s1 is complete? we can easily postpone to a better time?
[09:08:10] <marostegui>	 Amir1: Any chances it can be run from cumin2002?
[09:08:16] <marostegui>	 Once it has finished the current host?
[09:10:55] <Amir1>	 marostegui: yeah but the chance finding out the time in between is low because it takes a day to do the alter and one hour to repool it
[09:11:11] <marostegui>	 yeah I know :(
[09:11:23] <Amir1>	 moritzm: it is mostly done, let me run a check to see how much left
[09:12:15] <moritzm>	 ok
[09:12:59] <Amir1>	  ["db1169", "db1184", "db1128", "db1132"] needed in four dbs, assuming it takes a day to finish. It should be done by end of the week
[09:13:46] <marostegui>	 Amir1: you can do db1132 as we speak, it is depooled and will remain depooled until we have figured this out  https://phabricator.wikimedia.org/T311106
[09:14:05] <Amir1>	 awesome
[09:16:07] <Amir1>	 started
[09:18:07] <Amir1>	 FWIW my availability tomorrow will be limited, I have several appointments tomorrow 
[09:37:14] <Amir1>	 marostegui: https://logstash.wikimedia.org/goto/bf4485e22d2105ddc2932bfe61d8ef15 ^^
[09:38:16] <marostegui>	 uh? what happened?
[09:38:33] <marostegui>	 I mean, what was pushed that fixed it?
[10:01:38] <Amir1>	 marostegui: I did a couple of fixes T306636
[10:01:38] <stashbot>	 T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db])Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636
[10:07:35] <volans>	 jynus: I didn't got the time to add that AAAA record for backup2002 in the end last week. Let me know when is a good time to add it (again, no hurry at all)
[11:01:53] <jynus>	 any time is ok
[11:02:20] <volans>	 ack, thx
[11:02:21] <moritzm>	 Amir1: can you keep me posted when the "revision alter for s1" is done, then I'll sync up a new date
[11:02:39] <Amir1>	 moritzm: sgtm, sorry for this
[11:03:16] <moritzm>	 not at all, always glad to see changes on the DBs :-)
[11:26:45] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:13:39] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:05] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:59:16] <jinxer-wm>	 (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104)  - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[16:00:43] <jynus>	 oh, not again :-(
[16:13:53] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:59:41] <mutante>	 w
[17:17:26] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:20:52] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:59:31] <jinxer-wm>	 (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104)  - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[20:15:26] <btullis>	 I'm checking and restarting the prometheus-mysqld-exporter on an-coord1001 again. Not sure yet why it failed again.
[20:25:17] <mutante>	 ack,thx
[21:17:37] <icinga-wm>	 RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:16:33] <icinga-wm>	 PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers