[01:13:28] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:23:46] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:16] I am switching pc1 master now
[06:47:05] I think especially logging when/if doing pc2 and pc3 will be useful, as I think it is when cache misses increase temporarily
[06:47:27] yeah, but I am planning to move the spare to pc2 and pc3 for a few days
[06:47:34] So the misses are less impactful :)
[06:47:43] last time I saw weirdness on graphs until I realized it was maintenance
[07:14:22] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:00:24] would Wednesday (in two days) be a good time to reboot cumin1001 wrt DBA/backup activity?
[08:01:31] works for me, Amir1 jynus ^
[08:01:54] Wednesday, but it shouldn't be early
[08:02:09] let me give you an approximate hour
[08:06:33] ack, just let me know what works best for you
[08:06:47] sorry, got distracted
[08:06:55] while looking at the current schedule
[08:08:34] moritzm: would starting from 07:00 UTC or later be ok? backups should be done by then
[08:08:46] that's 9am CEST
[08:09:54] normally backups finish around 6am UTC, to be ready for marostegui :-) (+ buffer, just in case)
[08:10:05] XDD
[08:10:36] sounds good to me, I'll throw in another half hour of grace period and send a mail for 7:30 UTC to ops@, then
[08:13:02] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:16:21] yeah, no issue from then until 19:00 UTC, when the next backup batch would start
[08:16:38] and any operation needed could use cumin2002
[08:18:18] ack!
[08:23:13] I am going to truncate pc1014 and move it to pc2
[09:01:09] moritzm: I'll keep it in mind, I have some stuff running
[09:01:37] the revision alter in s1 is the most complicated one that I'm not sure will be done by then
[09:03:44] do you have an idea when the revision alter for s1 will be complete? we can easily postpone to a better time
[09:08:10] Amir1: Any chance it can be run from cumin2002?
[09:08:16] Once it has finished the current host?
[09:10:55] marostegui: yeah, but the chance of finding time in between is low, because it takes a day to do the alter and one hour to repool it
[09:11:11] yeah I know :(
[09:11:23] moritzm: it is mostly done, let me run a check to see how much is left
[09:12:15] ok
[09:12:59] ["db1169", "db1184", "db1128", "db1132"] - still needed on four dbs; assuming a day each to finish, it should be done by the end of the week
[09:13:46] Amir1: you can do db1132 as we speak, it is depooled and will remain depooled until we have figured this out https://phabricator.wikimedia.org/T311106
[09:14:05] awesome
[09:16:07] started
[09:18:07] FWIW my availability tomorrow will be limited, I have several appointments
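A minimal sketch, assuming pymysql and placeholder credentials, of how the remaining s1 replicas could be checked for a pending schema change by querying information_schema on each host. The host list comes from the conversation above; the schema, table, and column names are placeholders, since the actual ALTER being applied is not named in the log.

    import pymysql

    HOSTS = ["db1169", "db1184", "db1128", "db1132"]   # replicas still pending, per the log
    SCHEMA, TABLE = "enwiki", "revision"               # s1 / revision table
    COLUMN = "rev_example_col"                         # placeholder column name

    def has_column(host: str) -> bool:
        # Ask information_schema whether the new column already exists on this replica.
        conn = pymysql.connect(host=host, user="check", password="...", database="information_schema")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT COUNT(*) FROM COLUMNS"
                    " WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s AND COLUMN_NAME = %s",
                    (SCHEMA, TABLE, COLUMN),
                )
                return cur.fetchone()[0] > 0
        finally:
            conn.close()

    for host in HOSTS:
        print(f"{host}: {'done' if has_column(host) else 'pending'}")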
[09:37:14] marostegui: https://logstash.wikimedia.org/goto/bf4485e22d2105ddc2932bfe61d8ef15 ^^
[09:38:16] uh? what happened?
[09:38:33] I mean, what was pushed that fixed it?
[10:01:38] marostegui: I did a couple of fixes T306636
[10:01:38] T306636: UserOptionsManager: DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction ([db]) Function: MediaWiki\User\UserOptionsManager::saveOptionsInternalQuery - https://phabricator.wikimedia.org/T306636
[10:07:35] jynus: I didn't get the time to add that AAAA record for backup2002 in the end last week. Let me know when would be a good time to add it (again, no hurry at all)
[11:01:53] any time is ok
[11:02:20] ack, thx
[11:02:21] Amir1: can you keep me posted when the "revision alter for s1" is done? then I'll sync up a new date
[11:02:39] moritzm: sgtm, sorry for this
[11:03:16] not at all, always glad to see changes on the DBs :-)
[11:26:45] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:13:39] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:10:05] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:59:16] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[16:00:43] oh, not again :-(
[16:13:53] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:59:41] w
[17:17:26] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:20:52] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:59:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[20:15:26] I'm checking and restarting the prometheus-mysqld-exporter on an-coord1001 again. Not sure yet why it failed again.
[20:25:17] ack, thx
[21:17:37] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:16:33] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
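A small sketch of how the prometheus-mysqld-exporter restart on an-coord1001 could be sanity-checked by hand afterwards: fetch /metrics from the exporter port named in the alert and look for the mysql_up gauge. This is only an illustrative probe, not part of the alerting pipeline, and the exact metric line it matches is an assumption about the exporter's output.

    import urllib.request

    URL = "http://an-coord1001:9104/metrics"   # host:port taken from the alert above

    def exporter_ok(url: str = URL, timeout: float = 5.0) -> bool:
        # Fetch the exporter's metrics page; unreachable means the unit is still down.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError as exc:
            print(f"exporter unreachable: {exc}")
            return False
        # mysqld_exporter is assumed to report "mysql_up 1" when it can talk to mysqld
        return "mysql_up 1" in body

    print("exporter OK" if exporter_ok() else "exporter still failing")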