[02:17:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:53:55] Amir1: I am going to start with s4 with pagelinks and templatelinks
[05:55:59] Started with codfw for now
[06:05:21] maaybe i could be removed from the deployment calendar entries? :)
[06:17:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:30:14] kormat: will do
[06:31:53] ty :D
[08:47:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:52] (that alert isn't very useful)
[09:03:02] PROBLEM - MariaDB sustained replica lag on s2 on db2125 is CRITICAL: 86.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104
[09:04:02] RECOVERY - MariaDB sustained replica lag on s2 on db2125 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2125&var-port=9104
[09:05:56] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 137.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[09:06:00] PROBLEM - MariaDB sustained replica lag on s2 on db2148 is CRITICAL: 119.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[09:06:02] PROBLEM - MariaDB sustained replica lag on s2 on db2138 is CRITICAL: 82.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2138&var-port=9104
[09:07:00] RECOVERY - MariaDB sustained replica lag on s2 on db2148 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2148&var-port=9104
[09:07:02] RECOVERY - MariaDB sustained replica lag on s2 on db2138 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2138&var-port=9104
[09:07:56] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[09:11:02] PROBLEM - MariaDB sustained replica lag on s2 on db2204 is CRITICAL: 155.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2204&var-port=9104
[09:11:43] * arnaudb looks at his next OKR with impatience
[09:12:01] arnaudb: These have been real issues, not just noise
[09:12:02] RECOVERY - MariaDB sustained replica lag on s2 on db2204 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2204&var-port=9104
[11:47:38] rebooting
[12:49:21] altering templatelinks in s4 can easily take 1 month
[12:56:47] per instance? 😱
[12:57:44] no no, in total
[12:57:58] it is going to take around 24h for each host and we have a bunch
[12:57:58] ah, still huge but less frightening ^^
[16:52:24] Amir1: https://phabricator.wikimedia.org/T364069#9838869
[16:52:49] 100GB? nice and not nice
[17:00:24] > zhwiki: Completed normalization of pagelinks, 119398845 rows updated.
[17:41:38] Amir1: can I do s2 codfw switchover tomorrow then?
[17:42:00] Started running the script, I don't know if it finishes by tomorrow honestly
[17:42:21] ah sorry I got confused cause I saw you marked s2 as done
[17:43:11] yeah, I have to change the pk there, I put codfw first, so it should get there soon. Depends on how long it takes, let me give you an estimate after first replica is done
[17:43:34] yeah no problem
[17:43:39] not in a rush