[04:49:37] PROBLEM - MariaDB sustained replica lag on s5 on db1213 is CRITICAL: 1765 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[04:52:37] RECOVERY - MariaDB sustained replica lag on s5 on db1213 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1213&var-port=9104
[08:18:15] Morning :) I come looking for +1s again, please. This is a workaround for an upstream packaging bug that gets us the relevant ceph manager module for the rgw setup (for apus): https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1038223
[08:23:28] and done!
[08:23:41] TY :)
[10:07:26] regarding the s2 codfw switchover: I'm done with my schema change. Feel free to do the schema change
[10:21:54] I always forget how long running build-production-images takes :-/
[11:17:35] Emperor: You can use docker-pkg directly: /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build --select "*ceph*" images to build only your image rather than rebuilding everything
[11:21:31] claime: useful to know; would it be reasonable to note that under https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images ?
[11:32:35] Emperor: Sure, why not
[11:55:31] I thought we periodically rebuilt everything anyway? I'm a bit surprised it always seems to need to rebuild quite so many images whenever I have a ceph update to do...
[11:59:39] We rebuild everything on Sundays
[12:19:42] so why do so many need rebuilding come Monday, then?
[12:26:12] (a number of which have version numbers suggesting "yesterday")
[13:57:01] the other thing I looked at is a potential regression in the newer kernel preventing the ext4 resize, but I am not sure it is that, as I would have to cross-check exact kernel versions
[13:57:34] in any case, I think I will wipe the fs and recreate it anyway
[14:28:39] woo, production-images build finished
[14:31:20] 🎉
[15:15:58] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1038372 <-- another +1 request, please? need systemctl available inside the container for some OSD operations
[15:18:51] weird to see systemctl in a container :D
[15:21:37] I think ceph-volume lvm zap is using it to check which devices are associated with which OSD
[16:03:10] checking db2212
[16:06:20] host is not pooled, marostegui Amir1. I'll have to leave pretty soon and I'm not sure I'll be able to debug it. It's the s1 candidate master
[16:06:42] I've tried to restart replication → failed; restarting with semi_sync disabled → failed as well
[16:07:04] seems stuck on `Slave_SQL_Running_State: Waiting for table metadata lock`
[16:07:25] arnaudb: did you run show processlist as we discussed a few days ago?
[16:07:37] oh no, I forgot
[16:07:53] :(
[16:08:47] the pattern seemed off that track, marostegui, as it fired an alert, so I did not think of a schema change, sorry :-(
[16:09:48] That Prometheus alert... :(
[16:55:56] I've now worked out why my upgrade isn't working: every mgr needs ssh access to its targets. Anyone for a late +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038391 please? PCC is still buggy here.
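
For reference, the selective-build command claime gives at 11:17, reformatted; only the "*ceph*" glob needs to vary per image (a sketch, paths exactly as stated in the log):

    # Build only the images whose name matches a glob, instead of
    # rebuilding the full production-images set
    /srv/deployment/docker-pkg/venv/bin/docker-pkg \
        -c /etc/production-images/config.yaml \
        build --select "*ceph*" images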
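
For the 15:21 point about ceph-volume, a hedged sketch of the zap operation in question; the device path and OSD id below are illustrative, not taken from the log:

    # Wipe LVM metadata and data from a device before reusing it for an OSD;
    # --destroy also removes the underlying LVM volumes and partition table
    ceph-volume lvm zap --destroy /dev/sdb

    # Zapping by OSD id instead requires looking up which devices belong to
    # that OSD, which (per the log) is the step that wants systemctl
    ceph-volume lvm zap --osd-id 7 --destroy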
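
A minimal sketch of the check marostegui asks about at 16:07, assuming shell access to the stuck replica; the thread id in the KILL is a placeholder:

    # Confirm the SQL thread state and the lag on the replica (db2212)
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_SQL_Running_State|Seconds_Behind_Master'

    # Find what holds the table metadata lock: typically a long-running
    # ALTER TABLE (a schema change) or an idle transaction with the table open
    sudo mysql -e "SHOW FULL PROCESSLIST;"

    # After identifying the blocking thread, kill it so the SQL thread
    # can resume applying events
    sudo mysql -e "KILL 12345;"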
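
For the 13:57 ext4 resize issue, the usual online-grow step, as a sketch with an illustrative device name:

    # Online-grow an ext4 filesystem to fill its already-enlarged block device;
    # if this fails on the newer kernel, wiping and recreating the fs
    # (as planned in the log) is the fallback
    sudo resize2fs /dev/mapper/vg0-data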