[09:07:04] I made a mistake yesterday when pooling db2245 (this is the fix: https://phabricator.wikimedia.org/P84219), but it is nice to see that the HW had no issues handling it: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-24h&to=now&timezone=utc&var-job=$__all&var-server=db2245&var-port=9104&refresh=1m
[09:12:29] did the server really receive 10x more traffic? Looking at the charts I can't spot a 10x bump
[09:16:09] federico3: it did https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-10-21T09:29:16.112Z&to=2025-10-21T20:20:02.266Z&timezone=utc&var-job=$__all&var-server=db2245&var-port=9104&refresh=1m&viewPanel=panel-16
[09:21:34] I noticed that, but it looks more like a 5x jump
[09:22:24] Amir1 has been planning to make the load balancer smarter / adaptive, so I'm curious
[09:23:37] Load monitor absorbs some of it, yes
[09:24:21] I can find the graph of the "real" weight adjusted based on load monitor checks
[09:26:53] uh? is the LB generating metrics in prometheus?
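[Editor's note: the exchange above is about the rdbms load balancer's load monitor partially absorbing a mis-set pooling weight, which is why the traffic bump looked like ~5x instead of 10x. The sketch below is a minimal, hypothetical model of that idea, not MediaWiki's actual LoadMonitor implementation; all host names, weights and connection numbers are made up for illustration.]

```python
# Toy model of load-adjusted replica weights: static configured weights are
# scaled down by an observed-load factor before a replica is picked, so a
# host pooled with too high a weight does not get the full raw share.

import random

# Hypothetical static config: host -> configured weight (example values only).
CONFIGURED_WEIGHTS = {
    "db2245": 200,   # accidentally pooled with double the intended weight
    "db2101": 100,
    "db2102": 100,
}


def load_factor(current_connections: int, max_connections: int = 500) -> float:
    """Return a 0..1 multiplier: the busier the host, the smaller the factor."""
    utilization = min(current_connections / max_connections, 1.0)
    return max(1.0 - utilization, 0.1)  # never drop a healthy host to zero


def effective_weights(connection_counts: dict[str, int]) -> dict[str, float]:
    """Scale static weights by a per-host load factor (the "real" weights)."""
    return {
        host: CONFIGURED_WEIGHTS[host] * load_factor(conns)
        for host, conns in connection_counts.items()
    }


def pick_replica(connection_counts: dict[str, int]) -> str:
    """Weighted random choice over the load-adjusted weights."""
    weights = effective_weights(connection_counts)
    hosts = list(weights)
    return random.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


if __name__ == "__main__":
    # db2245 is the busiest host, so its effective share shrinks even though
    # its configured weight is twice the others'.
    observed = {"db2245": 400, "db2101": 100, "db2102": 120}
    print(effective_weights(observed))
    print(pick_replica(observed))
```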
[09:28:20] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198005 please? Remove two now-drained nodes so they can get their disk controllers swapped
[09:52:59] done
[10:00:25] I am thinking, now that transfer.py has been battle-tested on cumin1003, to use cumin1002 for production testing of the new version, but only if people here can migrate their transfers to cumin1003 (codfw will be untouched)
[10:05:16] I am fine with that
[10:08:36] no objection here
[10:23:16] I'm sure it used to be in https://grafana.wikimedia.org/d/G9kbQdRVz/mediawiki-rdbms-loadbalancer?orgId=1&from=now-24h&to=now&timezone=utc&var-section=s4 but I think it's removed now
[10:34:38] Amir1: ah, TIL
[10:38:28] Thank you, people. No matter how much testing I am doing now on development, I think more stuff will happen when I try things on production
[10:50:14] zabe: Thanks!
[10:50:40] yw
[12:34:45] o/
[12:35:00] hi guys
[12:36:41] hi kwakuofori
[12:36:54] hey Emperor
[12:37:00] hi kwakuofori
[12:37:07] hey marostegui
[12:38:03] catching up on work, so do not hesitate to flag stuff I need to be aware of or that you need help with
[12:39:37] welcome back
[12:42:08] kwakuofori: just checking we're not having a 121 today? It was uncancelled and now cancelled again (this is fine, just don't want to stand you up :) )
[12:42:41] thanks, jynus
[12:43:01] Emperor, yeah, minimal 121s this week
[12:47:08] 👍
[13:33:25] FIRING: SystemdUnitFailed: mariadb.service on db1300:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:36:29] that is me
[13:38:25] RESOLVED: SystemdUnitFailed: mariadb.service on db1300:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:43:25] FIRING: SystemdUnitFailed: mariadb.service on db1300:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:35] yup, it's still me
[14:40:00] this would fix it
[14:40:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198077
[14:53:27] the resolve should be coming soon
[14:58:25] RESOLVED: SystemdUnitFailed: mariadb.service on db1300:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:04:11] I may have deleted the es2057 checksums on cumin1003, thinking they were mine
[18:04:17] I have recreated them
[18:04:44] but I cannot be 100% sure the checksum will not fail
[18:04:52] the data transfer shouldn't be affected, though
[18:06:21] and most likely it shouldn't affect the checksums either, but just in case
[20:40:37] jynus: if needed I can redo the transfer
[20:41:24] (so far it's still running)
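[Editor's note: since the open question above is whether the recreated es2057 checksums will still match once the transfer finishes, here is a minimal, generic verification sketch. It is not transfer.py's own checksum logic; the paths and the checksum-file format (one "HASH  FILENAME" line per file, as produced by sha256sum) are assumptions for illustration.]

```python
# Generic post-transfer verification sketch: recompute digests for the
# transferred files and compare them against a previously recorded list.

import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through sha256 so large dumps don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(checksum_file: Path, data_dir: Path) -> bool:
    """Compare every recorded digest against a freshly computed one."""
    ok = True
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        actual = file_digest(data_dir / name.strip())
        if actual != expected:
            print(f"MISMATCH: {name} expected {expected} got {actual}")
            ok = False
    return ok


if __name__ == "__main__":
    # Placeholder paths -- adjust to wherever the transferred files and the
    # recreated checksum list actually live.
    print(verify(Path("/srv/es2057.sha256"), Path("/srv/es2057")))
```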