[05:37:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:47] I am going to switch s8 codfw master
[07:35:03] s6, sorry
[09:31:30] marostegui: let me know if/when you're done with the old primary
[09:31:36] 🥺
[09:31:42] Amir1: it is the dc master
[09:32:16] yeah, codfw? I want to run pagelinks schema change on the old master
[09:32:25] by primary I meant master :D
[09:32:47] Amir1: I was repooling it, let me depool and then you take care of repooling?
[09:32:56] sure thing!
[09:33:12] Amir1: db2129 depooled
[09:33:22] thank you
[09:37:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:25] FIRING: [2x] SystemdUnitFailed: mariadb.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:40] ^ me
[10:18:06] should recover soon
[10:18:07] Loaded: loaded (/lib/systemd/system/mariadb.service; disabled; preset: enabled)
[10:18:07] Active: active (running) since Mon 2024-05-27 10:13:36 UTC; 4min 22s ago
[11:51:21] arnaudb: who will handle the es hosts on those two switches tasks you created?
[12:14:49] FIRING: PuppetFailure: Puppet has failed on db1125:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:16:00] marostegui: idk how it's handled, are you saying that I should create a task per maintenance instead of assigning dp servers to their service owners?
[12:16:13] arnaudb: what do you mean?
[12:16:23] arnaudb: es are handled by us
[12:16:59] I mean, they are not handled by Emperor and I created dedicated tasks for the few ms hosts there were
[12:17:09] I was not sure how to proceed for the next ones so I asked
[12:17:19] arnaudb: But what about the rest of the hosts of those racks?
[12:17:40] I was going to create a bunch of phab tasks
[12:18:01] ah your idea is to separate hosts within data-persistence owners?
[12:18:06] it was indeed
[12:18:11] ah ok, got it
[12:18:22] so, my question was: is it overkill ? :)
[12:18:56] (the main goal was to avoid spam for service owners ^^)
[12:19:00] arnaudb: I don't really have any strong opinions really, whatever works best for you, eric and matthew - just wanted to make sure no hosts are forgotten
[12:19:10] ack!
[12:24:49] RESOLVED: PuppetFailure: Puppet has failed on db1125:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:17:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:13] flamegraph of time spent for db queries in mediawiki broken down by the query type: https://people.wikimedia.org/~ladsgroup/rdbms.svg
[14:50:14] e.g. 2.92% of our db queries are just fetching replag, 10.65% is wikidata term store look ups, 1.14% is for checking if users are blocked or not, etc. etc.
[14:50:14] Let me get for write queries
[15:00:05] This is juicy https://people.wikimedia.org/~ladsgroup/rdbms-write5.svg <- for write queries, it seems there is a lock contention
[15:00:10] *lot of lock
[18:17:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed