[02:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:43] that looks like rclone twice lost a race with an upload of an updated version.
[08:35:35] and in both cases, at least one of the stored images isn't correct :(
[08:39:26] I'll dig through that in a bit and open a ticket
[10:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:14] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:54] hello, does wikimedia have autocommit set to 0 or something similar to prevent a major cockup when somebody mistypes a sql statement when directly connecting to the database? asking because i couldn't grep for (relevant results of) "autocommit" in operations/puppet
[13:41:58] BlankEclair: We don't allow connections directly to the database. In any case, we cannot afford autocommit=0 due to performance
[13:42:37] Amir1: Are you doing some maintenance on s1 or can I go ahead?
[13:42:48] that's a much simpler answer, thanks ^^;
[13:42:58] BlankEclair: You are welcome
[13:43:08] i'm curious about the performance point
[13:44:53] BlankEclair: Mostly IO, but also, with setting it to 0 you have a very high risk of creating long-held transactions that block lots of other connections (and increase contention and hence deadlocks). In some other environments I've seen huge ibdata1 growth too, to hundreds of GBs
[13:45:35] Setting autocommit=1 is way better for performance (also keep in mind we use RAID controllers with BBU, which we run with a write-back policy)
[13:46:22] alright, thanks again ^_^
[13:47:18] No problem!
[15:05:23] marostegui: do you have any other suggestions for preventing people from doing major cockups while connected to the database?
[15:06:09] And yes, I ask because 2 people for us (full ops on Miraheze) have managed to cock up big time
[15:06:23] One with a bad delete and one with a bad update
[15:07:19] RhinosF1: I'd suggest not letting anything be run directly there, only via MW or whatever other software is in place, and (hopefully) with good measures to prevent issues (eg: do not allow deletes without a WHERE, LIMIT, etc)
[15:08:15] RhinosF1: Eventually having a good testing environment or CI that can catch common query errors (and of course progressive rollouts, backups, etc)
[15:08:47] marostegui: these were humans with full sudo writing bad queries and not checking what they had typed
[15:08:48] RhinosF1: It is impossible to stop human errors, but if you can limit them as much as possible, that's something. But the root user will always exist, of course
[15:09:03] RhinosF1: Why did they have sudo in the first place? :)
[15:09:30] marostegui: because they are miraheze's equivalent of wmf ops is the short answer
[15:09:57] RhinosF1: Still, no one should update the database just like that, without going through some proper software in the middle
[15:10:08] Ok fair
[15:10:12] What do wmf use?
[15:10:14] RhinosF1: I'd never ever touch any row in production
[15:10:17] RhinosF1: MW?
[15:10:26] marostegui: always use sql.php ?
[15:10:37] Does that auto-create transactions so you can roll back?
[15:11:18] RhinosF1: First, I simply don't do it, and second, it needs to go through whatever MW uses, because me updating a row in production probably doesn't trigger the million other things that a proper script going through MW does (eg: maybe a cache needs to be invalidated?)
[15:12:38] marostegui: oh right, so you suggest updating rows should always be done via a maint script or something then
[15:12:47] RhinosF1: Absolutely
[15:12:52] ah right
[15:13:07] RhinosF1: If we are talking about the admin part, that's different of course
[15:13:46] marostegui: one of these could have been done via a maint script
[15:14:04] One was someone who had the wrong thing on their clipboard when they wanted a select
[15:14:09] And pasted it in
[15:14:21] I find touching production data via the mysql CLI a total no-go
[15:14:25] Which could be managed by having a read-only tool for read-only stuff
[16:26:01] Amir1: don't suppose you've got some CFT to look at T381891 and T381893 please? Two losses from rclone losing races on Monday, rather like T380738 from the other week
[16:26:03] T381891: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891
[16:26:04] T381893: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893
[16:26:04] T380738: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738
[16:27:06] [presumably I could do the new uploads myself (with my volunteer account rather than staff?), but I feel like at least a second opinion first would be wise]
[16:30:29] and I now understand why this race-losing is resulting in truncated uploads, but this is all sadness and weeping
[17:23:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2158:9104 has too large replication lag (11m 20s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2158&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[17:33:40] ^ i fixed that already
[20:26:51] PROBLEM - MariaDB sustained replica lag on s3 on db1198 is CRITICAL: 48.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[20:32:51] RECOVERY - MariaDB sustained replica lag on s3 on db1198 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
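
To make the safeguards discussed between 15:07 and 15:14 concrete, here is a minimal SQL sketch of the kind of session-level guard rails a human could use when a direct CLI connection is unavoidable: safe-update mode to reject unkeyed DELETEs/UPDATEs, an explicit transaction that can be inspected and rolled back, and a read-only account for read-only work. The table, database, user, and host names are hypothetical illustrations, not WMF or Miraheze configuration; as marostegui says above, the preferred approach is to not touch rows directly at all and to go through MediaWiki maintenance scripts instead. None of this replaces that; it only reduces the blast radius of a mistyped or mis-pasted statement.

  -- Safe-update mode: the server rejects UPDATE/DELETE statements that have
  -- neither a key-based WHERE clause nor a LIMIT (available in MySQL and in
  -- recent MariaDB versions).
  SET SESSION sql_safe_updates = 1;
  -- With the guard on, a pasted "DELETE FROM page;" now errors out instead of
  -- emptying the table.

  -- For an unavoidable manual change, an explicit transaction leaves room to
  -- inspect the effect before it becomes permanent (global autocommit stays 1).
  START TRANSACTION;
  UPDATE page SET page_namespace = 0 WHERE page_id = 12345;  -- hypothetical row
  SELECT ROW_COUNT();   -- expect exactly 1 affected row
  ROLLBACK;             -- or COMMIT once the row count looks right

  -- A read-only account for "read-only stuff", so a stray paste cannot write.
  CREATE USER 'readonly'@'10.0.0.%' IDENTIFIED BY 'change-me';
  GRANT SELECT ON mywiki.* TO 'readonly'@'10.0.0.%';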