[02:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:33:43] that looks like rclone twice lost a race with an upload of an updated version.
[08:35:35] and in both cases, at least one of the stored images isn't correct :(
[08:39:26] I'll dig through that in a bit and open a ticket
[10:02:09] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:14] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be2069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:54] hello, does wikimedia have autocommit set to 0 or something similar to prevent a major cockup when somebody mistypes a sql statement when directly connecting to the database? asking because i couldn't grep for (relevant results of) "autocommit" in operations/puppet
[13:41:58] BlankEclair: We don't allow connections directly to the database. In any case, we cannot afford autocommit=0 due to performance
[13:42:37] Amir1: Are you doing some maintenance on s1 or can I go ahead?
[13:42:48] that's a much simpler answer, thanks ^^;
[13:42:58] BlankEclair: You are welcome
[13:43:08] i'm curious about the performance point
[13:44:53] BlankEclair: Mostly IO, but also, with setting it to 0 you have a very high risk of creating long-held transactions that block lots of other connections (and increase contention and hence deadlocks). In some other environments I've seen huge ibdata1 growth too, to hundreds of GBs
[13:45:35] Setting autocommit=1 is way better for performance (also keep in mind we use RAID controllers with BBU, which we run with a write-back policy)
[13:46:22] alright, thanks again ^_^
[13:47:18] No problem!
[15:05:23] marostegui: do you have any other suggestions for preventing people from doing major cockups while connected to the database?
[15:06:09] And yes, I ask because 2 people for us (full ops on Miraheze) have managed to cock up big time
[15:06:23] One with a bad delete and one with a bad update
[15:07:19] RhinosF1: I'd suggest not letting anything be run directly there, only via MW or whatever other software is in place, and (hopefully) with good measures to prevent issues (eg: do not allow deletes without a WHERE, LIMIT, etc)
[15:08:15] RhinosF1: Eventually having a good testing environment or CI that can catch common query errors (and of course progressive rollouts, backups, etc)
[15:08:47] marostegui: these were humans with full sudo writing bad queries and not checking what they had typed
[15:08:48] RhinosF1: It is impossible to stop human errors, but if you can limit them as much as possible, that's something. But the root user will always exist, of course
[15:09:03] RhinosF1: Why did they have sudo in the first place? :)
[15:09:30] marostegui: because they are miraheze's equivalent of wmf ops is the short answer
[15:09:57] RhinosF1: Still, no one should update the database just like that, without going through some proper software in the middle
[15:10:08] Ok fair
[15:10:12] What do wmf use?
[15:10:14] RhinosF1: I'd never ever touch any row in production
[15:10:17] RhinosF1: MW?
[15:10:26] marostegui: always use sql.php ?
[15:10:37] Does that auto-create transactions so you can roll back?
[15:11:18] RhinosF1: First, I simply don't do it, and second, it needs to go through whatever MW uses, because me updating a row in production probably doesn't trigger the million other things that a proper script going through MW does (eg: maybe a cache needs to be invalidated?)
[15:12:38] marostegui: oh right, so you suggest updating rows should always be done via a maint script or something then
[15:12:47] RhinosF1: Absolutely
[15:12:52] ah right
[15:13:07] RhinosF1: If we are talking about the admin part, that's different of course
[15:13:46] marostegui: one of these could have been done via a maint script
[15:14:04] One was someone who had the wrong thing on their clipboard when they wanted a select
[15:14:09] And pasted it in
[15:14:21] I find touching production data via the mysql CLI a total no-go
[15:14:25] Which could be managed by having a read-only tool for read-only stuff
[16:26:01] Amir1: don't suppose you've got some CFT to look at T381891 and T381893 please? Two losses from rclone losing races on Monday, rather like T380738 from the other week
[16:26:03] T381891: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891
[16:26:04] T381893: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893
[16:26:04] T380738: Schuur - Nieuwerbrug - 20164513 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T380738
[16:27:06] [presumably I could do the new uploads myself (with my volunteer account rather than staff?), but I feel like at least a second opinion first would be wise]
[16:30:29] and I now understand why this race-losing is resulting in truncated uploads, but this is all sadness and weeping
[17:23:48] FIRING: MysqlReplicationLagPtHeartbeat: MySQL instance db2158:9104 has too large replication lag (11m 20s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2158&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[17:33:40] ^ i fixed that already
[20:26:51] PROBLEM - MariaDB sustained replica lag on s3 on db1198 is CRITICAL: 48.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
[20:32:51] RECOVERY - MariaDB sustained replica lag on s3 on db1198 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1198&var-port=9104
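
To make the safeguards discussed between 15:07 and 15:14 concrete, here is a minimal SQL sketch of the kind of session-level guard rails a human could use when a direct CLI connection is unavoidable: safe-update mode to reject unkeyed DELETEs/UPDATEs, an explicit transaction that can be inspected and rolled back, and a read-only account for read-only work. The table, database, user, and host names are hypothetical illustrations, not WMF or Miraheze configuration; as marostegui says above, the preferred approach is to not touch rows directly at all and to go through MediaWiki maintenance scripts instead. None of this replaces that; it only reduces the blast radius of a mistyped or mis-pasted statement.

  -- Safe-update mode: the server rejects UPDATE/DELETE statements that have
  -- neither a key-based WHERE clause nor a LIMIT (available in MySQL and in
  -- recent MariaDB versions).
  SET SESSION sql_safe_updates = 1;
  -- With the guard on, a pasted "DELETE FROM page;" now errors out instead of
  -- emptying the table.

  -- For an unavoidable manual change, an explicit transaction leaves room to
  -- inspect the effect before it becomes permanent (global autocommit stays 1).
  START TRANSACTION;
  UPDATE page SET page_namespace = 0 WHERE page_id = 12345;  -- hypothetical row
  SELECT ROW_COUNT();   -- expect exactly 1 affected row
  ROLLBACK;             -- or COMMIT once the row count looks right

  -- A read-only account for "read-only stuff", so a stray paste cannot write.
  CREATE USER 'readonly'@'10.0.0.%' IDENTIFIED BY 'change-me';
  GRANT SELECT ON mywiki.* TO 'readonly'@'10.0.0.%';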