[00:39:14] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:14] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:15] federico3: https://phabricator.wikimedia.org/T419635 is causing issues on the views for imagelinks, can you update on which sections this has been run and whether it is currently running?
[05:14:44] federico3: https://phabricator.wikimedia.org/T422459
[07:16:59] checking the rclone thing, looks like swift was somewhat unhappy about 3pm yesterday? Apr 6 15:07:54 ms-be1069 swift-rclone-sync[806853]: ERROR : wikipedia-en-local-deleted.nr/n/r/h: error reading destination directory: Get "/wikipedia-en-local-deleted.nr?delimiter=%2F&format=json&limit=1000&prefix=n%2Fr%2Fh%2F": unsupported protocol scheme ""
[07:18:58] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:30] @marostegui: OK updating in a bit
[08:05:48] federico3: mmm, you've marked them as done (eg: s2) but the masters aren't done, right?
[08:06:19] no, the masters will be done later on with flips etc
[08:06:32] I'd suggest you don't mark them as done then
[08:06:41] It can be confusing
[08:08:22] ok, updated. Maybe if we need to show the progress on the replicas only in future we could split the checkboxes in 2 groups
[08:08:23] federico3: And I also think you can run it on most masters (maybe not s4 - I've not checked the size) as drop column is an online operation, so it should be safe to run on intermediate masters and later on primary
[08:08:42] federico3: I simply add: "pending masters" and do not mark it as done
[08:09:16] federico3: With your last edit, it shows nothing as done. See above ^
[08:12:36] I checked on mariadb's documentation and the index and column drop seemed to be expected to be really quick but Amir recommended downtiming for 12 hours https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/62
[08:13:06] federico3: Sure, but that's independent from being able to run it on masters
[08:14:13] you mean it's not going to impact their performance (e.g. I/O load)?
[08:14:25] federico3: it shouldn't
[08:14:32] it may take long, but it should be doable
[08:14:51] For most wikis, definitely s2 and s6 should be fine
[08:18:02] yeah, don't run it on master of s4 unless you like editors screaming at you and s1 to be on safe side too but other sections should be fine
[08:20:00] federico3: Going back to the previous topic of the progress, if a section is fully done and just pending masters, just say so I'd say, otherwise we don't know where the section stands (with the current version of the task)
[08:21:24] ok for now I'll go through the sections and flag them as "done except masters"
[08:21:38] that works yep
[08:21:40] thanks
[08:22:18] remember you can also just run --dc-masters on the CLI to get the masters done, and I suggest you do intermediate master before primary
[08:42:33] federico3: The whole conversation about marking/not marking things as in progress comes from: https://phabricator.wikimedia.org/T422459 where I had to dig thru the schema change task to see what was done or not hence my request to keep things up to date on the schema change tasks
[09:15:29] @marostegui I can run that tool after each section but these runs are pretty slow and there's an amount of time before the section is done. How and where is the tool run? E.g. directly on the clouddb/redacted hosts?
[09:18:23] federico3: Yeah, you need to run it on each clouddb/redacted for the given section. Ideally as soon as you can before the change has been executed on the clouddb* hosts; otherwise that view remains broken
[09:19:50] you mean immediately after the section is done or during the auto_schema run?
[09:20:37] federico3: as soon as the change is finished on the clouddb*/redacted host
[09:20:46] it doesn't matter if the rest of the section is done or not
[09:29:10] I'm reviewing the timing and the scripts: with some large sections auto_schema takes a long time and some changes would happen out of working hours
[09:29:55] federico3: yes, and that's fine, it doesn't have to be at the minute, it can wait till the following office day if that happens over night
[09:30:41] I ran it everywhere so it should be clean now for the section you've already finished, but just keep that task in mind as your schema change progresses
[09:32:07] perhaps can we run maintain-views --all-databases --replace --table imagelinks multiple times at intervals e.g. every hour or does it have an impact?
[09:36:03] no, we can't run it like that
[10:17:17] happy to help with running maintain-views on clouddbs
[10:28:00] There have been reports like T422206 and T422200, so running maintain-views on the shards where dropping il_to is done would be nice :)
[10:28:01] T422206: fiwiki_p.imagelinks view is broken on Toolforge replica (ERROR 1356), while enwiki_p works - https://phabricator.wikimedia.org/T422206
[10:28:01] T422200: 1356: Internal Server Error (ruwiki_p.imagelinks) - https://phabricator.wikimedia.org/T422200
[11:06:09] hi, i've just joined DPE SRE last week and going thru monitoring/alerts
[11:08:12] i noticed aqs1015.eqiad.wmnet mdadm alerts, are there any actions for it? my teammates told me it is kind of a shared responsibility machine and it is better to coordinate it here
[11:20:33] atsukoito: I'd ping u.random about that, I think, but he won't likely be online for a few hours yet
[12:14:54] thanks Emperor
[12:57:01] atsukoito: it has a couple of failed SSDs, but is slated to be replaced by a routine refresh Real Soon Now™, see: T412830
[12:57:02] T412830: Hardware refresh of aqs101[0-2,4-5] w/ aqs102[3-7] - https://phabricator.wikimedia.org/T412830
[12:57:12] (so no action needed)