[06:37:00] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[06:40:10] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[06:40:43] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[06:46:47] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[06:51:19] Puppet, Infrastructure-Foundations, Wikidata, wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (Addshore)
[07:01:15] Puppet, Infrastructure-Foundations, Wikidata, wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (Ladsgroup) It's a bit hard to implement this as systemd timers are not concurrent and the crons here are designed to be three at the same...
[07:04:09] Puppet, Infrastructure-Foundations, Wikidata, wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (Legoktm) >>! In T288175#7262223, @Ladsgroup wrote: > It's a bit hard to implement this as systemd timers are not concurrent and the crons...
[07:06:46] Puppet, Infrastructure-Foundations, Wikidata, wdwb-tech: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (Ladsgroup) This is going to get replaced with jobs soon (maybe in a couple of months) so I wouldn't put too much work in it. Having three...
[07:11:33] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[07:17:51] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[08:03:14] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[08:05:19] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui) Failover was done. Read-only times: Start: 08:00:29 AM UTC Stop: 08:00:47 AM UTC Tota...
[08:06:15] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[08:13:00] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui)
[08:13:12] SRE-tools, DBA, Infrastructure-Foundations, Recommendation-API, and 3 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (Marostegui) Open→Resolved
[09:10:29] I got no answer here so I will ask observability; it is quite possible this is the wrong channel/team
[09:13:05] jynus: I had missed the question
[09:13:18] oh, sorry about that
[09:13:28] AFAIK we do have monitoring for mdadm and if it doesn't catch this case it should be fixed
[09:13:29] again, this may not be the forum
[09:13:37] question was at: https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-sre-foundations/20210803.txt
[09:13:57] as for checking if it's happening I guess it's a cumin-command away
[09:14:01] my question is more: is this something I should file a ticket about (in your opinion)?
[09:14:26] do you think it is a real concern?
[09:14:51] do you have a host where this is still happening?
[09:15:18] I corrected all the backup ones, but I can check if there are more
[09:15:25] cumin 'F:raid ~ "md"'
[09:15:30] so let me do that
[09:15:34] should give you all the mdadm raid hosts
[09:15:39] and file a ticket if I see it recurring?
[09:15:50] does that sound like a good plan?
[09:15:50] sure, let's check it
[09:15:54] thanks
[09:15:58] +1 for the plan
[09:15:58] thanks to you!
[09:16:17] sorry for missing the question earlier this week
[09:16:33] sorry, as sometimes it is unclear to me if you are the right team to consult
[09:16:49] as it is right on the border between foundational stuff and monitoring :-)
[09:17:27] no prob, we should answer anyway, even if just saying it's not us ;)
[09:21:11] I gave it a run with a batch of 5
[09:25:55] volans, output doesn't look great https://phabricator.wikimedia.org/P16960
[09:30:12] I'm jumping in a meeting
[09:30:16] can check it later on
[09:30:22] I'm wondering if those partitions are used at all
[09:30:34] it should switch automatically to RW on the first write IIRC
[09:30:38] yeah, definitely not something urgent, but I would like someone's second look at it
[09:30:40] or maybe those had a disk replaced?
[09:31:51] I think I will create a task with all the context; worst case scenario, it is a non-issue and we close it as invalid
[09:32:40] maybe someone familiar with md internals can say "that's normal behaviour"
[10:00:04] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[10:01:34] netops, Infrastructure-Foundations: Lumen eqiad-codfw link down - https://phabricator.wikimedia.org/T288218 (ayounsi) p:Triage→High
[10:01:44] netops, Infrastructure-Foundations: Lumen eqiad-codfw link down - https://phabricator.wikimedia.org/T288218 (ayounsi)
[10:02:30] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (cmooney)
[10:02:51] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (cmooney)
[10:15:28] FYI, I got some feedback and I think it was what you said: "it should switch automatically to RW on the first write IIRC" T288212
[10:15:28] T288212: A few hosts on production with software raid (md) have partitions in resync=PENDING status - https://phabricator.wikimedia.org/T288212
[10:29:58] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[10:30:18] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T288036 (cmooney) Open→Resolved
[11:06:51] jynus: glad it was nothing :)
[11:06:55] thanks for looking into that
[11:23:46] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:24:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T288037 (cmooney) Open→Resolved
[11:25:58] netops, Infrastructure-Foundations, SRE: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[13:49:40] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Your replacement part associated with RMA R200361905 Item # 100 has been successfully shipped. Details of which are provided below. Replacement Serial Number: R...
[15:05:27] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) To replicate what I had found in the task description ([[ https://wiki.postgresql.org/wiki/Disk_Usage | source ]]) here is the same data...
[15:25:31] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney) I've moved netbox details (console and ethernet connection, IP addressing) from old device to the replacement device now, reflecting t...
[15:30:01] netops, DC-Ops, Infrastructure-Foundations, SRE: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (cmooney)
[15:34:07] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) I've also found this [[ https://tickets.puppetlabs.com/browse/PDB-4830 | PuppetDB issue ]] that might be related, even if not per-se dir...
[15:41:52] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) I think that we should definitely filter the more spamming facts outlined above, that should reduce the size of the table and complexity...
[16:34:44] CAS-SSO, Infrastructure-Foundations, GitLab (Initialization), Patch-For-Review, and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (brennen) cc: @Muehlenhoff as I think John's AFK for a bit.
[16:34:48] Puppet, Infrastructure-Foundations, Patch-For-Review, User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) I did some additional investigation on the edges table and so far I found this: - did query a random host for this API: `curl -vvo test...
[16:36:51] CAS-SSO, Infrastructure-Foundations, GitLab (Initialization), Patch-For-Review, and 2 others: Open gitlab.wikimedia.org to all users with Wikimedia developer accounts - https://phabricator.wikimedia.org/T288162 (brennen)