[04:27:23] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Marostegui) Thanks John! All looking good: ` root@db1129:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prima...
[04:32:38] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[04:33:42] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) eqiad master has been upgraded to 10.4+Buster. @jcrespo you can probably proceed with db1139 as you wish.
[04:48:42] DBA, Infrastructure-Foundations, SRE, Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need to revert this.
[04:49:10] DBA, Infrastructure-Foundations, SRE, Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[05:07:39] DBA, Infrastructure-Foundations, SRE, Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm)
[05:08:26] DBA, Infrastructure-Foundations, SRE, Traffic, and 3 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm)
[05:18:29] DBA, serviceops, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (Marostegui) @fgiunchedi we'd need to coordinate this in a way as this would arrive to all hosts as soon as puppet runs. My idea would be...
[05:23:22] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) @jcrespo am I good to go with codfw candidate master or you prefer to work out the backup sources there first? Let me know!
[05:23:53] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[06:40:13] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui)
[07:26:06] DBA, wikitech.wikimedia.org, Patch-For-Review: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) The MW side of things to be done during the RO time for wikitech would be: - Change s6.dblists to add labswiki - Change s10.dblists t...
[07:30:33] DBA, wikitech.wikimedia.org, Patch-For-Review: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Ladsgroup) I'd be more than happy to help in mw side of things, I think we should mark it RO in mediawiki before the change. It should be rather...
[07:34:54] DBA, wikitech.wikimedia.org, Patch-For-Review: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) Yeah, the idea is to: - Set wikitech as RO - Do all the changes - Set wikitech back to RW and make sure everything works We want to...
[07:37:53] Amir1: wouldn't composerDiff job show a strange change if it was
[07:39:12] Amir1: a quick search only shows https://github.com/wikimedia/operations-mediawiki-config/blob/c3c79b4da80fb9a638cd57e5fb40514ed844787c/docroot/noc/db.php#L123
[07:39:17] Which is a comment
[07:40:55] RhinosF1: good, we should fix the comment but obviously not urgent
[07:41:17] Amir1: conf tool makes various references though
[07:41:56] https://gerrit.wikimedia.org/g/operations/puppet/+/464e3021ed79fa18e131055a389d74e3de685cf6/conftool-data/dbconfig-section/sections.yaml / https://gerrit.wikimedia.org/g/operations/software/conftool/+/fe06f4da22b1669bcd0d4108edcc09f4b8784c89/conftool/tests/fixtures/dbconfig/integration/dbconfig-section/sections.yaml /
[07:41:56] https://gerrit.wikimedia.org/g/operations/software/conftool/+/fe06f4da22b1669bcd0d4108edcc09f4b8784c89/conftool/tests/integration/test_dbconfig.py
[07:41:58] that's SRE's domain
[07:42:29] That's the only other relevant mention of s10 in all of operations/* as far as I can see
[07:43:15] But MediaWiki side we should be fine
[07:48:20] Feel free to send amends to my initial patch by the way! I just wanted a place where we could discuss things :)
[07:48:54] Patch looks fine
[07:49:11] \o/
[07:49:52] I can probably try and create ones for conftool later as it looks easy
[07:52:39] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo) >>! In T287230#7245452, @Marostegui wrote: > @jcrespo am I good to go with codfw candidate master or you prefer to work out the backup sources there first? > Let me know! Yes, that can be don...
[07:55:55] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[08:02:21] Added you to the puppet patch, I'll do the software when I get a break
[08:26:26] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (elukey)
[08:28:15] RhinosF1: oh sweet, thanks
[08:28:25] Np
[08:28:35] so if I understand correctly, we can do it in September? we have a month then
[08:37:34] DBA, serviceops, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (LSobanski) Adding @nskaggs and @Bstorm for visibility.
[08:44:26] marostegui, sorry for answering re:s6 on codfw so late, should I reimage dbprov2002 today or wait?
[09:05:59] jynus: s6?
[09:06:02] you mean s2?
[09:06:22] this one T287230
[09:06:22] T287230: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230
[09:06:24] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[09:06:26] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[09:06:27] right, s2
[09:06:44] I can do s2 codfw candidate master today if you want yeah
[09:06:51] Waiting on whatever works best for you
[09:07:18] I adapt to you :-) my intention is doing dbprov a day before you do the full section
[09:08:03] so s2 primary master won't be done this week, it is scheduled for 10th
[09:08:07] oh, I see
[09:08:29] then you can do it at any time before the 10th, no dependency on me :-)
[09:08:53] I will be doing dbprov/source the 9th
[09:09:08] So I can do the candidate master today?
[09:09:15] if you want, sure!
[09:09:19] marostegui: https://gerrit.wikimedia.org/r/c/operations/software/conftool/+/708632 too. I think Chris is on baby leave so no idea who is best to review as he's every commit since volan.s in 2019
[09:09:30] ok, will do it now then
[09:09:31] thanks
[09:09:52] I will be prepping my switch, but not merge it yet
[09:09:52] RhinosF1: I would suggest v0lans or _joe_
[09:10:25] marostegui: vo.lans is off according to their nick so I'll try joe
[09:11:21] <_joe_> RhinosF1: that is an integration test, can be removed whenever we want
[09:11:32] <_joe_> no reason to wait for mediawiki to be done :)
[09:11:44] oh you're there
[09:11:58] if it's safe. good point that it's a test.
[09:12:02] RhinosF1, he is everywhere, watching!
[09:12:06] * RhinosF1 has 0 idea about conftool
[09:12:07] <_joe_> I mean I don't see a point in removing it either tbh
[09:12:49] RhinosF1: this is not urgent, it won't happen before September anyways, so we can add him and he can see it whenever he's back
[09:13:14] It seems strange to reference something that won't exist but ye, no rush on this
[09:13:33] <_joe_> RhinosF1: does "dcA" and "dcB" exist as datacenters?
[09:13:36] <_joe_> :)
[09:13:45] <_joe_> those are test data
[09:13:45] true
[09:14:00] i suppose tests are nonsense data anyway
[09:14:01] <_joe_> it's ok if they don't look like actual data 1:1
[09:14:23] <_joe_> not "nonsense", but I try not to use stuff that would come out of codesearch to confuse people
[09:14:27] <_joe_> (including this)
[09:14:43] i uploaded a patch for the prod conftool sections. I found it using codesearch for s10.
[09:15:40] <_joe_> yep my point, if we'd have used s100, s101, etc in the integration tests, you wouldn't have found it
[09:15:50] <_joe_> avoiding confusion
[09:16:05] agreed
[09:16:40] alexa just randomly woke up
[09:51:51] DBA, serviceops, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (fgiunchedi) >>! In T287574#7245440, @Marostegui wrote: > @fgiunchedi we'd need to coordinate this in a way as this would arrive to all h...
[10:32:38] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2104.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202107291032_mar...
[10:43:27] sorry to bother you, but if I could have your ok to merge this patch, especially this fix of cluster of mysqls: https://gerrit.wikimedia.org/r/c/operations/puppet/+/708473/10/hieradata/regex.yaml
[10:43:37] checking
[10:44:06] it is not super important, but it is otherwise making it complex to merge changes from other patches, as it touches very common hiera files
[10:44:36] The context is some mysql hosts appearing on the misc cluster, according to prometheus
[10:45:15] es202X for example in: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=misc&var-instance=All&var-datasource=thanos
[10:46:07] and the controversial part is considering "source backup" hosts as "mysql"s, but "dbprov" hosts as "backup" hosts
[10:46:18] It looks good, but can you run PCC just in case?
[10:46:22] sure
[10:46:50] worst case scenario it should not affect mysql, only the grafana dashboards, but I will
[10:47:49] I guess also cumin aliases
[10:58:26] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2104.codfw.wmnet'] ` and were **ALL** successful.
[10:59:37] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) db2104 (candidate) reimaged - checking its tables before pooling it back
[11:07:41] FYI, I got "Notice: /Stage[main]/Mariadb::Config/File[/etc/mysql/my.cnf]/ensure: removed (corrective)" on es2020
[11:07:48] that was not my patch
[11:08:51] uh?
[11:09:19] I got a change after puppet run on es2020, the above
[11:09:28] let me check another host
[11:09:28] a strange one, but that is on puppet normally
[11:09:47] the normal file is /etc/my.cnf
[11:09:53] I know
[11:10:03] and there is a directive to remove that other, but not sure why it ran now
[11:10:11] I got that message on es2021 too
[11:10:44] for the record, I don't think that is my patch, but something weird may have happened to add that file recently
[11:12:04] Doesn't happen on db1169
[11:13:02] lrwxrwxrwx 1 root root 24 Jul 29 11:05 my.cnf -> /etc/alternatives/my.cnf
[11:13:08] that's a very recent file
[11:13:17] some package update or something?
[11:13:30] that is es2022
[11:14:22] mmm, but no one logged at that time
[11:14:34] not even for an ssh upgrade
[11:23:46] yeah, I don't recall anything touching it
[11:23:53] I have been searching in gerrit
[11:23:55] and nothing there either
[11:24:13] maybe it was installed by puppet as a spare with a common mysql lib or something
[11:24:23] so it only appears on newer hosts
[11:24:30] what puppet did was correct
[11:24:31] [11:47:34] !log installing Mariadb 10.3.29 updates from Buster point release (as packaged in Debian, not the WMF DB packages)
[11:24:36] that's the only thing I can see
[11:24:41] from today
[11:24:54] ah, maybe the package date is misleading
[11:25:09] could be that it autocreates through common libs a my.cnf
[11:25:16] and then puppet cleans it up
[11:25:23] yeah and then puppet takes care of it
[11:25:24] yeah
[11:25:30] most likely that
[11:25:39] and puppet doing the safe thing
[11:25:46] could be yes
[11:26:00] this has been for long: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mariadb/manifests/config.pp$155
[11:26:11] I can check whether there's a way to prevent this for future mariadb updates from Debian?
[11:26:19] something maybe to remember that is a thing
[11:26:30] I think our puppet prevents further issues
[11:26:53] not sure there are actionables, more like, I was deploying and wanted to raise anything unexpected
[11:27:32] I think it is fine yeah
[11:27:38] ok
[11:27:39] It was a bit weird to see that, but now it all makes sense
[11:27:44] everything looks ok now
[11:27:55] es* cluster classification being fixed
[11:28:01] icinga is happy, etc.
[11:28:02] We can simply blame moritzm and issue solved!
[11:28:08] let's vote, I agree
[11:28:14] :-D
[11:28:14] +1
[11:29:18] after some time, all real mysql hosts should appear on the mysql cluster on grafana
[11:29:27] instead of some random ones on misc
[11:30:19] and there is now a new backup cluster being populated
[11:31:02] which will make it easier for me to provision and calculate disk needs
[12:08:43] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:13:36] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:14:46] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:20:40] marostegui: today I'm going to get pierogi. I thought I should mention it to piss you off
[12:21:04] Amir1: I am not jealous, I had quite a few kgs in my freezer - we had a visit a few days ago :)
[12:21:36] I actually still have a lot of them!
[12:21:38] Handmade too!
[12:21:51] ugh, my plans foiled
[12:22:32] next time
[12:27:11] DBA, Platform Engineering Code Jam, Platform Engineering Roadmap Decision Making, Performance-Team (Radar), User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (daniel)
[12:38:44] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[12:53:56] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[12:54:47] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[13:28:59] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[14:22:30] DBA: Move db1124 and db1125 back to test-cluster section - https://phabricator.wikimedia.org/T286329 (Marostegui) a:Marostegui
[14:38:38] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row A maintenance ` cp[1075-...
[14:45:23] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance ` dns1001....
[14:47:32] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Vgutierrez)
[14:48:52] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row A maintenance ` lvs1013....
[14:50:03] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Vgutierrez)
[15:05:04] DBA: Rename dbstore1004 to db1183 and place it on m5 - https://phabricator.wikimedia.org/T284622 (Marostegui) Stalled→Open This can proceed - network maintenance was done.
[15:05:10] DBA, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Marostegui)
[15:05:12] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[15:06:57] DBA: Rename dbstore1004 to db1183 and place it on m5 - https://phabricator.wikimedia.org/T284622 (Marostegui) That is the second part, which is moving db1183 from m5 to s7
[15:07:20] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) >>! In T286032#7245427, @Marostegui wrote: > m2-master failed over from dbproxy1013 to dbproxy1015. Once the maintenance is done we need t...
[16:25:34] DBA, serviceops, User-fgiunchedi, cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (Andrew) Restarting haproxies in wmcs is fairly harmless, just ping when it's time.
[17:47:54] o/
[17:49:09] so, parsing team uses testreduce1001 for running round trip tests ... part of that involves a local database that a node service connects to to track test runs and test results. all this while, we were connecting to it with a username and password .. the last run was earlier this week. but today, i am getting access denied errors.
[17:49:26] "ERROR 1698 (28000): Access denied for user 'testreduce'@'localhost'"
[17:50:26] majavah pointed me to https://sal.toolforge.org/log/aCCp8XoB1jz_IcWufNo8 in another channel ... i was wondering if that might have broken something on testreduce1001.eqiad.wmnet by disabling the authentication method with a plain text password ..
[17:51:53] and i suppose my question is how do i fix this? i can file a phab task, but wanted to check if there is something obvious i am missing that i can fix / tweak.
[18:35:52] the good old restart trick might have worked ... i restarted mysql service on the server .. testing
[18:35:56] _joe_, ^
[18:36:35] <_joe_> subbu: hah so it was tied to the upgrade probably
[18:36:52] <_joe_> I was looking at the 'mysql' database and things seemed to all be there
[18:37:15] yup, fixed it for the mysql client .. and for the node.js service, i had to upgrade the mysql library since the old library broke with the upgrade.
[18:37:51] yes, i did some wild-goose chase myself before deciding to restart mysql and see if that did the trick ... :)
[18:38:19] anyway, thanks for the pointers! pro tip: restart service before trying anything else. :)
[19:52:08] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) db2104 check was ok, started replication again.
[19:52:18] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)