[09:50:22] https://phabricator.wikimedia.org/T357100 is a duplicate (edited the description)
[09:56:26] yeah, usually happens after reimages
[10:17:50] (PuppetDisabled) firing: Puppet disabled on ms-backup2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=backup&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:18:27] jynus: is that expected / does downtime need extending?
[10:18:51] oh, I must have forgotten to re-enable it after maintenance, fixing
[10:19:20] actually, wait, that shouldn't be disabled
[10:19:51] yeah, the mistake was disabling it in the first place
[10:20:04] I must have confused it with a backup host
[10:25:38] no, I know what it was, I was about to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/995188 but then the network maintenance blocked me
[10:26:52] should be fixed now
[10:28:09] ta :)
[10:32:51] (PuppetDisabled) firing: (2) Puppet disabled on ms-backup1001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=backup&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:33:07] ^that's outdated
[10:33:20] as I just ran it
[10:52:50] (PuppetDisabled) resolved: Puppet disabled on ms-backup2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=backup&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[11:48:01] Noting that I'm running a schema change on s5 and s3
[11:48:18] (https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance)
[11:48:23] Changing the PK only
[11:53:02] "only"
[11:58:52] and "only" pagelinks, one of the biggest tables in every wiki :P
[13:21:58] marostegui: maybe I'm miscalculating things, but for s3, just changing the PK is dropping 150 to 160GB. https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1150&var-datasource=thanos&var-cluster=mysql&viewPanel=28&from=1707439729207&to=1707484861318
[13:22:10] The same is showing up for db1140 as well
[13:24:43] if it is a smaller PK and the table has a significant number of secondary indexes, total index size can be reduced a lot
[13:25:16] although one should be careful, alters technically do the equivalent of an optimize, so it should be checked in the long term, with more fragmentation
[13:25:32] Yeah, I was going to say just that
[13:25:37] That it might be "temporary"
[13:28:25] I think Amir1 will already know this, but sharing it here so everybody can learn: https://www.slideshare.net/jynus/query-optimization-with-mysql-80-and-mariadb-103-the-basics#129
[13:29:07] yeah, optimize has some impact too
[13:29:29] Dropping the old columns will be a large reduction as well
[13:31:10] The other thing I have in mind is that s3 has a couple of really large botpedias with a pretty large pagelinks table: arzwiki, warwiki, etc.
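A minimal sketch of the kind of check discussed above, assuming a MariaDB replica and an illustrative database name ('examplewiki'); these are not the exact queries that were run. information_schema gives the table's overall data/index footprint plus free space, and mysql.innodb_index_stats breaks that footprint down per index, which is where a narrower primary key pays off, since every InnoDB secondary index stores a copy of the PK columns. Re-running the same queries weeks later shows whether the post-alter saving persists once fragmentation builds up again.

    -- Overall data vs. index footprint and free space for pagelinks
    -- ('examplewiki' is a placeholder database name)
    SELECT table_schema,
           table_name,
           ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
           ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb,
           ROUND(data_free    / 1024 / 1024 / 1024, 1) AS free_gb
      FROM information_schema.tables
     WHERE table_schema = 'examplewiki'
       AND table_name   = 'pagelinks';

    -- Per-index sizes (pages converted to GB); secondary indexes shrink
    -- together with the PK because each one embeds the PK columns
    SELECT index_name,
           ROUND(stat_value * @@innodb_page_size / 1024 / 1024 / 1024, 1) AS size_gb
      FROM mysql.innodb_index_stats
     WHERE database_name = 'examplewiki'
       AND table_name    = 'pagelinks'
       AND stat_name     = 'size';

The Grafana panel linked above only shows host-level disk usage, so a per-table breakdown like this helps attribute how much of the 150 to 160GB came from the smaller PK versus the implicit rebuild.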
[13:31:47] let me share a random thought too, if that helps (feel free to ignore)
[13:33:21] s3 (or whatever is the default db) used to have lots of small objects, but they were rarely accessed (new wikis, or event-focused wikis). At some point years ago I thought of creating an s0 section with very very low bandwidth dbs, on a VM even
[13:33:43] and when they get enough activity, move them to real hardware
[13:33:56] that way object overhead would be minimized
[13:34:11] but of course it is not a big win, and wasn't a high priority
[13:34:28] but wanted to throw it to you as an old idea
[13:35:18] let me think about it
[13:35:27] s3 itself doesn't have that many replicas
[13:35:39] well, yeah, the size is not that big
[13:35:57] but there is some overhead when backing them up and recovering them
[13:36:10] because of the thousands of small objects
[13:36:33] again, this is not even a suggestion
[13:36:44] just something I thought about, but never got to :-D
[13:37:26] As far as I know s5 is still the least loaded db at the moment
[13:38:15] yep, s5 is half the size of s3
[13:38:25] s5 has the smallest size in terms of total storage, like 400-500GB
[13:38:42] but s3 probably has the lowest number of replicas, let me double check
[13:38:59] so in a way it was done, but differently, by splitting it into s5
[15:33:48] (PuppetFailure) firing: Puppet has failed on restbase1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:38:48] (PuppetFailure) firing: (3) Puppet has failed on restbase1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:39:30] I'm not sure we need an email and an IRC message for every puppet failure?
[15:41:13] ugh... is it going to continue repeating?
[15:41:35] * urandom investigates
[15:43:48] (PuppetFailure) firing: (5) Puppet has failed on restbase1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:48:48] (PuppetFailure) firing: (7) Puppet has failed on restbase1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:58:48] (PuppetFailure) firing: (7) Puppet has failed on restbase1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:01:50] It should be good now.
[16:03:32] or not...
[16:03:53] or not, because it's not just restbase1034...
[16:03:56] * urandom sighs
[16:18:53] ok, now it should be good.
[16:19:45] these were recently put up by dcops (server refreshes for restbase) and got caught in some puppet 5-7 limbo, I guess
[16:20:09] imaged as 7 and then given role insetup::data-persistence maybe?
[16:54:18] (PuppetFailure) resolved: (2) Puppet has failed on restbase1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure