[01:44:48] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:18:46] PROBLEM - MariaDB sustained replica lag on s2 on db2175 is CRITICAL: 11.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[03:20:00] PROBLEM - MariaDB sustained replica lag on s2 on db1156 is CRITICAL: 13.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1156&var-port=9104
[03:20:02] PROBLEM - MariaDB sustained replica lag on s2 on db1182 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[03:21:46] RECOVERY - MariaDB sustained replica lag on s2 on db2175 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2175&var-port=9104
[03:22:00] RECOVERY - MariaDB sustained replica lag on s2 on db1156 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1156&var-port=9104
[03:22:02] RECOVERY - MariaDB sustained replica lag on s2 on db1182 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1182&var-port=9104
[05:45:03] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:45:04] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:34:59] I will be finishing my week soon, but looking intensely at the data growth on media backups
[10:59:33] 851 Terabytes backed up so far
[11:01:19] 160 million files
[11:30:58] <_joe_> Emperor, Amir1 when you have a finalized list of currently-canonical thumbnail sizes, we can think of starting to etch it in stone (and then peel away the stuff we don't want)
[11:31:40] <_joe_> and we can even evaluate pretty reasonably the impact that rate-limiting other sizes would have, using requestctl's logging functions.
[11:32:51] <_joe_> so, can the two of you carve that list somewhere, probably a hiera key in operations/puppet is ok, so that we can start the rest of the process?
[11:38:34] I'm OoO until a week Monday; am still working on analysing the frequency at which sizes are actually being requested (which ought to be relevant to that process)
[11:54:05] <_joe_> oh, enjoy :)
[12:52:22] FWIW, you can use a query like this https://phabricator.wikimedia.org/T402792#11175720
[12:54:37] post mortem of matrix.org db outage that caused a 24-hour-long downtime https://matrix.org/blog/2025/10/post-mortem/
[13:38:50] query> I had one mostly-done (see phab paste from yesterday)
[13:40:06] (possibly I refined it further, anyhow, will be back to this when not OoO)
[13:45:03] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:47:19] regarding cassandra access, what does the architecture look like? should we be able to just access it from the ml cluster in codfw and eqiad once we allow egress traffic on our end? what type of credentials are accepted?
[15:58:47] nvm there was some followup in the ticket, will continue there
[17:45:03] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:45:04] FIRING: [4x] PuppetFailure: Puppet has failed on ms-be2090:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure