[07:50:13] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9920330 (10JMeybohm) >>! In T362978#9919727, @Scott_French wrote: > That's everything that comes to mind from a quick scan of "published...
[08:51:04] 06serviceops, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.06.17 - 2024.07.07), 13Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9920512 (10brouberol) I've reported the suggestion [[ https://github.com/cloudnative-pg/ch...
[08:52:45] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366 (10elukey) 03NEW
[08:53:06] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9920526 (10elukey)
[08:58:14] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9920548 (10elukey)
[08:58:15] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9920549 (10elukey)
[09:50:30] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920844 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ca43ab0-579a-4f82-97aa-11720f300bd7) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:54:13] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920870 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=046a1781-9fad-454c-b26b-ad2c96d2d8b2) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:58:56] 06serviceops, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.06.17 - 2024.07.07), 13Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9920892 (10brouberol) I had a second look at the RBAC, and ||some|| non-namespaced resourc...
[10:48:41] 06serviceops, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9921167 (10MoritzMuehlenhoff)
[10:54:12] 06serviceops, 10iPoid-Service (iPoid 1.0), 10Trust and Safety Product Sprint (Sprint 13 (July 15 - July 26)): Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#9921182 (10kostajh)
[11:33:53] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9921271 (10Jdforrester-WMF)
[11:33:55] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9921272 (10Jdforrester-WMF)
[14:11:02] folks I'm not sure who might be best to speak to about kafka-main?
[14:11:36] kafka-main1010 is in rack E5 which we're hoping to upgrade today (T365986)
[14:11:53] I can take that one
[14:11:54] I got my wires crossed didn't realise service-ops looked after those
[14:12:05] akosiaris: thanks
[14:12:08] arguably, we are a bad team for looking after those
[14:12:19] so, your wires got crossed for a good reason
[14:12:32] anyway, what can I help with?
[14:12:57] basically whether we need to take any action before the switch maintenance to depool it?
[14:13:18] or will the overall service be ok if that host is unavailable for 15 mins or so?
[14:13:18] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921873 (10elukey)
[14:15:33] * akosiaris looking
[14:15:45] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921878 (10hashar) The images based on Stretch under `dev/` have been removed via T290532 For `releng/` namespaces, that is the image for Zuul/CI and we have phased o...
[14:17:45] topranks: kafka-main1010 is in the insetup role still, you are good to go whatever you want
[14:17:53] and do*
[14:18:07] aksoiaris: ok well that makes it simple, I should probably check for that :)
[14:18:17] akosiaris: even :)
[14:18:44] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921886 (10brennen) +1 for `dev/stretch*`. The only one I don't know anything about is `dev/stretch-scap-deps`. Doesn't turn up in codesearch or GitLab search, at an...
[14:18:58] and there are no other kafka-mains in the racks we'll be working on over the coming weeks either
[14:19:03] so all good there thanks!
[14:40:34] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9922000 (10Scott_French) Thanks, @SGupta-WMF! Ahmon tends to be quite responsi...
[14:49:35] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9922015 (10elukey) Thanks a lot for the feedback, all images in T367427#9921815 removed from the registry.
[15:01:44] 06serviceops, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922064 (10Jdforrester-WMF)
[15:01:50] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9922078 (10Jdforrester-WMF)
[15:01:54] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9922077 (10Jdforrester-WMF)
[15:02:15] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9922080 (10Jdforrester-WMF)
[15:02:18] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9922062 (10Jdforrester-WMF) →14Duplicate dup:03T356293
[15:02:19] 06serviceops, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922079 (10Jdforrester-WMF)
[15:03:03] 06serviceops, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922086 (10Jdforrester-WMF)
[15:05:07] 06serviceops, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922082 (10Jdforrester-WMF)
[15:18:06] Hi folks! Anyone here that can help me with the `php7-fatal-error.php` handler? I'm looking for an instance of statsd-exporter to point it at.
[16:15:40] cwhite: mw-debug namespace has it listening on port udp/9125, 10.64.72.158
[16:15:46] is that sufficient for your tests?
[16:17:38] Hi akosiaris! I'm looking to point all usages of this script at a statsd-exporter in a production-stable manner.
[16:18:05] This is towards migrating `mediawiki.fatal.errors` away from graphite.
[16:18:21] (task for context: https://phabricator.wikimedia.org/T356814)
[16:18:43] each namespace (mw-web, mw-api-ext, mw-api-int, mw-parsoid, etc) has its own statsd-exporter
[16:19:34] and last I checked we were saying that we aren't gonna do DNS for addressing these cause it would end up causing so many DNS requests it would kill the recursors
[16:19:47] _joe_: correct me if I misremember
[16:20:25] <_joe_> akosiaris: correct
[16:20:28] Is it fair to say we ship a copy of `php7-fatal-error.php` separate from puppet these days?
[16:20:37] <_joe_> no
[16:20:40] <_joe_> :)
[16:20:55] <_joe_> well, sort of
[16:21:28] <_joe_> https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes/How_it_works#In_the_chart_2
[16:21:44] <_joe_> "/etc/wmerrors contains files defined via the mw.wmerrors value as filename:content yaml pairs. This value in production is fetched from /etc/helmfile-defaults/mediawiki/httpd.yaml, which is generated by puppet injecting the fatal-error.php file defined in puppet."
[16:22:07] <_joe_> so changing it in puppet will eventually change it in mw on k8s
[16:22:11] <_joe_> when it gets deployed
[16:22:18] <_joe_> (and after puppet runs on deploy1002)
[16:23:47] <_joe_> but at this point I think it's only used in k8s
[16:23:53] <_joe_> so maybe we can move it over
[16:25:05] ahh, ok. so my code changes are in the right place. I imagine this affects `error-params.php` as well? (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/mediawiki/error-params.php.erb)
[16:26:00] If so, the question is what do I set $dogstatsd_(host|port) to so that k8s routes the metrics correctly?
[16:28:32] <_joe_> cwhite: not sure about error-params.php, I should actually check
[16:28:55] <_joe_> cwhite: so... for k8s you can copy what we have in mediawiki-config
[16:29:04] <_joe_> we read an env variable for the host
[16:29:16] <_joe_> and we default to localhost where the env is not defined
[16:29:21] <_joe_> which works on bare metal
[16:31:24] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9922713 (10Scott_French) Ah, perfect! Thank you @JMeybohm - `search-grafana-dashboards.js` uncovered one more dashboard to migrate, and...
[16:32:08] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922720 (10Jhancock.wm) swapped DIMM_B1 for DIMM_B2 to test. error has cleared.
[16:33:12] _joe_: looking at `php7-fatal-error.php`, I'm not sure it uses anything from mediawiki-config? It looks fairly independent.
[16:33:31] for reference: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/mediawiki/php/php7-fatal-error.php
[16:33:44] <_joe_> cwhite: yeah but you can copy over the stuff we used to configure your stuff
[16:33:50] <_joe_> in mediawiki-config
[16:33:58] <_joe_> sorry, now in meeting :)
[16:34:47] I got you now, thanks :)
[16:59:54] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922861 (10BCornwall) 05Open→03Resolved Linux is happy, too. Thank you, @Jhancock.wm!
[17:45:35] the mw-on-k8s apache access logs on logstash, is that a downsample, or is that everything?
[18:03:40] hiiii
[18:31:18] cdanis: I _think_ it's ~everything except for a few logs that don't make it because bugs :D
[18:31:35] good to know thanks kamila_
[18:37:12] is there an easy way to get more request headers extracted?
[19:20:48] cdanis: I'm fairly sure it's a significant downsample: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/benthos/instances/mw_accesslog_sampler.yaml#23
[19:23:16] yeah you are right, thanks cwhite
[19:23:27] I could have used kafkacat + jq probably
[19:23:31] but in the meanwhile I used some other trickery
[19:23:37] as far as request headers go, k8s has probably diverged a bit from cee_ecs_accesslog_170 apache LogFormat
[19:23:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/httpd/files/defaults.conf#32
[19:40:14] 06serviceops: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932#9923572 (10Scott_French) Checking in on the last week of alerts, I'm seeing: 1. Occasional unavailability of a zotero pod in eqiad, as described in T366932#9894216 - i.e., one pod gets very busy (high CPU, etc.) and fa...
[20:33:49] an appserver ran out of disk. mw1446
[20:33:55] looking at it it's a videoscaler
[20:34:01] and there are almost 400G in /tmp
[20:34:14] unsure whether to delete anything or leave it alone
[20:34:57] lots of shellbox* and localcopy* files
[20:35:28] maybe that's why brion asked earlier on -wikitech
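Editor's note on the $dogstatsd_(host|port) question from 16:26 above: the pattern _joe_ describes (read the statsd target host from an environment variable set per namespace on k8s, fall back to localhost on bare metal) could look roughly like the sketch below. This is a minimal illustration only; the DOGSTATSD_HOST / DOGSTATSD_PORT variable names are placeholders, not the variables mediawiki-config actually reads, and the udp/9125 default matches the statsd-exporter port mentioned at 16:15.

```php
<?php
// Illustrative sketch, not the production error-params.php:
// resolve the statsd/dogstatsd target from the environment (set per
// namespace on k8s), falling back to localhost for bare-metal hosts.
// DOGSTATSD_HOST and DOGSTATSD_PORT are hypothetical names.
$dogstatsd_host = getenv( 'DOGSTATSD_HOST' ) ?: 'localhost';
$dogstatsd_port = (int)( getenv( 'DOGSTATSD_PORT' ) ?: 9125 );

// Fire-and-forget UDP send of a single statsd counter, so the error
// handler cannot fail a second time on the metrics path.
$fp = @fsockopen( 'udp://' . $dogstatsd_host, $dogstatsd_port, $errno, $errstr, 0.1 );
if ( $fp !== false ) {
    @fwrite( $fp, 'mediawiki.fatal.errors:1|c' );
    fclose( $fp );
}
```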