[07:50:13] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9920330 (10JMeybohm) >>! In T362978#9919727, @Scott_French wrote: > That's everything that comes to mind from a quick scan of "published...
[08:51:04] 06serviceops, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.06.17 - 2024.07.07), 13Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9920512 (10brouberol) I've reported the suggestion [[ https://github.com/cloudnative-pg/ch...
[08:52:45] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366 (10elukey) 03NEW
[08:53:06] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9920526 (10elukey)
[08:58:14] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9920548 (10elukey)
[08:58:15] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9920549 (10elukey)
[09:50:30] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920844 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7ca43ab0-579a-4f82-97aa-11720f300bd7) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:54:13] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Turn down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9920870 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=046a1781-9fad-454c-b26b-ad2c96d2d8b2) set by cgoubert@cumin1002 for 21 days, 0:00...
[09:58:56] 06serviceops, 06Infrastructure-Foundations, 10Data-Platform-SRE (2024.06.17 - 2024.07.07), 13Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9920892 (10brouberol) I had a second look at the RBAC, and ||some|| non-namespaced resourc...
[10:48:41] 06serviceops, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9921167 (10MoritzMuehlenhoff)
[10:54:12] 06serviceops, 10iPoid-Service (iPoid 1.0), 10Trust and Safety Product Sprint (Sprint 13 (July 15 - July 26)): Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#9921182 (10kostajh)
[11:33:53] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9921271 (10Jdforrester-WMF)
[11:33:55] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9921272 (10Jdforrester-WMF)
[14:11:02] folks I'm not sure who might be best to speak to about kafka-main?
[14:11:36] kafka-main1010 is in rack E5 which we're hoping to upgrade today (T365986)
[14:11:53] I can take that one
[14:11:54] I got my wires crossed didn't realise service-ops looked after those
[14:12:05] akosiaris: thanks
[14:12:08] arguably, we are a bad team for looking after those
[14:12:19] so, your wires got crossed for a good reason
[14:12:32] anyway, what can I help with?
[14:12:57] basically whether we need to take any action before the switch maintenance to depool it?
[14:13:18] or will the overall service be ok if that host is unavailable for 15 mins or so?
[14:13:18] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921873 (10elukey)
[14:15:33] * akosiaris looking
[14:15:45] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921878 (10hashar) The images based on Stretch under `dev/` have been removed via T290532 For `releng/` namespaces, that is the image for Zuul/CI and we have phased o...
[14:17:45] topranks: kafka-main1010 is in the insetup role still, you are good to go whatever you want
[14:17:53] and do*
[14:18:07] aksoiaris: ok well that makes it simple, I should probably check for that :)
[14:18:17] akosiaris: even :)
[14:18:44] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9921886 (10brennen) +1 for `dev/stretch*`. The only one I don't know anything about is `dev/stretch-scap-deps`. Doesn't turn up in codesearch or GitLab search, at an...
[14:18:58] and there are no other kafka-mains in the racks we'll be working on over the coming weeks either
[14:19:03] so all good there thanks!
[14:40:34] 06serviceops, 06SRE, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9922000 (10Scott_French) Thanks, @SGupta-WMF! Ahmon tends to be quite responsi...
[14:49:35] 06serviceops, 06Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch/Jessie - https://phabricator.wikimedia.org/T367427#9922015 (10elukey) Thanks a lot for the feedback, all images in T367427#9921815 removed from the registry.
[15:01:44] 06serviceops, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922064 (10Jdforrester-WMF)
[15:01:50] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9922078 (10Jdforrester-WMF)
[15:01:54] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9922077 (10Jdforrester-WMF)
[15:02:15] 06serviceops, 06Infrastructure-Foundations, 07Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9922080 (10Jdforrester-WMF)
[15:02:18] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Migrate mw-on-k8s base image from buster to bullseye - https://phabricator.wikimedia.org/T362981#9922062 (10Jdforrester-WMF) →14Duplicate dup:03T356293
[15:02:19] 06serviceops, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922079 (10Jdforrester-WMF)
[15:03:03] 06serviceops, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922086 (10Jdforrester-WMF)
[15:05:07] 06serviceops, 10MW-on-K8s, 06SRE, 13Patch-For-Review: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9922082 (10Jdforrester-WMF)
[15:18:06] Hi folks! Anyone here that can help me with the `php7-fatal-error.php` handler? I'm looking for an instance of statsd-exporter to point it at.
[16:15:40] cwhite: mw-debug namespace has it listening on port udp/9125, 10.64.72.158
[16:15:46] is that sufficient for your tests?
[16:17:38] Hi akosiaris! I'm looking to point all usages of this script at a statsd-exporter in a production-stable manner.
[16:18:05] This is towards migrating `mediawiki.fatal.errors` away from graphite.
[16:18:21] (task for context: https://phabricator.wikimedia.org/T356814)
[16:18:43] each namespace (mw-web, mw-api-ext, mw-api-int, mw-parsoid, etc) has its own statsd-exporter
[16:19:34] and last I checked we were saying that we aren't gonna do DNS for addressing these cause it would end up causing so many DNS requests it would kill the recursors
[16:19:47] _joe_: correct me if I misremember
[16:20:25] <_joe_> akosiaris: correct
[16:20:28] Is it fair to say we ship a copy of `php7-fatal-error.php` separate from puppet these days?
[16:20:37] <_joe_> no
[16:20:40] <_joe_> :)
[16:20:55] <_joe_> well, sort of
[16:21:28] <_joe_> https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes/How_it_works#In_the_chart_2
[16:21:44] <_joe_> "/etc/wmerrors contains files defined via the mw.wmerrors value as filename:content yaml pairs. This value in production is fetched from /etc/helmfile-defaults/mediawiki/httpd.yaml, which is generated by puppet injecting the fatal-error.php file defined in puppet."
[16:22:07] <_joe_> so changing it in puppet will eventually change it in mw on k8s
[16:22:11] <_joe_> when it gets deployed
[16:22:18] <_joe_> (and after puppet runs on deploy1002)
[16:23:47] <_joe_> but at this point I think it's only used in k8s
[16:23:53] <_joe_> so maybe we can move it over
[16:25:05] ahh, ok. so my code changes are in the right place. I imagine this affects `error-params.php` as well? (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/mediawiki/error-params.php.erb)
[16:26:00] If so, the question is what do I set $dogstatsd_(host|port) to so that k8s routes the metrics correctly?
[16:28:32] <_joe_> cwhite: not sure about error-params.php, I should actually check
[16:28:55] <_joe_> cwhite: so... for k8s you can copy what we have in mediawiki-config
[16:29:04] <_joe_> we read an env variable for the host
[16:29:16] <_joe_> and we default to localhost where the env is not defined
[16:29:21] <_joe_> which works on bare metal
[16:31:24] 06serviceops, 10Prod-Kubernetes, 13Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9922713 (10Scott_French) Ah, perfect! Thank you @JMeybohm - `search-grafana-dashboards.js` uncovered one more dashboard to migrate, and...
[16:32:08] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922720 (10Jhancock.wm) swapped DIMM_B1 for DIMM_B2 to test. error has cleared.
[16:33:12] _joe_: looking at `php7-fatal-error.php`, I'm not sure it uses anything from mediawiki-config? It looks fairly independent.
[16:33:31] for reference: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/mediawiki/php/php7-fatal-error.php
[16:33:44] <_joe_> cwhite: yeah but you can copy over the stuff we used to configure your stuff
[16:33:50] <_joe_> in mediawiki-config
[16:33:58] <_joe_> sorry, now in meeting :)
[16:34:47] I got you now, thanks :)
[16:59:54] 06serviceops, 06DC-Ops, 10ops-codfw, 06SRE, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9922861 (10BCornwall) 05Open→03Resolved Linux is happy, too. Thank you, @Jhancock.wm!
[17:45:35] the mw-on-k8s apache access logs on logstash, is that a downsample, or is that everything?
[18:03:40] hiiii
[18:31:18] cdanis: I _think_ it's ~everything except for a few logs that don't make it because bugs :D
[18:31:35] good to know thanks kamila_
[18:37:12] is there an easy way to get more request headers extracted?
[19:20:48] cdanis: I'm fairly sure it's a significant downsample: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/benthos/instances/mw_accesslog_sampler.yaml#23
[19:23:16] yeah you are right, thanks cwhite
[19:23:27] I could have used kafkacat + jq probably
[19:23:31] but in the meanwhile I used some other trickery
[19:23:37] as far as request headers go, k8s has probably diverged a bit from cee_ecs_accesslog_170 apache LogFormat
[19:23:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/httpd/files/defaults.conf#32
[19:40:14] 06serviceops: Alerting on under-scaled deployments - https://phabricator.wikimedia.org/T366932#9923572 (10Scott_French) Checking in on the last week of alerts, I'm seeing: 1. Occasional unavailability of a zotero pod in eqiad, as described in T366932#9894216 - i.e., one pod gets very busy (high CPU, etc.) and fa...
[20:33:49] an appserver ran out of disk. mw1446
[20:33:55] looking at it it's a videoscaler
[20:34:01] and there are almost 400G in /tmp
[20:34:14] unsure whether to delete anything or leave it alone
[20:34:57] lots of shellbox* and localcopy* files
[20:35:28] maybe that's why brion asked earlier on -wikitech
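Editor's note on the $dogstatsd_(host|port) question from 16:26 above: the pattern _joe_ describes (read the statsd target host from an environment variable set per namespace on k8s, fall back to localhost on bare metal) could look roughly like the sketch below. This is a minimal illustration only; the DOGSTATSD_HOST / DOGSTATSD_PORT variable names are placeholders, not the variables mediawiki-config actually reads, and the udp/9125 default matches the statsd-exporter port mentioned at 16:15.

```php
<?php
// Illustrative sketch, not the production error-params.php:
// resolve the statsd/dogstatsd target from the environment (set per
// namespace on k8s), falling back to localhost for bare-metal hosts.
// DOGSTATSD_HOST and DOGSTATSD_PORT are hypothetical names.
$dogstatsd_host = getenv( 'DOGSTATSD_HOST' ) ?: 'localhost';
$dogstatsd_port = (int)( getenv( 'DOGSTATSD_PORT' ) ?: 9125 );

// Fire-and-forget UDP send of a single statsd counter, so the error
// handler cannot fail a second time on the metrics path.
$fp = @fsockopen( 'udp://' . $dogstatsd_host, $dogstatsd_port, $errno, $errstr, 0.1 );
if ( $fp !== false ) {
    @fwrite( $fp, 'mediawiki.fatal.errors:1|c' );
    fclose( $fp );
}
```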