[06:45:06] <hashar>	 hello, I have discovered we can populate scap dsh groups by querying PuppetDB for the deployment targets. That was first introduced by j.o.e and I'd like to adopt the same for a few more deploys. So instead of maintaining a list of nodes, it is replaced by `Scap::Target[integration/docroot]`, which saves us from having to change the list whenever machines are added or removed.
[06:45:12] <hashar>	 I have three small changes starting at https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483
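The idea in hashar's message, deriving the host list from PuppetDB instead of maintaining it by hand, can be illustrated with a minimal sketch. It assumes the targets are looked up through PuppetDB's v4 `resources` query endpoint with an AST query on resource type and title; the helper names and the sample hostnames are hypothetical, and the linked Puppet changes may well do this differently (for example from inside Puppet rather than from a script).

```python
import json


def build_target_query(resource_type: str, title: str) -> str:
    """Return a PuppetDB AST query (as JSON) matching one resource by type and title."""
    ast = ["and", ["=", "type", resource_type], ["=", "title", title]]
    return json.dumps(ast)


def hosts_from_resources(resources: list) -> list:
    """Extract sorted, de-duplicated certnames from a resources-endpoint response."""
    return sorted({resource["certname"] for resource in resources})


# Hypothetical response body, shaped like PuppetDB resource entries.
# A real lookup would send `query` to /pdb/query/v4/resources instead.
sample_response = [
    {"certname": "host2001.example.org", "type": "Scap::Target", "title": "integration/docroot"},
    {"certname": "host1001.example.org", "type": "Scap::Target", "title": "integration/docroot"},
]

query = build_target_query("Scap::Target", "integration/docroot")
hosts = hosts_from_resources(sample_response)

print(query)
print("\n".join(hosts))
```

Because the list is computed from whichever nodes declare the resource, adding or removing a machine needs no change to the dsh group definition.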
[08:32:53] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:50:00] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Vgutierrez)
[09:13:40] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm) a:JMeybohm
[09:28:14] <wikibugs>	 serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[09:56:33] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[09:56:42] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert) Open→In progress
[10:00:45] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (elukey)
[10:00:54] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:02:36] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:25:08] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:29:51] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:33:01] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:33:19] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 started.
[10:47:58] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 comple...
[11:38:09] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[11:38:19] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[11:42:01] <wikibugs>	 serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[11:42:19] <wikibugs>	 serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert) p:Triage→High
[12:08:44] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[12:15:30] <wikibugs>	 serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (LSobanski)
[12:16:40] <wikibugs>	 serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[12:21:37] <wikibugs>	 serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Marostegui) I just wanted to mention that despite of the sudden spike on DB reads, our databases kept up just fine in general. We did have timeouts on some enwiki (s1) replicas...
[12:28:26] <wikibugs>	 serviceops, MW-on-K8s, Scap, Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (TheresNoTime) If I'm reading the output of a scap failure I just had (P45862) correctly, this warning is now triggering an "error", and causing a stage to rollback?...
[12:33:59] <hnowlan>	 a few changes for thumbor in k8s for review if you have time - most notable are the timeout queue increase, which is an experiment rather than a permanent solution, and the increase in replicas: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/898728
[13:29:29] <wikibugs>	 serviceops, MW-on-K8s, Scap, Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (JMeybohm) >>! In T329899#8691439, @TheresNoTime wrote: > If I'm reading the output of a scap failure I just had (P45862) correctly, this warning is now triggering an...
[14:09:27] <wikibugs>	 serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[14:09:32] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:13:05] <wikibugs>	 serviceops, Data-Persistence, SRE: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:13:16] <wikibugs>	 serviceops, Data-Persistence, SRE: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:13:24] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:13:38] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:14:02] <wikibugs>	 serviceops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:14:28] <wikibugs>	 serviceops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert) p:Triage→Medium
[14:14:50] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[14:15:42] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert) p:High→Medium
[14:16:06] <wikibugs>	 serviceops, SRE: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (akosiaris)
[14:16:32] <wikibugs>	 serviceops, SRE: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (akosiaris)
[14:17:17] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[14:18:49] <wikibugs>	 serviceops, SRE: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (akosiaris)
[14:19:09] <wikibugs>	 serviceops, SRE: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (akosiaris)
[14:19:49] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:20:35] <wikibugs>	 serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert) In progress→Resolved We ran into a powerdns configuration issue which meant that instead of traffic being spread over both datacenters, we completely switched...
[14:20:55] <wikibugs>	 serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (akosiaris)
[14:21:07] <wikibugs>	 serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (akosiaris)
[14:21:45] <wikibugs>	 serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (akosiaris)
[14:22:03] <wikibugs>	 serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (akosiaris)
[14:23:44] <wikibugs>	 serviceops: Migrate poolcounter hosts to bullseye - https://phabricator.wikimedia.org/T332015 (akosiaris)
[14:23:48] <wikibugs>	 serviceops: Migrate poolcounter hosts to bullseye - https://phabricator.wikimedia.org/T332015 (akosiaris)
[14:24:57] <wikibugs>	 serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016 (akosiaris)
[14:25:24] <wikibugs>	 serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016 (akosiaris)
[14:46:42] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Re-enable seccomProfile in cert-manager chart after k8s 1.23 migration completed - https://phabricator.wikimedia.org/T325620 (JMeybohm) Open→Resolved Will roll this out to all clusters via {T325292}
[14:46:47] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[14:46:49] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update cert-manager to 1.10.x - https://phabricator.wikimedia.org/T325292 (JMeybohm)
[14:55:21] <wikibugs>	 serviceops, mwcli: Create /nonexistent directory for nobody user in golang images - https://phabricator.wikimedia.org/T331209 (Addshore)
[14:59:19] <wikibugs>	 serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (elukey)
[15:22:00] <wikibugs>	 serviceops, SRE: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (MoritzMuehlenhoff) The backports are complete and support Unicode 13 now!   ` jmm@jmm-mw-icu67:~$ php -r "var_dump(IntlChar::getUnicodeVersion());" array(4) {   [0]=>   int(13)   [1]=>   int(0)   [2]=>   int(0)...
[15:30:04] <wikibugs>	 serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (herron)
[16:37:23] <wikibugs>	 serviceops, Data-Persistence (work done), Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (akosiaris) Removing #SRE as the more specific team is tagged already.
[16:40:31] <wikibugs>	 serviceops, Prod-Kubernetes: cert-manager created multiple CertificateRequest objects with the same certificate-revision - https://phabricator.wikimedia.org/T304092 (JMeybohm) This is running in wikikube and ml-staging now. Will give it a couple of days (e.g. cert refreshes) and deploy to prod then.
[16:55:29] <wikibugs>	 serviceops, Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (akosiaris) Adding #serviceops, removing #SRE to triage this towards the more specific SRE team.
[16:56:24] <wikibugs>	 serviceops, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (akosiaris) Adding #serviceops, removing #SRE as the more specific team that can drive it forward.
[17:08:23] <wikibugs>	 serviceops, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:08:37] <wikibugs>	 serviceops, ChangeProp, envoy, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:18:23] <wikibugs>	 serviceops, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (akosiaris) Removing #SRE, has already been triaged to a m...
[17:19:56] <wikibugs>	 serviceops, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:20:01] <wikibugs>	 serviceops, observability, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:20:40] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE-tools, Sustainability (Incident Followup): Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam (2 o...
[17:21:26] <wikibugs>	 serviceops, observability, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:21:30] <wikibugs>	 serviceops, Thumbor, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:23:05] <wikibugs>	 serviceops, Platform Engineering Roadmap Decision Making, Traffic, MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam(2 o...
[17:24:33] <wikibugs>	 serviceops, conftool, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (akosiaris) Triaging to serviceops because of conftool
[17:30:08] <wikibugs>	 serviceops, observability, Sustainability (Incident Followup), User-Joe, User-jijiki: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (akosiaris) Removing #SRE, triaging to #serviceops. `redis_misc` is in our care as a team and...
[17:30:30] <wikibugs>	 serviceops, Continuous-Integration-Config, Regression, Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:52:57] <wikibugs>	 serviceops, SRE, Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris)
[17:53:11] <wikibugs>	 serviceops, SRE, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any...
[17:53:20] <wikibugs>	 serviceops, SRE, Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris)
[17:53:34] <wikibugs>	 serviceops, SRE, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any more...
[18:00:05] <hnowlan>	 some scary unapplied version diffs in admin_ng for codfw - haven't applied them but just a heads-up 
[18:00:27] <hnowlan>	 cert-manager: 1.10.3 -> 1.10.5 
[18:00:39] <hnowlan>	 that's the chart
[18:23:36] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)
[18:24:25] <wikibugs>	 serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)
[19:41:53] <wikibugs>	 serviceops, Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (Eevans) Open→Resolved a:Eevans This is complete; Closing
[19:44:17] <wikibugs>	 serviceops, Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (Eevans) >>! In T320401#8559202, @BCornwall wrote: > If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the s...