[06:45:06] hello, I have discovered we can populate scap dsh groups by querying the PuppetDB for the deployment targets. That got first introduced by j.o.e and I'd like to adopt the same for a few more deploy. So instead of having a list of nodes it is replaced by `Scap::Target[integration/docroot]` which saves us from having to change the list whenever machines are added/removed etc.
[06:45:12] I have three small changes starting at https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483
[08:32:53] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:50:00] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Vgutierrez)
[09:13:40] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm) a:JMeybohm
[09:28:14] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[09:56:33] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[09:56:42] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert) Open→In progress
[10:00:45] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (elukey)
[10:00:54] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:02:36] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:25:08] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:29:51] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:33:01] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[10:33:19] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 started.
[10:47:58] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 comple...
[11:38:09] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[11:38:19] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[11:42:01] serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[11:42:19] serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert) p:Triage→High
[12:08:44] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[12:15:30] serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (LSobanski)
[12:16:40] serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[12:21:37] serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Marostegui) I just wanted to mention that despite of the sudden spike on DB reads, our databases kept up just fine in general. We did have timeouts on some enwiki (s1) replicas...
[12:28:26] serviceops, MW-on-K8s, Scap, Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (TheresNoTime) If I'm reading the output of a scap failure I just had (P45862) correctly, this warning is now triggering an "error", and causing a stage to rollback?...
[12:33:59] a few changes for thumbor in k8s for review if you have time - most notable are the timeout queue increase which is an experiment rather than a permanent solution, and the increase in replicas https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/898728
[13:29:29] serviceops, MW-on-K8s, Scap, Upstream: Kubernetes configuration file is group-readable - https://phabricator.wikimedia.org/T329899 (JMeybohm) >>! In T329899#8691439, @TheresNoTime wrote: > If I'm reading the output of a scap failure I just had (P45862) correctly, this warning is now triggering an...
[14:09:27] serviceops, Data-Persistence, SRE: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[14:09:32] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:13:05] serviceops, Data-Persistence, SRE: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:13:16] serviceops, Data-Persistence, SRE: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:13:24] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[14:13:38] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:14:02] serviceops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert)
[14:14:28] serviceops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Clement_Goubert) p:Triage→Medium
[14:14:50] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert)
[14:15:42] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (Clement_Goubert) p:High→Medium
[14:16:06] serviceops, SRE: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (akosiaris)
[14:16:32] serviceops, SRE: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (akosiaris)
[14:17:17] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert)
[14:18:49] serviceops, SRE: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (akosiaris)
[14:19:09] serviceops, SRE: Migrate dragonfly-supernodes to bullseye - https://phabricator.wikimedia.org/T332011 (akosiaris)
[14:19:49] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[14:20:35] serviceops, Data-Persistence, SRE: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (Clement_Goubert) In progress→Resolved We ran into a powerdns configuration issue which meant that instead of traffic being spread over both datacenters, we completely switched...
[14:20:55] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (akosiaris)
[14:21:07] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (akosiaris)
[14:21:45] serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (akosiaris)
[14:22:03] serviceops, SRE: Migrate chartmuseum to bullseye - https://phabricator.wikimedia.org/T331969 (akosiaris)
[14:23:44] serviceops: Migrate poolcounter hosts to bullseye - https://phabricator.wikimedia.org/T332015 (akosiaris)
[14:23:48] serviceops: Migrate poolcounter hosts to bullseye - https://phabricator.wikimedia.org/T332015 (akosiaris)
[14:24:57] serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016 (akosiaris)
[14:25:24] serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016 (akosiaris)
[14:46:42] serviceops, Prod-Kubernetes, Kubernetes: Re-enable seccomProfile in cert-manager chart after k8s 1.23 migration completed - https://phabricator.wikimedia.org/T325620 (JMeybohm) Open→Resolved Will roll this out to all clusters via {T325292}
[14:46:47] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Post Kubernetes v1.23 cleanup - https://phabricator.wikimedia.org/T328291 (JMeybohm)
[14:46:49] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update cert-manager to 1.10.x - https://phabricator.wikimedia.org/T325292 (JMeybohm)
[14:55:21] serviceops, mwcli: Create /nonexistent directory for nobody user in golang images - https://phabricator.wikimedia.org/T331209 (Addshore)
[14:59:19] serviceops, Foundational Technology Requests, Prod-Kubernetes, Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (elukey)
[15:22:00] serviceops, SRE: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (MoritzMuehlenhoff) The backports are complete and support Unicode 13 now! ` jmm@jmm-mw-icu67:~$ php -r "var_dump(IntlChar::getUnicodeVersion());" array(4) { [0]=> int(13) [1]=> int(0) [2]=> int(0)...
[15:30:04] serviceops, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (herron)
[16:37:23] serviceops, Data-Persistence (work done), Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (akosiaris) Removing #SRE as the more specific team is tagged already.
[16:40:31] serviceops, Prod-Kubernetes: cert-manager created multiple CertificateRequest objects with the same certificate-revision - https://phabricator.wikimedia.org/T304092 (JMeybohm) This is running in wikikube and ml-staging now. Will give it a couple of days (e.g. cert refreshes) and deploy to prod then.
[16:55:29] serviceops, Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (akosiaris) Adding #serviceops, removing #SRE to triage this towards the more specific SRE team.
[16:56:24] serviceops, Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (akosiaris) Adding #serviceops, removing #SRE as the more specific team that can drive it forward.
[17:08:23] serviceops, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:08:37] serviceops, ChangeProp, envoy, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:18:23] serviceops, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (akosiaris) Removing #SRE, has already been triaged to a m...
[17:19:56] serviceops, WMF-JobQueue, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:20:01] serviceops, observability, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:20:40] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE-tools, Sustainability (Incident Followup): Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam (2 o...
[17:21:26] serviceops, observability, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:21:30] serviceops, Thumbor, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:23:05] serviceops, Platform Engineering Roadmap Decision Making, Traffic, MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam(2 o...
[17:24:33] serviceops, conftool, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (akosiaris) Triaging to serviceops because of conftool
[17:30:08] serviceops, observability, Sustainability (Incident Followup), User-Joe, User-jijiki: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (akosiaris) Removing #SRE, triaging to #serviceops. `redis_misc` is in our care as a team and...
[17:30:30] serviceops, Continuous-Integration-Config, Regression, Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (akosiaris) Removing #SRE, has already been triaged to a more specific SRE subteam
[17:52:57] serviceops, SRE, Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris)
[17:53:11] serviceops, SRE, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create swift container-to-container synchronization metrics - https://phabricator.wikimedia.org/T229117 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any...
[17:53:20] serviceops, SRE, Sustainability (Incident Followup): docker-registry: some layers has been corrupted due to deleting other swift containers - https://phabricator.wikimedia.org/T228196 (akosiaris)
[17:53:34] serviceops, SRE, Release-Engineering-Team (Radar), Sustainability (Incident Followup): create a docker_registry_codfw swift container backup - https://phabricator.wikimedia.org/T229118 (akosiaris) Open→Declined I am gonna close this as Declined. Many years later I don't remember any more...
[18:00:05] some scary unapplied version diffs in admin_ng for codfw - haven't applied them but just a heads-up
[18:00:27] cert-manager: 1.10.3 -> 1.10.5
[18:00:39] that's the chart
[18:23:36] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)
[18:24:25] serviceops, Data-Persistence, SRE, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Dzahn)
[19:41:53] serviceops, Sustainability (Incident Followup): sessionstore: alert on rate of status 500 responses - https://phabricator.wikimedia.org/T327960 (Eevans) Open→Resolved a:Eevans This is complete; Closing
[19:44:17] serviceops, Sustainability (Incident Followup): Alert on Kask error rate - https://phabricator.wikimedia.org/T320401 (Eevans) >>! In T320401#8559202, @BCornwall wrote: > If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the s...