[00:07:00] <icinga-wm>	 PROBLEM - Check unit status of clean-stale-certs on acmechief2002 is CRITICAL: CRITICAL: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:07:03] <wikibugs>	 (PS1) Dzahn: add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938)
[00:07:23] <wikibugs>	 (PS2) Dzahn: add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938)
[00:07:36] <wikibugs>	 (CR) Dzahn: [V:+2 C:+2] add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:08:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852
[00:08:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852 (owner: TrainBranchBot)
[00:09:13] <wikibugs>	 (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174853
[00:09:16] <wikibugs>	 (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174854
[00:09:25] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1174850/6478/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1174850 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:10:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T400854)', diff saved to https://phabricator.wikimedia.org/P80403 and previous config saved to /var/cache/conftool/dbconfig/20250801-001055-ladsgroup.json
[00:11:03] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[00:11:12] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
[00:11:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80404 and previous config saved to /var/cache/conftool/dbconfig/20250801-001119-ladsgroup.json
[00:13:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80405 and previous config saved to /var/cache/conftool/dbconfig/20250801-001345-ladsgroup.json
[00:18:16] <wikibugs>	 SRE, Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11052623 (Nemoralis)
[00:20:44] <wikibugs>	 (CR) Dzahn: "ACK, I will follow-up on this if needed. Just out tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: BryanDavis)
[00:28:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P80406 and previous config saved to /var/cache/conftool/dbconfig/20250801-002852-ladsgroup.json
[00:32:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852 (owner: TrainBranchBot)
[00:37:18] <jinxer-wm>	 FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bbtrp - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[00:40:54] <wikibugs>	 (CR) BCornwall: "Bad timing since big changes are currently under review" [puppet] - https://gerrit.wikimedia.org/r/1174853 (owner: Ncmonitor)
[00:41:06] <wikibugs>	 (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174853 (owner: Ncmonitor)
[00:41:08] <wikibugs>	 (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174854 (owner: Ncmonitor)
[00:44:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P80407 and previous config saved to /var/cache/conftool/dbconfig/20250801-004359-ladsgroup.json
[00:57:13] <wikibugs>	 SRE, Phabricator, Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052679 (AntiCompositeNumber) Yup, it's working now. Thanks!
[00:59:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80408 and previous config saved to /var/cache/conftool/dbconfig/20250801-005907-ladsgroup.json
[00:59:13] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[01:00:43] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:02:37] <wikibugs>	 SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic, Patch-For-Review: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052701 (Bawolff) [Anyways, I adjusted the QuickInstantCo...
[01:10:13] <wikibugs>	 (CR) Krinkle: "So.. it seems there isn't a keyword here for funneling a URL prefix to a fixed destination. It can only override an exact URL 1:1, or funn" [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: Dzahn)
[01:11:27] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 44s)
[01:12:18] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bbtrp - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[01:15:02] <wikibugs>	 (PS1) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:17:01] <icinga-wm>	 RECOVERY - Check unit status of clean-stale-certs on acmechief2002 is OK: OK: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:17:13] <wikibugs>	 (PS2) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:19:31] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: Krinkle)
[01:21:01] <wikibugs>	 (PS3) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:35:02] <wikibugs>	 SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11052710 (aaron) >>! In T328872#10889545, @Ladsgroup wrote: > I understand...
[01:36:48] <jinxer-wm>	 FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-5m57k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[01:42:04] <wikibugs>	 (PS1) BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419)
[01:42:29] <wikibugs>	 (CR) CI reject: [V:-1] acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: BCornwall)
[01:43:12] <wikibugs>	 (PS2) BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419)
[01:44:36] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6480/co" [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: BCornwall)
[01:46:42] <wikibugs>	 (CR) RLazarus: "Please also add httpbb tests, at `production/modules/profile/files/httpbb/appserver/test_redirects.yaml`. You might want to test more than" [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: Krinkle)
[02:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:19] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:12:54] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#11052724 (BCornwall) a:BCornwall→wiki_willy Assigning to @wiki_willy as he's taking over communications for this.
[02:13:30] <wikibugs>	 ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11052738 (BCornwall) a:BCornwall→RobH Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?
[03:03:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:33] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:04:19] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:06:24] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11052755 (Papaul) I login to the Nokia switches in row E to check the transceivers in place 1 transceiver on each switch is showing unspecified> I will have to troubleshoot this when  I...
[03:07:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:07:59] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[03:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:56:48] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-5m57k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[04:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:33] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:06:18] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052789 (Joe) >>! In T400119#11051059, @Alien333 wrote: > Where does UAs like `MediaWiki-JS/1.45.0-wmf.12`, the defaults used by a plain `new mw.Api()` in an on-wiki script,...
[04:10:34] <wikibugs>	 SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052791 (Joe) >>! In T400881#11050371, @Bawolff wrote: > Are you suggesting inc...
[04:12:27] <wikibugs>	 SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052794 (Joe) >>! In T400881#11052701, @Bawolff wrote: > [Anyways, I adjusted t...
[04:25:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:26:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:30:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:31:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:00:10] <wikibugs>	 SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052795 (A_smart_kitten) >>! In T400881#11052791, @Joe wrote: >>>! In T400881#1...
[05:09:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:02] <wikibugs>	 (CR) Tim Starling: [C:+2] Enable sitemaps API [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:12] <wikibugs>	 (Merged) jenkins-bot: Enable sitemaps API [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:39] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]]
[05:14:45] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[05:16:41] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[05:59:03] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Continuing with sync
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0600)
[06:00:16] <wikibugs>	 SRE, Phabricator, Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052828 (Joe) Please note, this solution is temporary: bots working from clouds will break repeatedly if they're not properly identified with...
[06:04:39] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] (duration: 49m 59s)
[06:04:44] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[06:09:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:04] <wikibugs>	 (PS1) Bartosz Wójtowicz: ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351)
[06:39:08] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1174971
[06:49:12] <wikibugs>	 (PS1) Jelto: Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252)
[06:49:37] <wikibugs>	 (CR) CI reject: [V:-1] Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[06:50:48] <wikibugs>	 (PS2) Jelto: Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0700)
[07:07:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:07:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:07:59] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[07:08:45] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968 (tappof) NEW
[07:09:17] <wikibugs>	 sre-alert-triage, serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969 (tappof) NEW
[07:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:09:55] <wikibugs>	 (PS1) Vgutierrez: Revert^2 "acme-chief: Add batch of pay-for-edit domains" [puppet] - https://gerrit.wikimedia.org/r/1174984
[07:14:26] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert^2 "acme-chief: Add batch of pay-for-edit domains" [puppet] - https://gerrit.wikimedia.org/r/1174984 (owner: Vgutierrez)
[07:28:32] <wikibugs>	 (CR) Brouberol: [C:+1] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup)
[07:28:34] <wikibugs>	 (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: Drop flaggedrevs_tracking job [deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup)
[07:29:47] <wikibugs>	 (PS3) Filippo Giunchedi: profile::thanos::recording_rules: add two rules for the EditCheck SLO [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:29:48] <jinxer-wm>	 FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[07:30:02] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM, just reformatted for legibility" [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:30:07] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052931 (TheDJ) >>! In T400119#11052789, @Joe wrote: > Case in point, I can't find any request with that UA in the logs for the past few days. Indeed it's not in the list of...
[07:31:26] <wikibugs>	 (CR) Elukey: [C:+1] ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:32:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:33:46] <wikibugs>	 (PS1) Brouberol: mediawiki-dumps-legacy: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1174990 (https://phabricator.wikimedia.org/T398936)
[07:36:57] <wikibugs>	 (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1174990 (https://phabricator.wikimedia.org/T398936) (owner: Brouberol)
[07:38:20] <wikibugs>	 (CR) Elukey: [C:+2] profile::thanos::recording_rules: add two rules for the EditCheck SLO [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:39:13] <wikibugs>	 (PS1) Vgutierrez: acme-chief: Remove nc domains with DNSSEC enabled [puppet] - https://gerrit.wikimedia.org/r/1174993 (https://phabricator.wikimedia.org/T400731)
[07:40:06] <wikibugs>	 (CR) Vgutierrez: [C:+2] acme-chief: Remove nc domains with DNSSEC enabled [puppet] - https://gerrit.wikimedia.org/r/1174993 (https://phabricator.wikimedia.org/T400731) (owner: Vgutierrez)
[07:41:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:42:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:43:03] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:51:16] <wikibugs>	 (CR) Bartosz Wójtowicz: [C:+2] ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:53:00] <wikibugs>	 (Merged) jenkins-bot: ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:55:45] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:01:56] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:06:29] <logmsgbot>	 !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:07:40] <wikibugs>	 (PS1) Kosta Harlan: UserInfoCard: Add config var for making UIC available [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627)
[08:15:05] <wikibugs>	 (PS1) Kosta Harlan: CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681)
[08:15:49] <wikibugs>	 (PS2) Kosta Harlan: CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681)
[08:17:43] <wikibugs>	 (CR) Mszwarc: [C:+1] CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681) (owner: Kosta Harlan)
[08:17:55] <wikibugs>	 (CR) Clément Goubert: [C:+1] mwmaint: decommission mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/1174753 (https://phabricator.wikimedia.org/T400442) (owner: Jasmine)
[08:19:40] <wikibugs>	 (PS3) Elukey: profile::pyrra::filesystem::slos: add edit-check ratio [puppet] - https://gerrit.wikimedia.org/r/1174749 (https://phabricator.wikimedia.org/T395444)
[08:26:46] <wikibugs>	 (CR) Mszwarc: [C:+1] UserInfoCard: Add config var for making UIC available [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: Kosta Harlan)
[08:31:07] <wikibugs>	 (PS1) Jelto: gitlab: enable nftables throttling again in monitoring mode [puppet] - https://gerrit.wikimedia.org/r/1175043 (https://phabricator.wikimedia.org/T400971)
[08:34:26] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra::filesystem::slos: add edit-check ratio [puppet] - https://gerrit.wikimedia.org/r/1174749 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[08:34:48] <jinxer-wm>	 FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[08:37:43] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: enable nftables throttling again in monitoring mode [puppet] - https://gerrit.wikimedia.org/r/1175043 (https://phabricator.wikimedia.org/T400971) (owner: Jelto)
[08:41:31] <wikibugs>	 (PS7) Ayounsi: Nokia ZTP [puppet] - https://gerrit.wikimedia.org/r/1174725
[08:43:04] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:55:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[09:02:46] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye
[09:09:16] <wikibugs>	 (CR) Elukey: "LGTM, I just left a suggestion for a little code refactor that would help to DRY a bit the code (lemme know if I got it correctly or not)." [cookbooks] - https://gerrit.wikimedia.org/r/1174407 (owner: Ayounsi)
[09:12:58] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053153 (Joe) >>! In T400119#11052931, @TheDJ wrote: >>>! In T400119#11052789, @Joe wrote: >> Case in point, I can't find any request with that UA in the logs for the past fe...
[09:15:08] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053166 (Joe)  To give a bit of context, over the last day we saw: * 62 million valid requests with no user-agent * 24.5 million valid requests with user agent `okhttp/*` * 1...
[09:16:20] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053167 (Alien333) Ok, thanks for the precisions!
[09:18:50] <wikibugs>	 (CR) Elukey: "Added a couple of comments to better understand, lemme know!" [puppet] - https://gerrit.wikimedia.org/r/1174725 (owner: Ayounsi)
[09:32:51] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[09:38:23] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye
[09:38:46] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[09:40:18] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11053227 (MatthewVernon) @Jhancock.wm thanks for doing ms-be2088 soon :) I'm afraid the others will need doing on a rather longer timescale (I'll have t...
[09:44:40] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[09:44:48] <wikibugs>	 SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11053235 (MatthewVernon) "swift-repl" (it's not actually that any more, bu...
[09:49:48] <jinxer-wm>	 FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[09:52:35] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[09:57:06] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:00:29] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11053246 (10elukey) I finally found a way to make the Debian Installer to see the two OS disks, namely using Bookworm:  ` ~ # ls /dev/sd* /dev/sda   /dev/sda1  /dev/sdb...
[10:01:05] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[10:01:10] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:05:51] <wikibugs>	 (03PS3) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:08:42] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11053270 (10elukey) The dashboards are up!  * Rolling window: [[ https://slo.wikimedia.org/objectives?expr=%7B__name__=%22edit-check...
[10:10:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-1] "Overall LGTM, with one detail of the logic in the implementation that doesn't convince me." [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[10:14:40] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[10:18:50] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] python: Include virtualenv packages in python base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy)
[10:18:51] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] python: Include virtualenv packages in python base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy)
[10:18:56] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[10:26:28] <wikibugs>	 (03PS4) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:26:48] <wikibugs>	 (03CR) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[10:27:20] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] "`python3` images rebuilt for `bullseye` and `bookworm`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy)
[10:29:03] <wikibugs>	 (03PS5) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:30:05] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[10:30:49] <wikibugs>	 (03CR) 10Vgutierrez: "syntax has been validated with `operations/puppet/modules/profile/files/cache/haproxy/tests$ ./docker_run.sh cp6016.drmrs.wmnet 1175056`" [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[10:33:30] <wikibugs>	 (03PS6) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:38:53] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[10:43:05] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053322 (10Clement_Goubert) This host was set aside for `mw-experimental` work by @jijiki, I'll silence the alert for a month.
[10:45:41] <wikibugs>	 (03PS1) 10STran: Use tempaccounts.dblist to enable temporary accounts for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672)
[10:49:57] <wikibugs>	 (03CR) 10STran: "I wasn't sure if we wanted to use a dblist as the canonical list so I've split the difference unfortunately and named the dblist generical" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[10:50:36] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] deployment_server: define opensearch-test kubeconfigs in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1174720 (https://phabricator.wikimedia.org/T400898) (owner: 10Brouberol)
[10:50:50] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] dse-k8s-eqaid: define an opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174721 (https://phabricator.wikimedia.org/T400898) (owner: 10Brouberol)
[10:53:52] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[10:54:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80412 and previous config saved to /var/cache/conftool/dbconfig/20250801-105400-ladsgroup.json
[10:54:04] <wikibugs>	 (03CR) 10Ladsgroup: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup)
[10:54:06] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[10:54:29] <wikibugs>	 (03PS10) 10Tiziano Fogli: nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446)
[10:54:29] <wikibugs>	 (03CR) 10Tiziano Fogli: "Sample rules generated on pontoon: https://pastebin.com/ZjDS2Tnd." [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[10:54:48] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-jq58c - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[10:56:32] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[10:56:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80413 and previous config saved to /var/cache/conftool/dbconfig/20250801-105631-ladsgroup.json
[10:58:02] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174444 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli)
[10:58:12] <wikibugs>	 (03CR) 10FNegri: wikireplicas scripts: setup pytest, add first test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[10:58:29] <wikibugs>	 (03Abandoned) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri)
[10:59:46] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174444 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli)
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0700)
[11:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T1100). nyaa~
[11:00:54] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053372 (10jijiki) 05Open→03Stalled sorry folks, host's number is up for retirement, my bad. tx @Clement_Goubert
[11:01:09] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053380 (10jijiki) p:05Triage→03Low
[11:01:15] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[11:07:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:07:59] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[11:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:09:31] <wikibugs>	 (03CR) 10Harroyo-wmf: [C:03+1] Use tempaccounts.dblist to enable temporary accounts for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[11:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80414 and previous config saved to /var/cache/conftool/dbconfig/20250801-111139-ladsgroup.json
[11:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:18:40] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[11:21:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1051.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1123.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, 
[11:21:18] <icinga-wm>	 -worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1313.e
[11:21:18] <icinga-wm>	 et, wikikube-worker1056.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1 https://wikitech.wikimedia.org/wiki/PyBal
[11:21:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, 
[11:21:20] <icinga-wm>	 -worker1050.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1119.e
[11:21:20] <icinga-wm>	 et, wikikube-worker1289.eqiad.wmnet, wikikube-worker1135.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1002.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1 https://wikitech.wikimedia.org/wiki/PyBal
[11:21:28] <hnowlan>	 erm
[11:21:35] <hnowlan>	 that's me, looking 
[11:22:04] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:22] <effie>	 hnowlan: I will reroute the call to your phone then 
[11:22:26] <godog>	 ack hnowlan 
[11:22:30] * vgutierrez orders a t-shirt
[11:22:47] <hnowlan>	 there's a bad change deployed for it cc effie 
[11:23:05] <effie>	 I have not deployed the change yet to prod
[11:23:21] <effie>	 ah you did?
[11:23:27] <hnowlan>	 I roll restarted
[11:24:27] <wikibugs>	 (03PS1) 10Hnowlan: Revert "thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175077
[11:24:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[11:24:56] <hnowlan>	 shouldn't have been merged on a friday probably 
[11:25:58] <hnowlan>	 it's rolled back, will hopefully resolve
[11:26:45] <hnowlan>	 it's recovering
[11:26:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80415 and previous config saved to /var/cache/conftool/dbconfig/20250801-112647-ladsgroup.json
[11:26:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:27:14] <effie>	 it is still quite early in the day and I wanted to test on staging
[11:27:18] <hnowlan>	 apologies oncallers 
[11:27:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[11:27:44] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[11:28:13] <hnowlan>	 we should toggle on staging if we're merging on a friday 
[11:28:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:28:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:32:12] <godog>	 hnowlan: all good, no worries
[11:35:10] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:37:18] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:38:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:40:20] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:00] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:08] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:41:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80417 and previous config saved to /var/cache/conftool/dbconfig/20250801-114155-ladsgroup.json
[11:41:59] <wikibugs>	 (03CR) 10Kosta Harlan: "IMO it would be less confusing to include all the wikis we've already deployed to in the new dblist. If we do that, during deployment we s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[11:42:07] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[11:42:13] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[11:42:31] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:42:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80418 and previous config saved to /var/cache/conftool/dbconfig/20250801-114238-ladsgroup.json
[11:45:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80419 and previous config saved to /var/cache/conftool/dbconfig/20250801-114511-ladsgroup.json
[11:49:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[12:00:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80420 and previous config saved to /var/cache/conftool/dbconfig/20250801-120019-ladsgroup.json
[12:01:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087
[12:03:49] <wikibugs>	 (03PS8) 10Ayounsi: sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407
[12:03:58] <wikibugs>	 (03CR) 10Ayounsi: sre.network.tls: add Nokia SR-Linux support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi)
[12:07:33] <wikibugs>	 (03CR) 10Ayounsi: "thanks, reply inline." [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi)
[12:09:41] <wikibugs>	 (03CR) 10Dreamy Jazz: "+1. I would prefer that all the DBs are in the list, in case at a later stage we use the dblist for something else (e.g. running maintenan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[12:09:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Discrepencies with cableid & ports on some msw in c/d  <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159#11053589 (10Jclark-ctr) 05Open→03Resolved
[12:12:56] <wikibugs>	 (03PS2) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087
[12:14:04] <wikibugs>	 (03PS3) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087
[12:15:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80421 and previous config saved to /var/cache/conftool/dbconfig/20250801-121526-ladsgroup.json
[12:19:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli)
[12:19:50] <wikibugs>	 (03Abandoned) 10Hnowlan: Revert "thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175077 (owner: 10Hnowlan)
[12:23:02] <wikibugs>	 (03PS1) 10Ladsgroup: recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987)
[12:23:10] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987) (owner: 10Ladsgroup)
[12:24:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli)
[12:26:40] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli)
[12:26:52] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Nokia ZTP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi)
[12:27:54] <wikibugs>	 (03CR) 10Elukey: [C:03+1] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi)
[12:30:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80422 and previous config saved to /var/cache/conftool/dbconfig/20250801-123034-ladsgroup.json
[12:30:38] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[12:30:50] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[12:30:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80423 and previous config saved to /var/cache/conftool/dbconfig/20250801-123057-ladsgroup.json
[12:37:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053640 (10Jclark-ctr) ` jclark@ssw1-f1-eqiad> show chassis environment Class Item                           Status     Measurement Power FPC 0 Power Supply 0           OK         41 degrees C / 10...
[12:37:38] <wikibugs>	 (03Merged) 10jenkins-bot: recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987) (owner: 10Ladsgroup)
[12:39:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80424 and previous config saved to /var/cache/conftool/dbconfig/20250801-123928-ladsgroup.json
[12:39:32] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[12:40:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053660 (10ayounsi) Thanks, weird that the alarms are still active :( Can you follow up with JTAC ?
[12:43:13] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi)
[12:46:20] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238)
[12:46:22] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]]
[12:46:25] <stashbot>	 T400987: Regression: Category member counts broken in German Wikipedia  - https://phabricator.wikimedia.org/T400987
[12:46:51] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:47:06] <wikibugs>	 (03PS8) 10Ayounsi: Nokia ZTP [puppet] - 10https://gerrit.wikimedia.org/r/1174725
[12:47:11] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:47:12] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175100
[12:48:24] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[12:48:30] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:49:31] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[12:49:43] <wikibugs>	 (03Merged) 10jenkins-bot: sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi)
[12:50:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] "Thank you! It was good timing: I deployed it 10 minutes the dumps v1 DAGs kicked in :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup)
[12:50:20] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Nokia ZTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi)
[12:51:05] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11053691 (10CDanis)
[12:52:45] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11053706 (10CDanis)
[12:53:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[12:53:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 5400
[12:54:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80426 and previous config saved to /var/cache/conftool/dbconfig/20250801-125436-ladsgroup.json
[12:54:58] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]] (duration: 08m 36s)
[12:55:01] <stashbot>	 T400987: Regression: Category member counts broken in German Wikipedia  - https://phabricator.wikimedia.org/T400987
[12:55:30] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5400
[12:55:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053726 (10Jclark-ctr) I do not believe I have login access to JTAC, but I will coordinate with RobH when he returns to get access. I made some adjustments to the airflow in rack F1. Changes are us...
[12:55:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez)
[12:56:10] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 37662
[12:56:44] <logmsgbot>	 !log jiji@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[12:56:54] <logmsgbot>	 !log jiji@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[12:57:06] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37662
[12:57:10] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263252
[12:57:31] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263252
[12:57:37] <Amir1>	 !log re-running recountCategories.php on all wikis except s4 and s1 (T400987)
[12:57:37] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 274685
[12:57:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:05] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274685
[12:59:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053732 (10ayounsi) Thanks. Nothing out of the ordinary in the logs.
[13:04:10] <wikibugs>	 06SRE, 07Epic, 05Goal: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527#11053740 (10CDanis)
[13:04:57] <wikibugs>	 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053743 (10Jelto)
[13:05:43] <wikibugs>	 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053744 (10Jelto)
[13:09:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80427 and previous config saved to /var/cache/conftool/dbconfig/20250801-130943-ladsgroup.json
[13:24:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80428 and previous config saved to /var/cache/conftool/dbconfig/20250801-132451-ladsgroup.json
[13:24:56] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[13:25:07] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[13:25:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80429 and previous config saved to /var/cache/conftool/dbconfig/20250801-132514-ladsgroup.json
[13:27:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80430 and previous config saved to /var/cache/conftool/dbconfig/20250801-132745-ladsgroup.json
[13:33:50] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.provision for device lsw1-e2-codfw.mgmt.codfw.wmnet
[13:33:52] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[13:37:37] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e2-codfw - ayounsi@cumin1003"
[13:37:42] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e2-codfw - ayounsi@cumin1003"
[13:37:42] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:37:48] <wikibugs>	 (03PS2) 10STran: Use tempaccounts.dblist to manage rollout wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672)
[13:37:48] <wikibugs>	 (03PS1) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672)
[13:38:23] <wikibugs>	 (03PS1) 10Hashar: gerrit: add daemons ssh host key to known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401)
[13:38:59] <wikibugs>	 (03CR) 10STran: "Great, thanks! In that case I think I would prefer to split this up. iirc rollout is scheduled for a tuesday so I could feasibly test the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[13:39:48] <wikibugs>	 (03PS2) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672)
[13:40:24] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "Seems reasonable IMO." [puppet] - 10https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: 10BryanDavis)
[13:41:47] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "I have tried it on `gerrit1003` by disabling the Puppet agent and manually amending the file, that fixed the host key issue 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar)
[13:42:08] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119)
[13:42:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80431 and previous config saved to /var/cache/conftool/dbconfig/20250801-134253-ladsgroup.json
[13:47:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:58:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80432 and previous config saved to /var/cache/conftool/dbconfig/20250801-135800-ladsgroup.json
[14:05:42] <elukey>	 !log upgrade redis-server and tools package on idm nodes for security upgrades
[14:05:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:53] <wikibugs>	 (03PS1) 10MVernon: thanos: drain thanos-be1005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1175120 (https://phabricator.wikimedia.org/T400877)
[14:13:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80433 and previous config saved to /var/cache/conftool/dbconfig/20250801-141308-ladsgroup.json
[14:13:12] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[14:13:13] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[14:13:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80434 and previous config saved to /var/cache/conftool/dbconfig/20250801-141320-ladsgroup.json
[14:13:58] <wikibugs>	 (03PS1) 10Krinkle: In sitemap responses set CC: public [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175121 (https://phabricator.wikimedia.org/T400023)
[14:15:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80435 and previous config saved to /var/cache/conftool/dbconfig/20250801-141553-ladsgroup.json
[14:16:05] <wikibugs>	 (03PS1) 10Hashar: gerrit: replica renames as "gerrit2" application user [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693)
[14:16:22] <wikibugs>	 (03PS1) 10Clare Ming: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123
[14:16:39] <wikibugs>	 (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1174701 (owner: 10L10n-bot)
[14:18:28] <wikibugs>	 (03CR) 10Eevans: "> Sorry for the long delay (and in future, feel free to chase if it looks like I've forgotten an outstanding review)." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 (owner: 10Eevans)
[14:18:29] <wikibugs>	 (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[14:18:46] <wikibugs>	 (03PS2) 10Eevans: convenience script to cleanup Cassandra instance state [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924
[14:18:48] <wikibugs>	 (03PS2) 10Clare Ming: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123
[14:30:20] <cjming>	 I need an emergency deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175123 -- context is https://wikimedia.slack.com/archives/C05ERLBF0E7/p1753993920317559, are SRE ok with a deployment? (cc: thcipriani). I can deploy.
[14:31:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80436 and previous config saved to /var/cache/conftool/dbconfig/20250801-143104-ladsgroup.json
[14:32:34] <thcipriani>	 cjming: deploy for https://gerrit.wikimedia.org/r/1175123 is fine by me, sukhe or denisse Friday emergency deploy fine to do now? (pinged as SREs on call)
[14:32:36] <sukhe>	 cjming: no concerns from SRE as such (with my on-call hat on). not that I fully understand the change but I don't think it should cause any issues. thanks for checking.
[14:32:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:32:55] <sukhe>	 thcipriani: ^
[14:33:07] <thcipriani>	 <3
[14:33:16] <cjming>	 thanks sukhe, thcipriani - i will proceed then 
[14:35:58] <denisse>	 No concerns from my side.
[14:36:14] <cjming>	 ty denisse
[14:37:10] <wikibugs>	 (03CR) 10ArielGlenn: text-frontend: enforcement of UA policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto)
[14:37:13] <cjming>	 is it problematic that the last 3 backports errored out?
[14:39:11] <cjming>	 going ahead with deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175123
[14:39:14] <thcipriani>	 cjming: I believe dancy fixed that late yesterday
[14:39:23] <cjming>	 cool - gtk
[14:39:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming)
[14:40:27] <thcipriani>	 all errored on the test server, so if we run into problems there, there may be investigation needed, but there was a command line deploy (plus a scap deploy) after the errors you see in spiderpig
[14:40:52] <cjming>	 ack
[14:40:53] <thcipriani>	 (...where "scap deploy" means deploying a new scap version...)
[14:41:05] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+1] Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming)
[14:44:07] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming)
[14:44:20] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]]
[14:46:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80437 and previous config saved to /var/cache/conftool/dbconfig/20250801-144611-ladsgroup.json
[14:46:14] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:47:49] <wikibugs>	 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054064 (10Novem_Linguae) 05Open→03Resolved a:03Novem_Linguae Marking as resolved. Thanks!
[14:48:02] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with sync
[14:49:24] <wikibugs>	 (03CR) 10Elukey: "Preliminary pass! Hope what I wrote makes sense!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[14:53:10] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]] (duration: 08m 50s)
[14:57:34] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Cool, I think it's worth having this available." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 (owner: 10Eevans)
[15:01:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80438 and previous config saved to /var/cache/conftool/dbconfig/20250801-150119-ladsgroup.json
[15:01:23] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[15:01:35] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[15:02:21] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[15:02:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80439 and previous config saved to /var/cache/conftool/dbconfig/20250801-150228-ladsgroup.json
[15:03:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2220 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:03:51] <Amir1>	 taking a look 
[15:03:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:04:25] <sukhe>	 !incidents
[15:04:25] <sirenbot>	 6535 (UNACKED)  db2220 (paged)/MariaDB Replica Lag: s7 (paged)
[15:04:26] <sirenbot>	 6534 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:04:29] <sukhe>	 !ack 6535
[15:04:29] <sirenbot>	 6535 (ACKED)  db2220 (paged)/MariaDB Replica Lag: s7 (paged)
[15:04:31] <sukhe>	 Amir1: <3
[15:05:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80440 and previous config saved to /var/cache/conftool/dbconfig/20250801-150501-ladsgroup.json
[15:06:22] <Amir1>	 Slave_SQL_Running_State: Waiting for semi-sync ACK from slave
[15:06:44] <denisse>	 Here as well.
[15:07:08] <Amir1>	 now       Slave_SQL_Running_State: init for update
[15:08:08] <Amir1>	 I think this is heartbeat
[15:08:19] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[15:08:47] <sukhe>	 db2220 is primary right?
[15:08:53] <Amir1>	 codfw
[15:09:00] <Amir1>	 so not anything super major
[15:09:05] <sukhe>	 yep
[15:09:26] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:44] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:04] <Amir1>	 the replication is clearly moving forward with no issues
[15:10:30] <sukhe>	 ok that's good at least. the resolve hasn't come in yet but as long as it is moving.
[15:11:03] <Amir1>	 I mean, the replication is working but all the systems are showing lag
[15:11:18] <Amir1>	 which usually means heartbeat needs a kick but that didn't fix it
[15:11:30] <revi>	 and the lag time is… going up
[15:12:11] <revi>	 10 minutes ago I saw '256 seconds', 421 seconds 3 minutes ago, now 502.
[15:12:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e2-codfw - ayounsi@cumin1003"
[15:12:25] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e2-codfw - ayounsi@cumin1003"
[15:12:25] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:12:26] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-e2-codfw.mgmt.codfw.wmnet
[15:13:40] <denisse>	 Replication lag: https://grafana.wikimedia.org/goto/nCv5uLQHg?orgId=1
[15:13:42] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:14:20] <Amir1>	 I stopped a write script to see if it's just a write load problem
[15:15:04] <Amir1>	 it seems it was the load
[15:15:13] <Amir1>	 it's not growing that fast anymore
[15:15:33] <Amir1>	 stuck at 9m3s https://orchestrator.wikimedia.org/web/cluster/alias/s7
[15:16:04] <Amir1>	 so in 9 minutes it should start going down once it actually processes all the heavy writes
[15:17:00] <sukhe>	 ok. thanks!
[15:17:10] <revi>	 yeah, stuck at 545 secs
[15:17:40] <revi>	 which is… 9min 5sec
[15:17:46] <wikibugs>	 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054134 (10Pppery) a:05Novem_Linguae→03None
[15:18:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:19:28] <denisse>	 The replication lag is already going down.
[15:19:41] <revi>	 yeah, going down progressively
[15:19:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:20:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80441 and previous config saved to /var/cache/conftool/dbconfig/20250801-152009-ladsgroup.json
[15:21:31] <revi>	 I actually (originally) thought whole prod was going crazy, but Commons and enwiki were fine, so :-p
[15:22:34] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:25:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[15:26:23] <Amir1>	 now it should go down fast
[15:30:28] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:35:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80442 and previous config saved to /var/cache/conftool/dbconfig/20250801-153516-ladsgroup.json
[15:38:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2220 is OK: OK slave_sql_lag Replication lag: 8.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:38:50] <sukhe>	 :D
[15:39:06] <denisse>	 Page resolved on SplunkOnCall.
[15:45:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[15:50:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80444 and previous config saved to /var/cache/conftool/dbconfig/20250801-155024-ladsgroup.json
[15:50:28] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[15:50:41] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[15:51:20] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[15:52:05] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance
[15:52:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80445 and previous config saved to /var/cache/conftool/dbconfig/20250801-155212-ladsgroup.json
[15:55:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80446 and previous config saved to /var/cache/conftool/dbconfig/20250801-155548-ladsgroup.json
[15:55:52] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[15:57:19] <wikibugs>	 (03PS1) 10Jly: values-security-landing-page.yaml: bump image version to 2025-08-01-155110 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852)
[16:00:26] <wikibugs>	 (03CR) 10SBassett: [C:03+2] "LGTM to me and matches up with https://gitlab.wikimedia.org/repos/sre/miscweb/security-landing-page/-/jobs/577268#L81" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852) (owner: 10Jly)
[16:00:57] <wikibugs>	 (03PS1) 10Ayounsi: Nokia ZTP: small fixes and better python script [puppet] - 10https://gerrit.wikimedia.org/r/1175141
[16:02:16] <wikibugs>	 (03Merged) 10jenkins-bot: values-security-landing-page.yaml: bump image version to 2025-08-01-155110 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852) (owner: 10Jly)
[16:04:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:07:00] <logmsgbot>	 !log jly@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:07:13] <logmsgbot>	 !log jly@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:07:20] <logmsgbot>	 !log jly@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:07:40] <logmsgbot>	 !log jly@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:07:48] <logmsgbot>	 !log jly@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:08:04] <logmsgbot>	 !log jly@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:08:07] <logmsgbot>	 !log jly@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:08:11] <logmsgbot>	 !log jly@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:08:14] <logmsgbot>	 !log jly@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:08:16] <logmsgbot>	 !log jly@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:10:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80447 and previous config saved to /var/cache/conftool/dbconfig/20250801-161056-ladsgroup.json
[16:11:58] <wikibugs>	 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054317 (10Joe) >>! In T400881#11052795, @A_smart_kitten wrote: >>>! In T400881#1...
[16:12:59] <wikibugs>	 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054327 (10Joe)
[16:26:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80448 and previous config saved to /var/cache/conftool/dbconfig/20250801-162603-ladsgroup.json
[16:33:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174583 (https://phabricator.wikimedia.org/T400281) (owner: 10Theprotonade)
[16:36:10] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486)
[16:37:13] <wikibugs>	 (03CR) 10Clare Ming: [C:03+1] Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia)
[16:39:16] <cjming>	 sorry one more -- I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1175144 -- context is T399486, are SRE ok with a deployment? (cc: thcipriani). I can deploy.
[16:39:17] <stashbot>	 T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486
[16:40:06] <cjming>	 just as a forewarning - there might be one more after this (revert of the first thing I deployed earlier)
[16:41:09] <cjming>	 sukhe? denisse?
[16:41:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80449 and previous config saved to /var/cache/conftool/dbconfig/20250801-164111-ladsgroup.json
[16:41:15] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[16:41:28] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance
[16:41:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80450 and previous config saved to /var/cache/conftool/dbconfig/20250801-164134-ladsgroup.json
[16:43:43] <sukhe>	 cjming: is this urgent for Friday? (asking)
[16:44:36] <cjming>	 sukhe: yes - it will inform us whether we need to roll back something else that's riding the train next week
[16:45:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80451 and previous config saved to /var/cache/conftool/dbconfig/20250801-164510-ladsgroup.json
[16:45:14] <cjming>	 sorry for the friday drama
[16:46:39] * thcipriani reads
[16:46:43] <sukhe>	 please check with thcipriani too
[16:48:24] <cjming>	 fwiw dr0ptp4kt is advising on all these deployments as well - so i'm not just going rogue
[16:49:24] <wikibugs>	 (03PS1) 10BCornwall: Revert "acme-chief: Remove nc domains with DNSSEC enabled" [puppet] - 10https://gerrit.wikimedia.org/r/1175146
[16:51:56] <sukhe>	 cjming: yeah of course no worries about that :) 
[16:52:20] <sukhe>	 https://wikitech.wikimedia.org/wiki/Deployments/Emergencies dictates that SRE needs releng to be informed as well
[16:52:31] <sukhe>	 if it is urgent and that has been discussed, that's fine by me at least, as on-call, but that's just SRE
[16:52:35] <cjming>	 sukhe: ack - thanks
[16:54:05] <cjming>	 thcipriani: we think this backport in WME will fix the event produce rate cliff drop and if it does, then we'll revert the config revert i did earlier if all this is ok with you
[16:55:21] <thcipriani>	 cjming: what happens if this doesn't fix it?
[16:58:03] <thcipriani>	 for clarity, did the revert from earlier get you to a stable place, or no?
[16:58:23] <cjming>	 thcipriani: then we'll go back to the drawing board and examine all commits in the last train cut -- but this backport was merged in 1.45.0-wmf.11 but never got into master
[16:58:41] <cjming>	 the revert from earlier didn't change anything - graphs stayed same
[16:59:03] <cjming>	 it further delays our ability to look at retention metrics for future a/b tests
[16:59:51] <sukhe>	 cjming: I think the verbiage on emergency deploys can be better in the text and/or the intention
[17:00:15] <sukhe>	 at least from SRE's side, an emergency deploy means that something needs to be deployed on Friday if it will lead to an outage over the weekend
[17:00:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80452 and previous config saved to /var/cache/conftool/dbconfig/20250801-170018-ladsgroup.json
[17:00:26] <sukhe>	 or if it fixes something that's immediately broken that can't be carried over the weekend
[17:00:38] <sukhe>	 would this then fit that understanding, since you know more about this than at least I do?
[17:10:29] * thcipriani still catching up on context
[17:10:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add keytabs for new an-druid100[67] hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene)
[17:15:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80453 and previous config saved to /var/cache/conftool/dbconfig/20250801-171525-ladsgroup.json
[17:22:49] <thcipriani>	 cjming: sukhe alright, I'm up-to-speed on context, I'm good with this deploy (and revert). seems like this deploy (plus revert of previous) should save some scrambling for folks. This deploy should put us in a stable spot for this, even if it doesn't 100% have the desired effect and should be safe to leave over the weekend (as I now understand it).
[17:23:39] <cjming>	 thcipriani: tysm
[17:24:40] <thcipriani>	 plus, seems small and safe. More context for lurkers: backport made it to wmf.11 but not wmf.12 which caused a noticeable drop in event logs following group2 deploy. These deploys should align things.
[17:25:14] <wikibugs>	 (03PS1) 10CDobbins: . [puppet] - 10https://gerrit.wikimedia.org/r/1175151
[17:25:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia)
[17:26:08] <sukhe>	 thcipriani: cjming: cool, +1 from SRE too then
[17:26:12] <wikibugs>	 (03PS2) 10CDobbins: admin: remove access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374)
[17:26:16] <cjming>	 \o/
[17:26:55] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152
[17:27:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia)
[17:27:27] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]]
[17:27:30] <stashbot>	 T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486
[17:29:22] <logmsgbot>	 !log cjming@deploy1003 ksarabia, cjming: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:30:08] <logmsgbot>	 !log cjming@deploy1003 ksarabia, cjming: Continuing with sync
[17:30:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80454 and previous config saved to /var/cache/conftool/dbconfig/20250801-173033-ladsgroup.json
[17:30:36] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[17:30:49] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance
[17:30:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80455 and previous config saved to /var/cache/conftool/dbconfig/20250801-173056-ladsgroup.json
[17:34:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80456 and previous config saved to /var/cache/conftool/dbconfig/20250801-173431-ladsgroup.json
[17:35:34] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]] (duration: 08m 06s)
[17:35:37] <stashbot>	 T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486
[17:39:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:39:30] <sukhe>	 hehe
[17:40:08] <sukhe>	 zooming out shows similar spikes so I guess we will see how they ride out
[17:42:58] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] admin: remove access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins)
[17:43:39] <wikibugs>	 (03PS1) 10Clare Ming: Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155
[17:44:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:46:24] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11054529 (10CDobbins) 05Open→03In progress p:05Triage→03Medium a:03CDobbins
[17:48:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054544 (10CDobbins)
[17:49:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80457 and previous config saved to /var/cache/conftool/dbconfig/20250801-174939-ladsgroup.json
[17:50:45] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054546 (10CDobbins)
[17:56:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054573 (10CDobbins)
[18:03:38] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+1] Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming)
[18:04:38] <cjming>	 per approvals above - deploying one last thing and that will be it from us for this eventful friday
[18:04:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80458 and previous config saved to /var/cache/conftool/dbconfig/20250801-180447-ladsgroup.json
[18:04:49] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054582 (10CDobbins) @KFrancis There's a discrepancy in the email address on the NDA sheet (Parmarsiddharth2parmar@gmail.com) and in this task (siddharthvp@gmail.com). [[ https...
[18:04:56] <sukhe>	 cjming: gl :) 
[18:06:01] <cjming>	 ty :)
[18:06:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming)
[18:06:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming)
[18:07:10] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]]
[18:09:04] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:10:58] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with sync
[18:12:05] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[18:12:53] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "Understanding that this needs to wait, but I'm giving my plus +1 for when it is ready to move forward" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[18:13:05] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[18:15:20] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11054588 (10DavidBrooks) >>! In T400119#11053166, @Joe wrote: > There won't be adding some magical regexes trying to ban any single case. We will make the...
[18:16:23] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]] (duration: 09m 13s)
[18:16:53] <wikibugs>	 (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1175157 (https://phabricator.wikimedia.org/T399604)
[18:17:55] <wikibugs>	 (03PS1) 10CDobbins: admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953)
[18:18:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[18:19:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054604 (10SD0001) @KFrancis Sounds like an error in the sheet. The NDA doc I signed bears the email siddharthvp@gmail.com. I don't recognize that other email - seems to belong...
[18:19:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80459 and previous config saved to /var/cache/conftool/dbconfig/20250801-181954-ladsgroup.json
[18:19:58] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[18:20:11] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance
[18:20:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80460 and previous config saved to /var/cache/conftool/dbconfig/20250801-182017-ladsgroup.json
[18:20:48] <jinxer-wm>	 FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-wg45t - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[18:22:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80461 and previous config saved to /var/cache/conftool/dbconfig/20250801-182254-ladsgroup.json
[18:27:14] <wikibugs>	 (03PS2) 10CDobbins: admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953)
[18:27:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[18:29:37] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Use tempaccounts.dblist to manage rollout wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[18:31:03] <wikibugs>	 (03PS3) 10CDobbins: admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953)
[18:32:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, adding @jborun@wikimedia.org for their review as well." [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[18:33:04] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "(Wait for Joanna's review before merging, I would say.)" [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[18:37:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054679 (10CDobbins)
[18:38:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80462 and previous config saved to /var/cache/conftool/dbconfig/20250801-183802-ladsgroup.json
[18:39:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11054686 (10CDobbins) 05Open→03Stalled p:05Triage→03Medium
[18:53:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80463 and previous config saved to /var/cache/conftool/dbconfig/20250801-185310-ladsgroup.json
[18:54:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:55:27] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11054728 (10BCornwall) a:05BCornwall→03None
[19:00:34] <wikibugs>	 (03PS1) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405)
[19:01:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[19:02:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11054736 (10VRiley-WMF) This PDU has been swapped
[19:06:34] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[19:08:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80464 and previous config saved to /var/cache/conftool/dbconfig/20250801-190817-ladsgroup.json
[19:08:21] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[19:08:22] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance
[19:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:09:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:10:09] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2211.codfw.wmnet with reason: Maintenance
[19:10:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80465 and previous config saved to /var/cache/conftool/dbconfig/20250801-191016-ladsgroup.json
[19:13:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80466 and previous config saved to /var/cache/conftool/dbconfig/20250801-191354-ladsgroup.json
[19:13:58] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[19:14:01] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11054761 (10CDobbins)
[19:22:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80467 and previous config saved to /var/cache/conftool/dbconfig/20250801-192901-ladsgroup.json
[19:32:14] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Restore Taavi's analytics-privatedata-users membership - https://phabricator.wikimedia.org/T400900#11054796 (10CDobbins)
[19:32:43] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] admin: add taavi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1174761 (https://phabricator.wikimedia.org/T400900) (owner: 10CDobbins)
[19:33:20] <icinga-wm>	 PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 159025 MB (4% inode=99%): /var/lib/hadoop/data/e 151560 MB (4% inode=99%): /var/lib/hadoop/data/m 159262 MB (4% inode=99%): /var/lib/hadoop/data/k 154811 MB (4% inode=99%): /var/lib/hadoop/data/f 154498 MB (4% inode=99%): /var/lib/hadoop/data/g 159939 MB (4% inode=99%): /var/lib/hadoop/data/h 160639 MB (4% inode=99%): /var/lib/hadoop/data
[19:33:20] <icinga-wm>	 0 MB (4% inode=99%): /var/lib/hadoop/data/j 154489 MB (4% inode=99%): /var/lib/hadoop/data/c 149060 MB (3% inode=99%): /var/lib/hadoop/data/l 153891 MB (4% inode=99%): /var/lib/hadoop/data/b 159936 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[19:34:36] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11054801 (10CDobbins)
[19:39:32] <wikibugs>	 (03PS1) 10BCornwall: Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164
[19:40:15] <wikibugs>	 (03CR) 10BCornwall: [C:04-2] "Needs I765c6b00b15010822b200491209eb474f2034c40 before merging" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (owner: 10BCornwall)
[19:40:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: mw-experimental-mediawiki-image-update.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:40:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:44:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80468 and previous config saved to /var/cache/conftool/dbconfig/20250801-194409-ladsgroup.json
[19:44:15] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Restore Taavi's analytics-privatedata-users membership - https://phabricator.wikimedia.org/T400900#11054814 (10CDobbins) 05In progress→03Resolved
[19:44:36] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins)
[19:46:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054816 (10CDobbins)
[19:46:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054817 (10CDobbins) 05Open→03Resolved p:05Triage→03Medium a:05joanna_borun→03CDobbins
[19:55:00] <wikibugs>	 06SRE, 10DNS, 06Traffic: Verify wikimediafoundation.org for Visual Studio Marketplace. - https://phabricator.wikimedia.org/T400089#11054828 (10BCornwall) 05In progress→03Resolved Setting as resolved. Please re-open if it hasn't worked or if anything else is needed. Thanks!
[19:59:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80470 and previous config saved to /var/cache/conftool/dbconfig/20250801-195917-ladsgroup.json
[19:59:21] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[19:59:33] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2223.codfw.wmnet with reason: Maintenance
[19:59:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80471 and previous config saved to /var/cache/conftool/dbconfig/20250801-195940-ladsgroup.json
[19:59:52] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806)
[20:02:14] <wikibugs>	 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054847 (10Bawolff) InstantCommons does work in a way that is pretty easy to abus...
[20:03:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80472 and previous config saved to /var/cache/conftool/dbconfig/20250801-200317-ladsgroup.json
[20:06:21] <wikibugs>	 (03PS2) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405)
[20:06:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[20:14:17] <wikibugs>	 (03PS1) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405)
[20:14:40] <wikibugs>	 (03Abandoned) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[20:14:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[20:18:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80473 and previous config saved to /var/cache/conftool/dbconfig/20250801-201825-ladsgroup.json
[20:25:04] <wikibugs>	 (03PS2) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405)
[20:25:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[20:25:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:33:13] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946)
[20:33:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80474 and previous config saved to /var/cache/conftool/dbconfig/20250801-203332-ladsgroup.json
[20:35:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[20:40:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: mw-experimental-mediawiki-image-update.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:48] <jinxer-wm>	 RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-wg45t - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[20:48:13] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946)
[20:48:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80475 and previous config saved to /var/cache/conftool/dbconfig/20250801-204840-ladsgroup.json
[20:48:44] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[20:48:45] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[20:48:56] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2228.codfw.wmnet with reason: Maintenance
[20:49:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80476 and previous config saved to /var/cache/conftool/dbconfig/20250801-204903-ladsgroup.json
[20:50:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[20:52:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80477 and previous config saved to /var/cache/conftool/dbconfig/20250801-205239-ladsgroup.json
[20:53:34] <wikibugs>	 (03PS3) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405)
[21:05:01] <wikibugs>	 (03PS3) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946)
[21:07:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80478 and previous config saved to /var/cache/conftool/dbconfig/20250801-210746-ladsgroup.json
[21:22:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80479 and previous config saved to /var/cache/conftool/dbconfig/20250801-212254-ladsgroup.json
[21:32:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[21:32:44] <jinxer-wm>	 Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[21:38:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80480 and previous config saved to /var/cache/conftool/dbconfig/20250801-213802-ladsgroup.json
[21:38:05] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[22:45:29] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins)
[23:09:29] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:10:08] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:22:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:19] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182
[23:38:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182 (owner: 10TrainBranchBot)
[23:51:33] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182 (owner: 10TrainBranchBot)
[23:57:30] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714/ using stat1009.eqiad.wmnet)