[00:07:00] PROBLEM - Check unit status of clean-stale-certs on acmechief2002 is CRITICAL: CRITICAL: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:07:03] (PS1) Dzahn: add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938)
[00:07:23] (PS2) Dzahn: add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938)
[00:07:36] (CR) Dzahn: [V:+2 C:+2] add passwords::zuul::gerrit with fake password [labs/private] - https://gerrit.wikimedia.org/r/1174851 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:08:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852
[00:08:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852 (owner: TrainBranchBot)
[00:09:13] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174853
[00:09:16] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174854
[00:09:25] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1174850/6478/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1174850 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[00:10:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T400854)', diff saved to https://phabricator.wikimedia.org/P80403 and previous config saved to /var/cache/conftool/dbconfig/20250801-001055-ladsgroup.json
[00:11:03] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[00:11:12] !log
ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
[00:11:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80404 and previous config saved to /var/cache/conftool/dbconfig/20250801-001119-ladsgroup.json
[00:13:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80405 and previous config saved to /var/cache/conftool/dbconfig/20250801-001345-ladsgroup.json
[00:18:16] SRE, Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11052623 (Nemoralis)
[00:20:44] (CR) Dzahn: "ACK, I will follow-up on this if needed. Just out tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: BryanDavis)
[00:28:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P80406 and previous config saved to /var/cache/conftool/dbconfig/20250801-002852-ladsgroup.json
[00:32:49] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1174852 (owner: TrainBranchBot)
[00:37:18] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bbtrp - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[00:40:54] (CR) BCornwall: "Bad timing since big changes are currently under review" [puppet] - https://gerrit.wikimedia.org/r/1174853 (owner: Ncmonitor)
[00:41:06] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] -
https://gerrit.wikimedia.org/r/1174853 (owner: Ncmonitor)
[00:41:08] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1174854 (owner: Ncmonitor)
[00:44:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P80407 and previous config saved to /var/cache/conftool/dbconfig/20250801-004359-ladsgroup.json
[00:57:13] SRE, Phabricator, Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052679 (AntiCompositeNumber) Yup, it's working now. Thanks!
[00:59:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T400854)', diff saved to https://phabricator.wikimedia.org/P80408 and previous config saved to /var/cache/conftool/dbconfig/20250801-005907-ladsgroup.json
[00:59:13] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[01:00:43] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:02:37] SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic, Patch-For-Review: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052701 (Bawolff) [Anyways, I adjusted the QuickInstantCo...
[01:10:13] (CR) Krinkle: "So.. it seems there isn't a keyword here for funneling a URL prefix to a fixed destination.
It can only override an exact URL 1:1, or funn" [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: Dzahn)
[01:11:27] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 10m 44s)
[01:12:18] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-bbtrp - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[01:15:02] (PS1) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:17:01] RECOVERY - Check unit status of clean-stale-certs on acmechief2002 is OK: OK: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:17:13] (PS2) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:19:31] (CR) CI reject: [V:-1] mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: Krinkle)
[01:21:01] (PS3) Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222)
[01:35:02] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11052710 (aaron) >>! In T328872#10889545, @Ladsgroup wrote: > I understand...
[01:36:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-5m57k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[01:42:04] (PS1) BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419)
[01:42:29] (CR) CI reject: [V:-1] acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: BCornwall)
[01:43:12] (PS2) BCornwall: acme-chief: Move clean-stale-certs to file [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419)
[01:44:36] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6480/co" [puppet] - https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: BCornwall)
[01:46:42] (CR) RLazarus: "Please also add httpbb tests, at `production/modules/profile/files/httpbb/appserver/test_redirects.yaml`.
You might want to test more than" [puppet] - https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T80222) (owner: Krinkle)
[02:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:12:54] ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#11052724 (BCornwall) a:BCornwall→wiki_willy Assigning to @wiki_willy as he's taking over communications for this.
[02:13:30] ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11052738 (BCornwall) a:BCornwall→RobH Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?
[03:03:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:04:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:06:24] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11052755 (Papaul) I login to the the Nokia switches in row E to check the transceivers in place 1 transceiver on each switch is showing unspecified> I will have to troubleshoot this when I...
[03:07:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:07:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[03:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:56:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-5m57k - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[04:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:33] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:06:18] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052789 (Joe) >>! In T400119#11051059, @Alien333 wrote: > Where does UAs like `MediaWiki-JS/1.45.0-wmf.12`, the defaults used by a plain `new mw.Api()` in an on-wiki script,...
[04:10:34] SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052791 (Joe) >>! In T400881#11050371, @Bawolff wrote: > Are you suggesting inc...
[04:12:27] SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052794 (Joe) >>! In T400881#11052701, @Bawolff wrote: > [Anyways, I adjusted t...
[04:25:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:26:39] FIRING: CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:30:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:31:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status -
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:00:10] SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11052795 (A_smart_kitten) >>! In T400881#11052791, @Joe wrote: >>>! In T400881#1...
[05:09:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:02] (CR) Tim Starling: [C:+2] Enable sitemaps API [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:05] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:12] (Merged) jenkins-bot: Enable sitemaps API [mediawiki-config] - https://gerrit.wikimedia.org/r/1173575 (https://phabricator.wikimedia.org/T400023) (owner: Tim Starling)
[05:14:39] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]]
[05:14:45] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[05:16:41] !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[05:59:03] !log tstarling@deploy1003 tstarling: Continuing with sync
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0600)
[06:00:16] SRE, Phabricator, Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11052828 (Joe) Please note, this solution is temporary: bots working from clouds will break repeatedly if they're not properly identified with...
[06:04:39] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173575|Enable sitemaps API (T400023)]] (duration: 49m 59s)
[06:04:44] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[06:09:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:04] (PS1) Bartosz Wójtowicz: ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351)
[06:39:08] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1174971
[06:49:12] (PS1) Jelto: Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252)
[06:49:37] (CR) CI reject: [V:-1] Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252) (owner: Jelto)
[06:50:48] (PS2) Jelto: Revert "gitlab: pause restore on gitlab2002" [puppet] - https://gerrit.wikimedia.org/r/1174976 (https://phabricator.wikimedia.org/T400252)
[07:00:04] Deploy window No deploys all day!
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0700)
[07:07:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:07:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:07:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica
[07:08:45] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SystemdUnitFailed (instance stat1008:9100) - https://phabricator.wikimedia.org/T400968 (tappof) NEW
[07:09:17] sre-alert-triage, serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969 (tappof) NEW
[07:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:09:55] (PS1) Vgutierrez: Revert^2 "acme-chief: Add batch of pay-for-edit domains" [puppet] - https://gerrit.wikimedia.org/r/1174984
[07:14:26] (CR) Vgutierrez: [C:+2] Revert^2 "acme-chief: Add batch of pay-for-edit domains" [puppet] - https://gerrit.wikimedia.org/r/1174984 (owner: Vgutierrez)
[07:28:32] (CR) Brouberol: [C:+1] "Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup)
[07:28:34] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: Drop flaggedrevs_tracking job [deployment-charts] - https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: Ladsgroup)
[07:29:47] (PS3) Filippo Giunchedi: profile::thanos::recording_rules: add two rules for the EditCheck SLO [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:29:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[07:30:02] (CR) Filippo Giunchedi: [C:+1] "LGTM, just reformatted for legibility" [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:30:07] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11052931 (TheDJ) >>! In T400119#11052789, @Joe wrote: > Case in point, I can't find any request with that UA in the logs for the past few days. Indeed it's not in the list of...
[07:31:26] (CR) Elukey: [C:+1] ml-services: Update image for article-descriptions model.
[deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:32:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:33:46] (PS1) Brouberol: mediawiki-dumps-legacy: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1174990 (https://phabricator.wikimedia.org/T398936)
[07:36:57] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1174990 (https://phabricator.wikimedia.org/T398936) (owner: Brouberol)
[07:38:20] (CR) Elukey: [C:+2] profile::thanos::recording_rules: add two rules for the EditCheck SLO [puppet] - https://gerrit.wikimedia.org/r/1174748 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[07:39:13] (PS1) Vgutierrez: acme-chief: Remove nc domains with DNSSEC enabled [puppet] - https://gerrit.wikimedia.org/r/1174993 (https://phabricator.wikimedia.org/T400731)
[07:40:06] (CR) Vgutierrez: [C:+2] acme-chief: Remove nc domains with DNSSEC enabled [puppet] - https://gerrit.wikimedia.org/r/1174993 (https://phabricator.wikimedia.org/T400731) (owner: Vgutierrez)
[07:41:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:42:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:43:03] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:51:16] (CR) Bartosz Wójtowicz: [C:+2] ml-services: Update image for article-descriptions model.
[deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:53:00] (Merged) jenkins-bot: ml-services: Update image for article-descriptions model. [deployment-charts] - https://gerrit.wikimedia.org/r/1174968 (https://phabricator.wikimedia.org/T400351) (owner: Bartosz Wójtowicz)
[07:55:45] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:01:56] !log bwojtowicz@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:06:29] !log bwojtowicz@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:07:40] (PS1) Kosta Harlan: UserInfoCard: Add config var for making UIC available [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627)
[08:15:05] (PS1) Kosta Harlan: CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681)
[08:15:49] (PS2) Kosta Harlan: CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681)
[08:17:43] (CR) Mszwarc: [C:+1] CheckUser: Make user info card feature discoverable [mediawiki-config] - https://gerrit.wikimedia.org/r/1175042 (https://phabricator.wikimedia.org/T398681) (owner: Kosta Harlan)
[08:17:55] (CR) Clément Goubert: [C:+1] mwmaint: decommission mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/1174753 (https://phabricator.wikimedia.org/T400442) (owner: Jasmine)
[08:19:40] (PS3) Elukey: profile::pyrra::filesystem::slos: add edit-check ratio [puppet] - https://gerrit.wikimedia.org/r/1174749 (https://phabricator.wikimedia.org/T395444)
[08:26:46] (CR) Mszwarc: [C:+1] UserInfoCard: Add config var for making UIC available [extensions/CheckUser] (wmf/1.45.0-wmf.12) - https://gerrit.wikimedia.org/r/1175002 (https://phabricator.wikimedia.org/T400627) (owner: Kosta Harlan)
[08:31:07] (PS1) Jelto: gitlab: enable nftables throttling again in monitoring mode [puppet] - https://gerrit.wikimedia.org/r/1175043 (https://phabricator.wikimedia.org/T400971)
[08:34:26] (CR) Elukey: [C:+2] profile::pyrra::filesystem::slos: add edit-check ratio [puppet] - https://gerrit.wikimedia.org/r/1174749 (https://phabricator.wikimedia.org/T395444) (owner: Elukey)
[08:34:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[08:37:43] (CR) Jelto: [C:+2] gitlab: enable nftables throttling again in monitoring mode [puppet] - https://gerrit.wikimedia.org/r/1175043 (https://phabricator.wikimedia.org/T400971) (owner: Jelto)
[08:41:31] (PS7) Ayounsi: Nokia ZTP [puppet] - https://gerrit.wikimedia.org/r/1174725
[08:43:04] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:55:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[09:02:46] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye
[09:09:16] (CR) Elukey: "LGTM, I just left a suggestion for a little code refactor that would help to DRY a bit the code (lemme know if I got it correctly or not)."
[cookbooks] - https://gerrit.wikimedia.org/r/1174407 (owner: Ayounsi)
[09:12:58] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053153 (Joe) >>! In T400119#11052931, @TheDJ wrote: >>>! In T400119#11052789, @Joe wrote: >> Case in point, I can't find any request with that UA in the logs for the past fe...
[09:15:08] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053166 (Joe) To give a bit of context, over the last day we saw: * 62 million valid requests with no user-agent * 24.5 million valid requests with user agent `okhttp/*` * 1...
[09:16:20] SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11053167 (Alien333) Ok, thanks for the precisions!
[09:18:50] (CR) Elukey: "Added a couple of comments to better understand, lemme know!" [puppet] - https://gerrit.wikimedia.org/r/1174725 (owner: Ayounsi)
[09:32:51] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bullseye
[09:38:23] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bullseye
[09:38:46] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[09:40:18] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11053227 (MatthewVernon) @Jhancock.wm thanks for doing ms-be2088 soon :) I'm afraid the others will need doing on a rather longer timescale (I'll have t...
[09:44:40] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[09:44:48] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11053235 (MatthewVernon) "swift-repl" (it's not actually that any more, bu...
[09:49:48] FIRING: [2x] ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-8w9f8 - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[09:52:35] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[09:57:06] (PS1) Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:00:29] ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11053246 (elukey) I finally found a way to make the Debian Installer to see the two OS disks, namely using Bookworm: ` ~ # ls /dev/sd* /dev/sda /dev/sda1 /dev/sdb...
[10:01:05] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[10:01:10] (PS2) Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:05:51] (PS3) Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:08:42] SRE-SLO, EditCheck, Editing-team (Kanban Board), Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11053270 (elukey) The dashboards are up! * Rolling window: [[ https://slo.wikimedia.org/objectives?expr=%7B__name__=%22edit-check...
[10:10:40] (CR) Giuseppe Lavagetto: [C:-1] "Overall LGTM, with one detail of the logic in the implementation that doesn't convince me." [puppet] - https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: Vgutierrez)
[10:14:40] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm
[10:18:50] (CR) Clément Goubert: [C:+2] python: Include virtualenv packages in python base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1174506 (owner: Ahmon Dancy)
[10:18:51] (CR) Clément Goubert: [V:+2 C:+2] python: Include virtualenv packages in python base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1174506 (owner: Ahmon Dancy)
[10:18:56] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm
[10:26:28] (PS4) Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238)
[10:26:48] (CR) Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW (1
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:27:20] (03CR) 10Clément Goubert: [V:03+2 C:03+2] "`python3` images rebuilt for `bullseye` and `bookworm`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1174506 (owner: 10Ahmon Dancy) [10:29:03] (03PS5) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) [10:30:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:30:49] (03CR) 10Vgutierrez: "syntax has been validated with `operations/puppet/modules/profile/files/cache/haproxy/tests$ ./docker_run.sh cp6016.drmrs.wmnet 1175056`" [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:33:30] (03PS6) 10Vgutierrez: cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) [10:38:53] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [10:43:05] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053322 (10Clement_Goubert) This host was set aside for `mw-experimental` work by @jijiki, I'll silence the alert for a month. 
[10:45:41] (03PS1) 10STran: Use tempaccounts.dblist to enable temporary accounts for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) [10:49:57] (03CR) 10STran: "I wasn't sure if we wanted to use a dblist as the canonical list so I've split the difference unfortunately and named the dblist generical" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [10:50:36] (03CR) 10Brouberol: [C:03+2] deployment_server: define opensearch-test kubeconfigs in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1174720 (https://phabricator.wikimedia.org/T400898) (owner: 10Brouberol) [10:50:50] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqaid: define an opensearch-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174721 (https://phabricator.wikimedia.org/T400898) (owner: 10Brouberol) [10:53:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:54:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80412 and previous config saved to /var/cache/conftool/dbconfig/20250801-105400-ladsgroup.json [10:54:04] (03CR) 10Ladsgroup: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup) [10:54:06] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:54:29] (03PS10) 10Tiziano Fogli: nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) [10:54:29] (03CR) 10Tiziano Fogli: "Sample rules generated on pontoon: https://pastebin.com/ZjDS2Tnd." 
[puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [10:54:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-jq58c - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [10:56:32] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [10:56:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80413 and previous config saved to /var/cache/conftool/dbconfig/20250801-105631-ladsgroup.json [10:58:02] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174444 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli) [10:58:12] (03CR) 10FNegri: wikireplicas scripts: setup pytest, add first test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [10:58:29] (03Abandoned) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [10:59:46] (03Merged) 10jenkins-bot: thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174444 (https://phabricator.wikimedia.org/T374350) (owner: 10Effie Mouzeli) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T0700) [11:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? 
GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250801T1100). nyaa~ [11:00:54] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053372 (10jijiki) 05Open→03Stalled sorry folks, host's number is up for retirement, my bad. tx @Clement_Goubert [11:01:09] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11053380 (10jijiki) p:05Triage→03Low [11:01:15] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Validate JWT tokens issued by MW [puppet] - 10https://gerrit.wikimedia.org/r/1175056 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [11:07:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:07:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [11:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:09:31] (03CR) 10Harroyo-wmf: [C:03+1] Use tempaccounts.dblist to enable temporary accounts for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80414 and previous config saved to /var/cache/conftool/dbconfig/20250801-111139-ladsgroup.json [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:40] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [11:21:18] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1051.eqiad.wmnet, wikikube-worker1118.eqiad.wmnet, wikikube-worker1123.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, [11:21:18] -worker1121.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1094.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1136.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1149.eqiad.wmnet, wikikube-worker1313.e [11:21:18] et, wikikube-worker1056.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1160.eqiad.wmnet, 
wikikube-worker1003.eqiad.wmnet, wikikube-worker1 https://wikitech.wikimedia.org/wiki/PyBal [11:21:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1280.eqiad.wmnet, wikikube-worker1298.eqiad.wmnet, wikikube-worker1148.eqiad.wmnet, wikikube-worker1155.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1116.eqiad.wmnet, [11:21:20] -worker1050.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1132.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1053.eqiad.wmnet, wikikube-worker1159.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1112.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1119.e [11:21:20] et, wikikube-worker1289.eqiad.wmnet, wikikube-worker1135.eqiad.wmnet, wikikube-worker1162.eqiad.wmnet, wikikube-worker1002.eqiad.wmnet, wikikube-worker1098.eqiad.wmnet, wikikube-worker1 https://wikitech.wikimedia.org/wiki/PyBal [11:21:28] erm [11:21:35] that's me, looking [11:22:04] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:22] hnowlan: I will reroute the call to your phone then [11:22:26] ack hnowlan [11:22:30] * vgutierrez orders a t-shirt [11:22:47] there's a bad change deployed for it cc effie [11:23:05] I have not deployed the change yet to prod [11:23:21] ah you did? 
[11:23:27] I roll restarted [11:24:27] (03PS1) 10Hnowlan: Revert "thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175077 [11:24:41] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [11:24:56] shouldn't have been merged on a friday probably [11:25:58] it's rolled back, will hopefully resolve [11:26:45] it's recovering [11:26:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P80415 and previous config saved to /var/cache/conftool/dbconfig/20250801-112647-ladsgroup.json [11:26:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:14] it is still quite early in the day and I wanted to test on staging [11:27:18] apologies oncallers [11:27:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[11:27:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [11:28:13] we should toggle on staging if we're merging on a friday [11:28:20] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:28:20] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:32:12] hnowlan: all good, no worries [11:35:10] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:37:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:38:28] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:40:20] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:00] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54369 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:41:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T400854)', diff saved to https://phabricator.wikimedia.org/P80417 and previous config saved to /var/cache/conftool/dbconfig/20250801-114155-ladsgroup.json [11:41:59] (03CR) 10Kosta Harlan: "IMO it would be less confusing to include all the wikis we've already deployed to in the new dblist. If we do that, during deployment we s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [11:42:07] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [11:42:13] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:42:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:42:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80418 and previous config saved to /var/cache/conftool/dbconfig/20250801-114238-ladsgroup.json [11:45:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80419 and previous config saved to /var/cache/conftool/dbconfig/20250801-114511-ladsgroup.json [11:49:11] (03CR) 10Filippo Giunchedi: 
[C:03+1] nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [12:00:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80420 and previous config saved to /var/cache/conftool/dbconfig/20250801-120019-ladsgroup.json [12:01:23] (03PS1) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 [12:03:49] (03PS8) 10Ayounsi: sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 [12:03:58] (03CR) 10Ayounsi: sre.network.tls: add Nokia SR-Linux support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [12:07:33] (03CR) 10Ayounsi: "thanks, reply inline." [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi) [12:09:41] (03CR) 10Dreamy Jazz: "+1. I would prefer that all the DBs are in the list, in case at a later stage we use the dblist for something else (e.g. 
running maintenan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [12:09:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159#11053589 (10Jclark-ctr) 05Open→03Resolved [12:12:56] (03PS2) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 [12:14:04] (03PS3) 10Effie Mouzeli: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 [12:15:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P80421 and previous config saved to /var/cache/conftool/dbconfig/20250801-121526-ladsgroup.json [12:19:45] (03CR) 10Hnowlan: [C:03+1] thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli) [12:19:50] (03Abandoned) 10Hnowlan: Revert "thumbor: Add configurable SUBPROCESS_TIMEOUT_KILL_AFTER" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175077 (owner: 10Hnowlan) [12:23:02] (03PS1) 10Ladsgroup: recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987) [12:23:10] (03CR) 10Ladsgroup: [C:03+2] recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987) (owner: 10Ladsgroup) [12:24:57] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli) [12:26:40] (03Merged) 10jenkins-bot: thumbor: update previous changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175087 (owner: 10Effie Mouzeli) [12:26:52] (03CR) 10Elukey: [C:03+1] 
Nokia ZTP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi) [12:27:54] (03CR) 10Elukey: [C:03+1] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [12:30:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T400854)', diff saved to https://phabricator.wikimedia.org/P80422 and previous config saved to /var/cache/conftool/dbconfig/20250801-123034-ladsgroup.json [12:30:38] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [12:30:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [12:30:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80423 and previous config saved to /var/cache/conftool/dbconfig/20250801-123057-ladsgroup.json [12:37:12] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053640 (10Jclark-ctr) ` jclark@ssw1-f1-eqiad> show chassis environment Class Item Status Measurement Power FPC 0 Power Supply 0 OK 41 degrees C / 10... 
[12:37:38] (03Merged) 10jenkins-bot: recountCategories: Avoid escpaing column name [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175095 (https://phabricator.wikimedia.org/T400987) (owner: 10Ladsgroup) [12:39:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80424 and previous config saved to /var/cache/conftool/dbconfig/20250801-123928-ladsgroup.json [12:39:32] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [12:40:52] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053660 (10ayounsi) Thanks, weird that the alarms are still active :( Can you follow up with JTAC ? [12:43:13] (03CR) 10Ayounsi: [C:03+2] sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [12:46:20] (03PS1) 10Vgutierrez: cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) [12:46:22] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]] [12:46:25] T400987: Regression: Category member counts broken in German Wikipedia - https://phabricator.wikimedia.org/T400987 [12:46:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:47:06] (03PS8) 10Ayounsi: Nokia ZTP [puppet] - 10https://gerrit.wikimedia.org/r/1174725 [12:47:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:47:12] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175100 [12:48:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [12:48:30] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:49:31] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:49:43] (03Merged) 10jenkins-bot: sre.network.tls: add Nokia SR-Linux support [cookbooks] - 10https://gerrit.wikimedia.org/r/1174407 (owner: 10Ayounsi) [12:50:15] (03CR) 10Brouberol: [C:03+2] "Thank you! It was good timing: I deployed it 10 minutes the dumps v1 DAGs kicked in :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173930 (https://phabricator.wikimedia.org/T398936) (owner: 10Ladsgroup) [12:50:20] (03CR) 10Ayounsi: [C:03+2] Nokia ZTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1174725 (owner: 10Ayounsi) [12:51:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11053691 (10CDanis) [12:52:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11053706 (10CDanis) [12:53:37] (03CR) 10Giuseppe Lavagetto: [C:03+1] cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [12:53:58] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 5400 [12:54:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80426 and 
previous config saved to /var/cache/conftool/dbconfig/20250801-125436-ladsgroup.json [12:54:58] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175095|recountCategories: Avoid escpaing column name (T400987)]] (duration: 08m 36s) [12:55:01] T400987: Regression: Category member counts broken in German Wikipedia - https://phabricator.wikimedia.org/T400987 [12:55:30] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5400 [12:55:31] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053726 (10Jclark-ctr) I do not believe I have login access to JTAC, but I will coordinate with RobH when he returns to get access. I made some adjustments to the airflow in rack F1. Changes are us... [12:55:52] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix JWT exp date ACL [puppet] - 10https://gerrit.wikimedia.org/r/1175099 (https://phabricator.wikimedia.org/T400238) (owner: 10Vgutierrez) [12:56:10] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 37662 [12:56:44] !log jiji@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [12:56:54] !log jiji@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:57:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37662 [12:57:10] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 263252 [12:57:31] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263252 [12:57:37] !log re-running recountCategories.php on all wikis except s4 and s1 (T400987) [12:57:37] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 274685 [12:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:05] !log ayounsi@cumin1003 END (PASS) - 
Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 274685 [12:59:20] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11053732 (10ayounsi) Thanks. Nothing out of the ordinary in the logs. [13:04:10] 06SRE, 07Epic, 05Goal: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527#11053740 (10CDanis) [13:04:57] 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053743 (10Jelto) [13:05:43] 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11053744 (10Jelto) [13:09:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P80427 and previous config saved to /var/cache/conftool/dbconfig/20250801-130943-ladsgroup.json [13:24:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T400854)', diff saved to https://phabricator.wikimedia.org/P80428 and previous config saved to /var/cache/conftool/dbconfig/20250801-132451-ladsgroup.json [13:24:56] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [13:25:07] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:25:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80429 and previous config saved to /var/cache/conftool/dbconfig/20250801-132514-ladsgroup.json [13:27:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80430 and previous config saved to /var/cache/conftool/dbconfig/20250801-132745-ladsgroup.json [13:33:50] !log ayounsi@cumin1003 START - Cookbook sre.network.provision for device lsw1-e2-codfw.mgmt.codfw.wmnet [13:33:52] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:37:37] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e2-codfw - ayounsi@cumin1003" [13:37:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e2-codfw - ayounsi@cumin1003" [13:37:42] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:37:48] (03PS2) 10STran: Use tempaccounts.dblist to manage rollout wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) [13:37:48] (03PS1) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) [13:38:23] (03PS1) 10Hashar: gerrit: add daemons ssh host key to known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) [13:38:59] (03CR) 10STran: "Great, thanks! In that case I think I would prefer to split this up. iirc rollout is scheduled for a tuesday so I could feasibly test the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:39:48] (03PS2) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) [13:40:24] (03CR) 10CDanis: [C:03+1] "Seems reasonable IMO." 
[puppet] - 10https://gerrit.wikimedia.org/r/1174842 (https://phabricator.wikimedia.org/T394838) (owner: 10BryanDavis) [13:41:47] (03CR) 10Hashar: [C:03+1] "I have tried it on `gerrit1003` by disabling the Puppet agent and manually amending the file, that fixed the host key issue 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [13:42:08] (03PS1) 10Giuseppe Lavagetto: text-frontend: enforcement of UA policy [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) [13:42:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80431 and previous config saved to /var/cache/conftool/dbconfig/20250801-134253-ladsgroup.json [13:47:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:58:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P80432 and previous config saved to /var/cache/conftool/dbconfig/20250801-135800-ladsgroup.json [14:05:42] !log upgrade redis-server and tools package on idm nodes for security upgrades [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:53] (03PS1) 10MVernon: thanos: drain thanos-be1005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1175120 (https://phabricator.wikimedia.org/T400877) [14:13:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T400854)', diff saved to https://phabricator.wikimedia.org/P80433 and previous config saved to 
/var/cache/conftool/dbconfig/20250801-141308-ladsgroup.json [14:13:12] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [14:13:13] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [14:13:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80434 and previous config saved to /var/cache/conftool/dbconfig/20250801-141320-ladsgroup.json [14:13:58] (03PS1) 10Krinkle: In sitemap responses set CC: public [core] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175121 (https://phabricator.wikimedia.org/T400023) [14:15:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80435 and previous config saved to /var/cache/conftool/dbconfig/20250801-141553-ladsgroup.json [14:16:05] (03PS1) 10Hashar: gerrit: replica renames as "gerrit2" application user [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) [14:16:22] (03PS1) 10Clare Ming: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 [14:16:39] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1174701 (owner: 10L10n-bot) [14:18:28] (03CR) 10Eevans: "> Sorry for the long delay (and in future, feel free to chase if it looks like I've forgotten an outstanding review)." 
[debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 (owner: 10Eevans) [14:18:29] (03CR) 10Hashar: gerrit: config replicas for rename-project plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [14:18:46] (03PS2) 10Eevans: convenience script to cleanup Cassandra instance state [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 [14:18:48] (03PS2) 10Clare Ming: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 [14:30:20] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175123 -- context is https://wikimedia.slack.com/archives/C05ERLBF0E7/p1753993920317559, are SRE ok with a deployment? (cc: thcipriani). I can deploy. [14:31:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80436 and previous config saved to /var/cache/conftool/dbconfig/20250801-143104-ladsgroup.json [14:32:34] cjming: deploy for https://gerrit.wikimedia.org/r/1175123 is fine by me, sukhe or denisse Friday emergency deploy fine to do now? (pinged as SREs on call) [14:32:36] cjming: no concerns from SRE as such (with my on-call hat on). not that I fully understand the change but I don't think it should cause any issues. thanks for checking. [14:32:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:32:55] thcipriani: ^ [14:33:07] <3 [14:33:16] thanks sukhe, thcipriani - i will proceed then [14:35:58] No concerns from my side. 
[14:36:14] ty denisse [14:37:10] (03CR) 10ArielGlenn: text-frontend: enforcement of UA policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175115 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [14:37:13] is it problematic that the last 3 backports errored out? [14:39:11] going ahead with deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1175123 [14:39:14] cjming: I believe dancy fixed that late yesterday [14:39:23] cool - gtk [14:39:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming) [14:40:27] all errored on the test server, so if we run into problems there, there may be investigation needed, but there was a command line deploy (plus a scap deploy) after the errors you see in spiderpig [14:40:52] ack [14:40:53] (...where "scap deploy" means deploying a new scap version...) [14:41:05] (03CR) 10Dr0ptp4kt: [C:03+1] Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming) [14:44:07] (03Merged) 10jenkins-bot: Revert "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175123 (owner: 10Clare Ming) [14:44:20] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]] [14:46:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P80437 and previous config saved to /var/cache/conftool/dbconfig/20250801-144611-ladsgroup.json [14:46:14] !log cjming@deploy1003 cjming: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:47:49] 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054064 (10Novem_Linguae) 05Open→03Resolved a:03Novem_Linguae Marking as resolved. Thanks! [14:48:02] !log cjming@deploy1003 cjming: Continuing with sync [14:49:24] (03CR) 10Elukey: "Preliminary pass! Hope what I wrote makes sense!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:53:10] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175123|Revert "MetricsPlatform: Disable synchronous configs fetching"]] (duration: 08m 50s) [14:57:34] (03CR) 10MVernon: [C:03+1] "Cool, I think it's worth having this available." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1156924 (owner: 10Eevans) [15:01:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T400854)', diff saved to https://phabricator.wikimedia.org/P80438 and previous config saved to /var/cache/conftool/dbconfig/20250801-150119-ladsgroup.json [15:01:23] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [15:01:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:02:21] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1230.eqiad.wmnet with reason: Maintenance [15:02:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80439 and previous config saved to /var/cache/conftool/dbconfig/20250801-150228-ladsgroup.json [15:03:45] PROBLEM - MariaDB Replica Lag: s7 #page on db2220 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.85 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:03:51] taking a look [15:03:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:25] !incidents [15:04:25] 6535 (UNACKED) db2220 (paged)/MariaDB Replica Lag: s7 (paged) [15:04:26] 6534 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:04:29] !ack 6535 [15:04:29] 6535 (ACKED) db2220 (paged)/MariaDB Replica Lag: s7 (paged) [15:04:31] Amir1: <3 [15:05:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80440 and previous config saved to /var/cache/conftool/dbconfig/20250801-150501-ladsgroup.json [15:06:22] Slave_SQL_Running_State: Waiting for semi-sync ACK from slave [15:06:44] Here as well. [15:07:08] now Slave_SQL_Running_State: init for update [15:08:08] I think this is heartbeat [15:08:19] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [15:08:47] db2220 is primary right? 
[15:08:53] codfw [15:09:00] so not anything super major [15:09:05] yep [15:09:26] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:04] the replication is clearly moving forward with no issues [15:10:30] ok that's good at least. the resolve hasn't come in yet but as long as it is moving. [15:11:03] I mean, the replication is working but all the systems are showing lag [15:11:18] which usually means heartbeat needs a kick but that didn't fix it [15:11:30] and the lag time is… going up [15:12:11] 10 minutes ago I saw '256 seconds', 421 seconds 3 minutes ago, now 502.
[15:12:21] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e2-codfw - ayounsi@cumin1003" [15:12:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for lsw1-e2-codfw - ayounsi@cumin1003" [15:12:25] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:26] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-e2-codfw.mgmt.codfw.wmnet [15:13:40] Replication lag: https://grafana.wikimedia.org/goto/nCv5uLQHg?orgId=1 [15:13:42] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:14:20] I stopped a write script to see if it's just a write load problem [15:15:04] it seems it was the load [15:15:13] it's not growing that fast anymore [15:15:33] stuck at 9m3s https://orchestrator.wikimedia.org/web/cluster/alias/s7 [15:16:04] so in 9 minutes it should start going down once it actually processes all the heavy writes [15:17:00] ok. thanks! [15:17:10] yeah, stuck at 545 secs [15:17:40] which is… 9min 5sec [15:17:46] 06SRE, 06collaboration-services, 10Phabricator, 06Traffic: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11054134 (10Pppery) a:05Novem_Linguae→03None [15:18:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:28] The replication lag is already going down. 
[15:19:41] yeah, going down progressively [15:19:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80441 and previous config saved to /var/cache/conftool/dbconfig/20250801-152009-ladsgroup.json [15:21:31] I actually (originally) thought the whole of prod was going crazy, but Commons and enwiki were fine, so :-p [15:22:34] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:25:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [15:26:23] now it should go down fast [15:30:28] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:35:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P80442 and previous config saved to /var/cache/conftool/dbconfig/20250801-153516-ladsgroup.json [15:38:45] RECOVERY - MariaDB Replica Lag: s7 #page on db2220 is OK: OK slave_sql_lag Replication lag: 8.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:38:50] :D [15:39:06] Page resolved on Splunk On-Call.
[15:45:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [15:50:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T400854)', diff saved to https://phabricator.wikimedia.org/P80444 and previous config saved to /var/cache/conftool/dbconfig/20250801-155024-ladsgroup.json [15:50:28] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [15:50:41] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [15:51:20] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:52:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:52:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80445 and previous config saved to /var/cache/conftool/dbconfig/20250801-155212-ladsgroup.json [15:55:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80446 and previous config saved to /var/cache/conftool/dbconfig/20250801-155548-ladsgroup.json [15:55:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [15:57:19] (03PS1) 10Jly: values-security-landing-page.yaml: bump image version to 
2025-08-01-155110 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852) [16:00:26] (03CR) 10SBassett: [C:03+2] "LGTM to me and matches up with https://gitlab.wikimedia.org/repos/sre/miscweb/security-landing-page/-/jobs/577268#L81" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852) (owner: 10Jly) [16:00:57] (03PS1) 10Ayounsi: Nokia ZTP: small fixes and better python script [puppet] - 10https://gerrit.wikimedia.org/r/1175141 [16:02:16] (03Merged) 10jenkins-bot: values-security-landing-page.yaml: bump image version to 2025-08-01-155110 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175140 (https://phabricator.wikimedia.org/T398852) (owner: 10Jly) [16:04:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:07:00] !log jly@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:07:13] !log jly@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:07:20] !log jly@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:07:40] !log jly@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:07:48] !log jly@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:08:04] !log jly@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:08:07] !log jly@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:08:11] !log jly@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:08:14] !log jly@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:08:16] !log jly@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/miscweb: apply [16:10:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80447 and previous config saved to /var/cache/conftool/dbconfig/20250801-161056-ladsgroup.json [16:11:58] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054317 (10Joe) >>! In T400881#11052795, @A_smart_kitten wrote: >>>! In T400881#1... [16:12:59] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054327 (10Joe) [16:26:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P80448 and previous config saved to /var/cache/conftool/dbconfig/20250801-162603-ladsgroup.json [16:33:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174583 (https://phabricator.wikimedia.org/T400281) (owner: 10Theprotonade) [16:36:10] (03PS1) 10Kimberly Sarabia: Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) [16:37:13] (03CR) 10Clare Ming: [C:03+1] Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [16:39:16] sorry one more -- I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1175144 -- context is 
T399486, are SRE ok with a deployment? (cc: thcipriani). I can deploy. [16:39:17] T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486 [16:40:06] just as a forewarning - there might be one more after this (revert of the first thing I deployed earlier) [16:41:09] sukhe? denisse? [16:41:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T400854)', diff saved to https://phabricator.wikimedia.org/P80449 and previous config saved to /var/cache/conftool/dbconfig/20250801-164111-ladsgroup.json [16:41:15] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [16:41:28] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [16:41:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80450 and previous config saved to /var/cache/conftool/dbconfig/20250801-164134-ladsgroup.json [16:43:43] cjming: is this urgent for Friday? 
(asking) [16:44:36] sukhe: yes - it will inform us whether we need to roll back something else that's riding the train next week [16:45:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80451 and previous config saved to /var/cache/conftool/dbconfig/20250801-164510-ladsgroup.json [16:45:14] sorry for the friday drama [16:46:39] * thcipriani reads [16:46:43] please check with thcipriani too [16:48:24] fwiw dr0ptp4kt is advising on all these deployments as well - so i'm not just going rogue [16:49:24] (03PS1) 10BCornwall: Revert "acme-chief: Remove nc domains with DNSSEC enabled" [puppet] - 10https://gerrit.wikimedia.org/r/1175146 [16:51:56] cjming: yeah of course no worries about that :) [16:52:20] https://wikitech.wikimedia.org/wiki/Deployments/Emergencies dictates that SRE needs releng to be informed as well [16:52:31] if it is urgent and that has been discussed, that's fine by at least me on on-call, but that's just SRE [16:52:35] sukhe: ack - thanks [16:54:05] thcipriani: we think this backport in WME will fix the event produce rate cliff drop and if it does, then we'll revert the config revert i did earlier if all this is ok with you [16:55:21] cjming: what happens if this doesn't fix it? [16:58:03] for clarity, did the revert from earlier get you to a stable place, or no? 
[16:58:23] thcipriani: then we'll go back to the drawing board and examine all commits in the last train cut -- but this backport was merged in 1.45.0-wmf.11 but never got into master [16:58:41] the revert from earlier didn't change anything - graphs stayed the same [16:59:03] it further delays our ability to look at retention metrics for future a/b tests [16:59:51] cjming: I think the verbiage on emergency deploys can be better in the text and/or the intention [17:00:15] at least from SRE's side, emergency deploys mean that something needs to be deployed on Friday if it would otherwise lead to an outage over the weekend [17:00:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80452 and previous config saved to /var/cache/conftool/dbconfig/20250801-170018-ladsgroup.json [17:00:26] or if it fixes something that's immediately broken that can't be carried over the weekend [17:00:38] would this then fit that understanding, since you know more about this than at least I do? [17:10:29] * thcipriani still catching up on context [17:10:54] (03CR) 10Brouberol: [C:03+1] Add keytabs for new an-druid100[67] hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1171214 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [17:15:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P80453 and previous config saved to /var/cache/conftool/dbconfig/20250801-171525-ladsgroup.json [17:22:49] cjming: sukhe alright, I'm up-to-speed on context, I'm good with this deploy (and revert). seems like this deploy (plus revert of previous) should save some scrambling for folks. This deploy should put us in a stable spot for this, even if it doesn't 100% have the desired effect and should be safe to leave over the weekend (as I now understand it). [17:23:39] thcipriani: tysm [17:24:40] plus, seems small and safe.
More context for lurkers: backport made it to wmf.11 but not wmf.12 which caused a noticeable drop in event logs following group2 deploy. These deploys should align things. [17:25:14] (03PS1) 10CDobbins: . [puppet] - 10https://gerrit.wikimedia.org/r/1175151 [17:25:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [17:26:08] thcipriani: cjming: cool, +1 from SRE too then [17:26:12] (03PS2) 10CDobbins: admin: remove access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) [17:26:16] \o/ [17:26:55] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152 [17:27:12] (03Merged) 10jenkins-bot: Enable AA test on all wikis [extensions/WikimediaEvents] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1175144 (https://phabricator.wikimedia.org/T399486) (owner: 10Kimberly Sarabia) [17:27:27] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]] [17:27:30] T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486 [17:29:22] !log cjming@deploy1003 ksarabia, cjming: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:30:08] !log cjming@deploy1003 ksarabia, cjming: Continuing with sync [17:30:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T400854)', diff saved to https://phabricator.wikimedia.org/P80454 and previous config saved to /var/cache/conftool/dbconfig/20250801-173033-ladsgroup.json [17:30:36] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [17:30:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:30:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80455 and previous config saved to /var/cache/conftool/dbconfig/20250801-173056-ladsgroup.json [17:34:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80456 and previous config saved to /var/cache/conftool/dbconfig/20250801-173431-ladsgroup.json [17:35:34] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175144|Enable AA test on all wikis (T399486)]] (duration: 08m 06s) [17:35:37] T399486: Turn on the A/A test in production - https://phabricator.wikimedia.org/T399486 [17:39:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:39:30] hehe [17:40:08] zooming out shows similar spikes so I guess we will see how they ride out [17:42:58] (03CR) 10Ssingh: [C:03+1] admin: remove access for thiemowmde [puppet] - 
10https://gerrit.wikimedia.org/r/1175151 (https://phabricator.wikimedia.org/T400374) (owner: 10CDobbins) [17:43:39] (03PS1) 10Clare Ming: Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 [17:44:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:46:24] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11054529 (10CDobbins) 05Open→03In progress p:05Triage→03Medium a:03CDobbins [17:48:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054544 (10CDobbins) [17:49:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80457 and previous config saved to /var/cache/conftool/dbconfig/20250801-174939-ladsgroup.json [17:50:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054546 (10CDobbins) [17:56:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054573 (10CDobbins) [18:03:38] (03CR) 10Dr0ptp4kt: [C:03+1] Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming) [18:04:38] per approvals above - deploying one last thing and that will be it from us for this eventful friday [18:04:47] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P80458 and previous config saved to /var/cache/conftool/dbconfig/20250801-180447-ladsgroup.json [18:04:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054582 (10CDobbins) @KFrancis There's a discrepancy in the email address on the NDA sheet (Parmarsiddharth2parmar@gmail.com) and in this task (siddharthvp@gmail.com). [[ https... [18:04:56] cjming: gl :) [18:06:01] ty :) [18:06:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming) [18:06:58] (03Merged) 10jenkins-bot: Revert^2 "MetricsPlatform: Disable synchronous configs fetching" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175155 (owner: 10Clare Ming) [18:07:10] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]] [18:09:04] !log cjming@deploy1003 cjming: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:10:58] !log cjming@deploy1003 cjming: Continuing with sync [18:12:05] (03CR) 10Michael Große: [C:03+1] [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [18:12:53] (03CR) 10Michael Große: [C:03+1] "Understanding that this needs to wait, but I'm giving my plus +1 for when it is ready to move forward" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [18:13:05] (03CR) 10Michael Große: [C:03+1] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [18:15:20] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11054588 (10DavidBrooks) >>! In T400119#11053166, @Joe wrote: > There won't be adding some magical regexes trying to ban any single case. We will make the... 
[18:16:23] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175155|Revert^2 "MetricsPlatform: Disable synchronous configs fetching"]] (duration: 09m 13s) [18:16:53] (03PS1) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1175157 (https://phabricator.wikimedia.org/T399604) [18:17:55] (03PS1) 10CDobbins: admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) [18:18:28] (03CR) 10CI reject: [V:04-1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [18:19:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11054604 (10SD0001) @KFrancis Sounds like an error in the sheet. The NDA doc I signed bears the email siddharthvp@gmail.com. I don't recognize that other email - seems to belong... 
[18:19:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T400854)', diff saved to https://phabricator.wikimedia.org/P80459 and previous config saved to /var/cache/conftool/dbconfig/20250801-181954-ladsgroup.json [18:19:58] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [18:20:11] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:20:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80460 and previous config saved to /var/cache/conftool/dbconfig/20250801-182017-ladsgroup.json [18:20:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-wg45t - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [18:22:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80461 and previous config saved to /var/cache/conftool/dbconfig/20250801-182254-ladsgroup.json [18:27:14] (03PS2) 10CDobbins: admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) [18:27:48] (03CR) 10CI reject: [V:04-1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [18:29:37] (03CR) 10Dreamy Jazz: [C:03+1] Use tempaccounts.dblist to manage rollout wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175062 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [18:31:03] (03PS3) 10CDobbins: admin: add user noa to absent users [puppet] - 
10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) [18:32:33] (03CR) 10Ssingh: [C:03+1] "Looks good, adding @jborun@wikimedia.org for their review as well." [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [18:33:04] (03CR) 10Ssingh: [C:03+1] "(Wait for Joanna's review before merging, I would say.)" [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [18:37:26] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054679 (10CDobbins) [18:38:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80462 and previous config saved to /var/cache/conftool/dbconfig/20250801-183802-ladsgroup.json [18:39:24] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11054686 (10CDobbins) 05Open→03Stalled p:05Triage→03Medium [18:53:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P80463 and previous config saved to /var/cache/conftool/dbconfig/20250801-185310-ladsgroup.json [18:54:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:55:27] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11054728 (10BCornwall) a:05BCornwall→03None [19:00:34] (03PS1) 10CDobbins: admin: add 
SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) [19:01:16] (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [19:02:08] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11054736 (10VRiley-WMF) This PDU has been swapped [19:06:34] (03CR) 10BCornwall: [C:03+1] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [19:08:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T400854)', diff saved to https://phabricator.wikimedia.org/P80464 and previous config saved to /var/cache/conftool/dbconfig/20250801-190817-ladsgroup.json [19:08:21] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:08:22] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance [19:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:09:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:10:09] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2211.codfw.wmnet with reason: 
Maintenance [19:10:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80465 and previous config saved to /var/cache/conftool/dbconfig/20250801-191016-ladsgroup.json [19:13:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80466 and previous config saved to /var/cache/conftool/dbconfig/20250801-191354-ladsgroup.json [19:13:58] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:14:01] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11054761 (10CDobbins) [19:22:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:29:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80467 and previous config saved to /var/cache/conftool/dbconfig/20250801-192901-ladsgroup.json [19:32:14] 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Restore Taavi's analytics-privatedata-users membership - https://phabricator.wikimedia.org/T400900#11054796 (10CDobbins) [19:32:43] (03CR) 10CDobbins: [C:03+2] admin: add taavi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1174761 (https://phabricator.wikimedia.org/T400900) (owner: 10CDobbins) [19:33:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 159025 MB (4% inode=99%): /var/lib/hadoop/data/e 151560 MB (4% inode=99%): /var/lib/hadoop/data/m 159262 MB (4% 
inode=99%): /var/lib/hadoop/data/k 154811 MB (4% inode=99%): /var/lib/hadoop/data/f 154498 MB (4% inode=99%): /var/lib/hadoop/data/g 159939 MB (4% inode=99%): /var/lib/hadoop/data/h 160639 MB (4% inode=99%): /var/lib/hadoop/data [19:33:20] 0 MB (4% inode=99%): /var/lib/hadoop/data/j 154489 MB (4% inode=99%): /var/lib/hadoop/data/c 149060 MB (3% inode=99%): /var/lib/hadoop/data/l 153891 MB (4% inode=99%): /var/lib/hadoop/data/b 159936 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [19:34:36] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11054801 (10CDobbins) [19:39:32] (03PS1) 10BCornwall: Revert "ncredir: Revert addition of for-pay domains" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 [19:40:15] (03CR) 10BCornwall: [C:04-2] "Needs I765c6b00b15010822b200491209eb474f2034c40 before merging" [puppet] - 10https://gerrit.wikimedia.org/r/1175164 (owner: 10BCornwall) [19:40:25] FIRING: SystemdUnitFailed: mw-experimental-mediawiki-image-update.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:44:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P80468 and previous config saved to /var/cache/conftool/dbconfig/20250801-194409-ladsgroup.json [19:44:15] 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Restore 
Taavi's analytics-privatedata-users membership - https://phabricator.wikimedia.org/T400900#11054814 (10CDobbins) 05In progress→03Resolved [19:44:36] (03CR) 10CDobbins: [C:03+2] admin: add user noa to absent users [puppet] - 10https://gerrit.wikimedia.org/r/1175158 (https://phabricator.wikimedia.org/T399953) (owner: 10CDobbins) [19:46:42] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054816 (10CDobbins) [19:46:55] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11054817 (10CDobbins) 05Open→03Resolved p:05Triage→03Medium a:05joanna_borun→03CDobbins [19:55:00] 06SRE, 10DNS, 06Traffic: Verify wikimediafoundation.org for Visual Studio Marketplace. - https://phabricator.wikimedia.org/T400089#11054828 (10BCornwall) 05In progress→03Resolved Setting as resolved. Please re-open if it hasn't worked or if anything else is needed. Thanks! 
[19:59:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T400854)', diff saved to https://phabricator.wikimedia.org/P80470 and previous config saved to /var/cache/conftool/dbconfig/20250801-195917-ladsgroup.json [19:59:21] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [19:59:33] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2223.codfw.wmnet with reason: Maintenance [19:59:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80471 and previous config saved to /var/cache/conftool/dbconfig/20250801-195940-ladsgroup.json [19:59:52] (03PS1) 10Ladsgroup: mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) [20:02:14] 06SRE, 10MediaWiki-extensions-QuickInstantCommons, 10MediaWiki-File-management, 06Traffic: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11054847 (10Bawolff) InstantCommons does work in a way that is pretty easy to abus... 
[20:03:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80472 and previous config saved to /var/cache/conftool/dbconfig/20250801-200317-ladsgroup.json [20:06:21] (03PS2) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) [20:06:40] (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [20:14:17] (03PS1) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) [20:14:40] (03Abandoned) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175162 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [20:14:59] (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [20:18:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80473 and previous config saved to /var/cache/conftool/dbconfig/20250801-201825-ladsgroup.json [20:25:04] (03PS2) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) [20:25:47] (03CR) 10CI reject: [V:04-1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [20:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:33:13] (03PS1) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) [20:33:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P80474 and previous config saved to /var/cache/conftool/dbconfig/20250801-203332-ladsgroup.json [20:35:24] (03CR) 10CI reject: [V:04-1] mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [20:40:25] RESOLVED: SystemdUnitFailed: mw-experimental-mediawiki-image-update.service on wikikube-worker2100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-6d8d7547b7-wg45t - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate [20:48:13] (03PS2) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) [20:48:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T400854)', diff saved to https://phabricator.wikimedia.org/P80475 and previous config saved to /var/cache/conftool/dbconfig/20250801-204840-ladsgroup.json [20:48:44] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:48:45] (03CR) 
10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [20:48:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2228.codfw.wmnet with reason: Maintenance [20:49:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80476 and previous config saved to /var/cache/conftool/dbconfig/20250801-204903-ladsgroup.json [20:50:57] (03CR) 10CI reject: [V:04-1] mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [20:52:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80477 and previous config saved to /var/cache/conftool/dbconfig/20250801-205239-ladsgroup.json [20:53:34] (03PS3) 10CDobbins: admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) [21:05:01] (03PS3) 10Ladsgroup: mariadb: Add CI rule to compare tables catalog and filtered_tables.txt [puppet] - 10https://gerrit.wikimedia.org/r/1175171 (https://phabricator.wikimedia.org/T398946) [21:07:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80478 and previous config saved to /var/cache/conftool/dbconfig/20250801-210746-ladsgroup.json [21:22:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P80479 and previous config saved to /var/cache/conftool/dbconfig/20250801-212254-ladsgroup.json [21:32:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[21:32:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [21:38:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T400854)', diff saved to https://phabricator.wikimedia.org/P80480 and previous config saved to /var/cache/conftool/dbconfig/20250801-213802-ladsgroup.json [21:38:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [22:45:29] (03CR) 10BCornwall: [C:03+1] admin: add SD0001 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1175168 (https://phabricator.wikimedia.org/T400405) (owner: 10CDobbins) [23:09:29] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:08] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:22:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:19] (03PS1) 10TrainBranchBot: Branch 
commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182 [23:38:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182 (owner: 10TrainBranchBot) [23:51:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1175182 (owner: 10TrainBranchBot) [23:57:30] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714/ using stat1009.eqiad.wmnet)