[00:10:11] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:25:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616887 (10phaultfinder) [00:32:26] (03PS4) 10Sbisson: Enable CX unified dashboard on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) [00:37:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [00:38:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125654 [00:38:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125654 (owner: 10TrainBranchBot) [00:39:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616891 (10phaultfinder) [00:42:50] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:47:50] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:49:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1125654 (owner: 10TrainBranchBot) [00:53:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:07:19] PROBLEM - Disk space on ms-be2069 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdj1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2069&var-datasource=codfw+prometheus/ops [01:08:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125655 [01:08:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125655 (owner: 10TrainBranchBot) [01:10:49] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:15:49] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:25:41] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:26:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1125655 (owner: 10TrainBranchBot) [01:30:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:40:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616919 (10phaultfinder) [01:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:50:40] 
RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [01:59:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616921 (10phaultfinder) [02:11:11] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:14:49] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:24:49] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:30:34] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10616926 (10BCornwall) Sadly, I have also been unable to get ahold of Thomas. [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:49] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:55:49] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:04:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616934 (10phaultfinder) [03:05:40] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:15:40] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:54:49] FIRING: SLOMetricAbsent: - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:59:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:04:49] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:17:45] RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10616960 (10phaultfinder) [04:44:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:49:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:05:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:09:36] FIRING: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:10:15] RESOLVED: PHPFPMTooBusy: Not 
enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:17:25] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:47:43] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:59:36] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:09:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:25] RESOLVED: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:14:36] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:16:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125588 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [06:16:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125588 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [06:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10617001 (10phaultfinder) [06:34:36] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:44:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:44:33] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:49:36] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:01:56] (03PS1) 10Marostegui: db1250: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1125804 (https://phabricator.wikimedia.org/T388024) [07:10:45] (03CR) 10Marostegui: [C:03+2] db1250: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1125804 (https://phabricator.wikimedia.org/T388024) (owner: 
10Marostegui) [07:14:36] FIRING: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:17:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [07:19:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [07:19:36] RESOLVED: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:22:18] (03PS1) 10Marostegui: mariadb: Promote db1250 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1125806 (https://phabricator.wikimedia.org/T388024) [07:23:04] (03CR) 10Marostegui: "[07:22:50] marostegui@cumin1002:~$ host 10.64.0.113" [puppet] - 10https://gerrit.wikimedia.org/r/1125806 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [07:23:46] (03CR) 10Marostegui: "This host was tested on haproxies past week and it all looked good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125806 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [07:33:11] (03PS1) 10Muehlenhoff: Remove access for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/1125929 [07:34:44] (03CR) 10Muehlenhoff: [C:03+2] Remove access for jdcc [puppet] - 10https://gerrit.wikimedia.org/r/1125929 (owner: 10Muehlenhoff) [07:37:44] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jdcc-berkman out of all services on: 961 hosts [07:38:26] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jdcc-berkman out of all services on: 1284 hosts [07:40:51] (03PS1) 10Aklapper: phabricator weekly changes email: Trivial string changes [puppet] - 10https://gerrit.wikimedia.org/r/1125935 [07:41:56] (03PS2) 10Máté Szabó: Remove unused $wgSecurePollGPGCommand setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441) [07:42:30] (03Abandoned) 10Kosta Harlan: Remove unused $wgSecurePollGPGCommand setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441) (owner: 10Máté Szabó) [07:43:02] (03CR) 10Aklapper: "Trivial string changes only if I get hit by bus and someone else wants to understand the bigger picture (I can't +2 myself)" [puppet] - 10https://gerrit.wikimedia.org/r/1125935 (owner: 10Aklapper) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250309T0900) [08:00:05] Amir1, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T0800). [08:00:05] DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:08] I can deploy [08:00:27] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10617146 (10MoritzMuehlenhoff) 05Open→03Resolved Stephen LaPorte was/is the point of contact for that project and he confirmed that the work is completed. As such, the produ... [08:02:33] DreamRimmer: I can start deployment, if you're around? [08:03:40] Well, this one is clear-cut, I can just go ahead. [08:04:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125588 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [08:05:04] (03Merged) 10jenkins-bot: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125588 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [08:05:47] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1125588|Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage (T388301)]] [08:05:50] T388301: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage - https://phabricator.wikimedia.org/T388301 [08:07:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2232].codfw.wmnet,db[1164,1217,1250].eqiad.wmnet with reason: Primary switchover m1 T388024 [08:07:10] T388024: Switch m1 master db1164 -> db1250 - https://phabricator.wikimedia.org/T388024 [08:07:47] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1250 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1125806 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [08:09:27] !log Failover m1 from db1164 to db1250 - T388024 [08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:03] (03PS1) 10Marostegui: mariadb: 
Change m1 backups host [puppet] - 10https://gerrit.wikimedia.org/r/1125939 (https://phabricator.wikimedia.org/T388024) [08:14:34] (03CR) 10Marostegui: "jcrespo: db1250 is now the new master." [puppet] - 10https://gerrit.wikimedia.org/r/1125939 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [08:15:02] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10617183 (10MoritzMuehlenhoff) [08:17:05] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1125942 [08:17:30] (03CR) 10Marostegui: [C:03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1125942 (owner: 10Marostegui) [08:18:39] !log awight@deploy2002 awight, dreamrimmer: Backport for [[gerrit:1125588|Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage (T388301)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:18:42] T388301: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage - https://phabricator.wikimedia.org/T388301 [08:19:21] !log awight@deploy2002 awight, dreamrimmer: Continuing with sync [08:20:59] (03PS1) 10Filippo Giunchedi: pontoon: add option to verify the bootstrap worked [puppet] - 10https://gerrit.wikimedia.org/r/1125943 [08:23:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10617222 (10MoritzMuehlenhoff) 05Stalled→03Open [08:25:48] (03PS1) 10Muehlenhoff: Add benbuchenenau to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1125945 (https://phabricator.wikimedia.org/T386904) [08:27:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [08:27:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - 
https://phabricator.wikimedia.org/T382507#10617230 (10ops-monitoring-bot) Draining ganeti1028.eqiad.wmnet of running VMs [08:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 4.167% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:29:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [08:29:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:29:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:29:56] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125588|Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage (T388301)]] (duration: 24m 08s) [08:29:59] T388301: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage - https://phabricator.wikimedia.org/T388301 [08:30:20] (03CR) 10Volans: [C:03+2] cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat) [08:30:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:30:26] (03PS7) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [08:30:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for 
draining ganeti node ganeti1028.eqiad.wmnet [08:30:56] (03CR) 10Filippo Giunchedi: [C:04-1] "See inline" [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [08:30:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:31:36] (03CR) 10Filippo Giunchedi: [C:03+1] opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse) [08:31:52] !log UTC morning backports are done [08:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:07] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add option to verify the bootstrap worked [puppet] - 10https://gerrit.wikimedia.org/r/1125943 (owner: 10Filippo Giunchedi) [08:33:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10617247 (10ops-monitoring-bot) Draining ganeti1028.eqiad.wmnet of running VMs [08:33:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:33:57] (03PS1) 10Muehlenhoff: Switch ganeti1028 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1125947 [08:34:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 13.89% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:34:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:35:48] (03CR) 10Vgutierrez: [C:03+1] haproxy/icinga: Remove RSA from auth algorithms [puppet] - 10https://gerrit.wikimedia.org/r/1100192 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) 
[08:36:23] (03CR) 10Vgutierrez: site,hiera: Reimage lvs6003 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1125472 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [08:37:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1 T387953', diff saved to https://phabricator.wikimedia.org/P74164 and previous config saved to /var/cache/conftool/dbconfig/20250310-083746-marostegui.json [08:37:50] T387953: Migrate pc1 to MariaDB 10.11 - https://phabricator.wikimedia.org/T387953 [08:39:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:39:36] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:53] (03PS1) 10Marostegui: pc1011: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1125949 (https://phabricator.wikimedia.org/T387953) [08:40:14] (03PS1) 10Elukey: services: Double memory for the kartotherian-main container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125950 (https://phabricator.wikimedia.org/T386926) [08:40:27] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125945 (https://phabricator.wikimedia.org/T386904) (owner: 10Muehlenhoff) [08:40:54] (03CR) 10Muehlenhoff: [C:03+2] Add benbuchenenau to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1125945 (https://phabricator.wikimedia.org/T386904) (owner: 10Muehlenhoff) [08:42:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Migration to 10.11 [08:43:33] (03PS1) 10Volans: interactive: notify when waiting for input 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 [08:43:33] (03PS1) 10Volans: tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 [08:44:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10617318 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Ben.buchenau I've just merged a patch to enable your access, it wi... [08:45:04] (03CR) 10Marostegui: [C:03+2] pc1011: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1125949 (https://phabricator.wikimedia.org/T387953) (owner: 10Marostegui) [08:46:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10617322 (10Ben.buchenau) Perfect, thanks very much! [08:46:19] (03CR) 10Volans: "This is a proposal on exposing a way to notify people running interactive commands. LMK what do you think and if it makes any sense." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [08:46:41] (03Merged) 10jenkins-bot: cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat) [08:47:17] (03PS1) 10Jelto: gitlab: move restore to a later schedule [puppet] - 10https://gerrit.wikimedia.org/r/1125959 (https://phabricator.wikimedia.org/T388308) [08:48:13] (03CR) 10Brouberol: [C:03+1] "Ship it!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125424 (https://phabricator.wikimedia.org/T380624) (owner: 10Stevemunene) [08:48:15] (03CR) 10CI reject: [V:04-1] tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 (owner: 10Volans) [08:48:26] (03CR) 10CI reject: [V:04-1] interactive: notify when waiting for input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [08:49:15] (03PS1) 10Marostegui: pc2011: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1125960 (https://phabricator.wikimedia.org/T387953) [08:49:58] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5041/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125959 (https://phabricator.wikimedia.org/T388308) (owner: 10Jelto) [08:52:10] (03CR) 10Volans: "I'm checking the CI failure as:" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [08:57:28] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10617350 (10MoritzMuehlenhoff) >>! In T388186#10614269, @Dwisehaupt wrote: > @MoritzMuehlenhoff Thanks for the info. I'll have... 
[08:57:51] (03PS8) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [08:59:03] (03CR) 10Marostegui: [C:03+2] pc2011: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1125960 (https://phabricator.wikimedia.org/T387953) (owner: 10Marostegui) [09:01:55] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116828 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [09:03:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:05:05] (03CR) 10Jcrespo: [C:03+2] mariadb: Change m1 backups host [puppet] - 10https://gerrit.wikimedia.org/r/1125939 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [09:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1 T387953', diff saved to https://phabricator.wikimedia.org/P74166 and previous config saved to /var/cache/conftool/dbconfig/20250310-090600-marostegui.json [09:06:05] T387953: Migrate pc1 to MariaDB 10.11 - https://phabricator.wikimedia.org/T387953 [09:07:35] (03CR) 10Arnaudb: [C:03+1] gitlab: move restore to a later schedule [puppet] - 10https://gerrit.wikimedia.org/r/1125959 (https://phabricator.wikimedia.org/T388308) (owner: 10Jelto) [09:13:19] (03PS2) 10Jelto: Remove profile::kubernetes::* from role::ci [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) [09:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 25% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:19:24] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: move restore to a later schedule [puppet] - 10https://gerrit.wikimedia.org/r/1125959 (https://phabricator.wikimedia.org/T388308) (owner: 10Jelto) [09:20:50] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [09:21:04] jelto: you merged my change right? [09:21:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 9.722% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:21:50] marostegui: no, the diff included just my change [09:22:01] jelto: Ah right! 
Doing it now thanks [09:22:10] great :) [09:22:43] RESOLVED: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/migration at codfw: 22.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:23:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1164.eqiad.wmnet with reason: Reboot [09:23:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1164.eqiad.wmnet [09:28:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/migration at codfw: 22.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:28:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1164.eqiad.wmnet [09:28:56] (03PS9) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [09:30:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10617481 (10phaultfinder) [09:30:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in 
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:32:04] (03CR) 10Ayounsi: [C:03+2] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [09:32:24] (03CR) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [09:32:29] (03CR) 10Federico Ceratto: [C:03+1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [09:32:32] (03CR) 10Federico Ceratto: [C:03+2] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [09:33:12] !log installing exim4 bugfix updates from Bookworm point release [09:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:18] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:35:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373 (10MatthewVernon) 03NEW [09:37:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:37:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk 
(sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373#10617526 (10MatthewVernon) p:05Triage→03High [09:39:47] !log run puppetserver.delete() for relforge100[567] and elastic110[456] - pending certificate requests since weeks ago, DSE confirmed those hosts are not in prod/used. [09:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] RECOVERY - Disk space on ms-be2069 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2069&var-datasource=codfw+prometheus/ops [09:49:33] (03CR) 10FNegri: [C:03+1] "LGTM. For the record this was tested last Friday in both codfw and eqiad and seems to be working correctly." [puppet] - 10https://gerrit.wikimedia.org/r/1125499 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [09:53:20] (03PS1) 10Marostegui: mariadb: Move db1164 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/1125973 (https://phabricator.wikimedia.org/T388366) [09:54:11] (03CR) 10Ayounsi: "The location/row is a bit blury, in most of the cases it means an actual physical row of racks. But it's more and more meaning an "availab" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:54:18] (03CR) 10Btullis: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123672 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [09:54:37] (03CR) 10Btullis: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [09:55:42] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125413 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [09:56:07] (03PS2) 10Volans: docs: removed deprecated call to sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 [09:56:07] (03PS1) 10Volans: puppetdb: add support for structured facts [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) [09:57:40] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::main@codfw [09:57:48] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123672 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [09:59:20] PROBLEM - Etcd cluster health on dse-k8s-etcd1002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1000) [10:00:07] (03CR) 10Volans: "This is a proposal to add support for structured facts in cumin's puppetdb backend. 
It's using the PuppetDB's dot notation that is declare" [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [10:01:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:02:10] (03PS10) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [10:03:28] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:04:08] (03PS1) 10Dreamrimmer: Disallow editing modules for non-autoconfirmed users on the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125978 (https://phabricator.wikimedia.org/T388301) [10:04:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:04:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::main@codfw [10:04:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1164.eqiad.wmnet with reason: Reboot [10:05:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: Reboot [10:05:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125978 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [10:05:59] (03CR) 10Federico Ceratto: "Thanks for the feedback! I'll close some of the points after updating the code." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [10:06:25] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1164 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/1125973 (https://phabricator.wikimedia.org/T388366) (owner: 10Marostegui) [10:06:27] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) [10:06:50] dbproxy irc alerts are to be expected [10:07:03] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::main@eqiad [10:07:13] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-main@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123673 (https://phabricator.wikimedia.org/T387315) (owner: 10Vgutierrez) [10:07:46] marostegui: if my CR showed up on your merge session please go ahead [10:07:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Cloning [10:07:55] vgutierrez: it didn't [10:08:01] ack, merging now [10:12:44] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [10:15:20] !log test moving k8s-mlstaging from prometheus2005 to prometheus2007 - T383232 [10:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:23] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [10:15:44] !log vgutierrez@cumin1002 START - Cookbook 
sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:16:44] (03CR) 10Tiziano Fogli: [C:03+2] blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [10:16:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [10:16:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::main@eqiad [10:17:06] (03CR) 10Tiziano Fogli: [C:03+2] cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/migration at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:22:49] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd1002.eqiad.wmnet with OS bookworm [10:24:27] (03CR) 10Hnowlan: [C:03+1] "lgtm, just a query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [10:25:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/migration at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=migration - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:27:02] (03CR) 10Jgiannelos: pcs: Add missing rules for content pregeneration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [10:27:23] (03CR) 10Ladsgroup: [C:04-2] Add config needed to re-architecture mainstash away from x2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:27:41] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10617696 (10phaultfinder) [10:30:11] (03CR) 10Hnowlan: "sgtm! 
It'd be nice if we could pursue some optimisations in future as kartotherian is getting up there as far as usage goes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125950 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [10:31:19] (03CR) 10Elukey: "Yes definitely, I am chatting with Content Transform about it, hopefully there will be some prioritization/time allocated in the near futu" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125950 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [10:31:26] (03CR) 10Federico Ceratto: Ask for confirmation before depooling last host in a group (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:32:41] FIRING: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:32:42] (03CR) 10Ayounsi: [C:03+1] Enable BGP Multipath for PyBal group [homer/public] - 10https://gerrit.wikimedia.org/r/1125471 (https://phabricator.wikimedia.org/T332027) (owner: 10Cathal Mooney) [10:32:48] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd1002.eqiad.wmnet with reason: host reimage [10:33:02] (03CR) 10Ayounsi: [C:03+1] Add new Juniper leaf switches eqiad E8/F8 to IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1125488 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [10:33:44] !log filippo@puppetserver1001 conftool action : set/weight=10; selector: name=prometheus2007.codfw.wmnet [10:36:31] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd1002.eqiad.wmnet with reason: host reimage [10:36:33] !log filippo@puppetserver1001 conftool action : set/pooled=no; selector: 
name=prometheus2007.codfw.wmnet [10:36:42] (03CR) 10DCausse: cirrussearch: Add alerts for thread pool rejections (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [10:37:10] !log filippo@puppetserver1001 conftool action : set/pooled=no; selector: name=prometheus2005.codfw.wmnet [10:37:25] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::scholarly@codfw [10:37:32] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-scholarly@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123676 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [10:39:29] (03CR) 10Marostegui: Ask for confirmation before depooling last host in a group (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [10:42:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:42:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:43:56] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [10:43:56] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::scholarly@codfw [10:44:01] (03CR) 10Ladsgroup: [C:03+2] Remove abstract dumps infrastructure [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [10:44:26] (03Merged) 10jenkins-bot: Remove abstract dumps 
infrastructure [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [10:45:15] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::scholarly@eqiad [10:45:26] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [10:45:31] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP on wdqs-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) [10:45:47] !log ladsgroup@deploy2002 Started deploy [dumps/dumps@afcb740]: Removing Yahoo! abstract dumps code (T382069) [10:45:50] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069 [10:45:54] !log ladsgroup@deploy2002 Finished deploy [dumps/dumps@afcb740]: Removing Yahoo! abstract dumps code (T382069) (duration: 00m 07s) [10:47:31] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP on wdqs-scholarly@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123677 (https://phabricator.wikimedia.org/T387316) (owner: 10Vgutierrez) [10:54:00] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10617808 (10MoritzMuehlenhoff) [10:55:03] !log installing qemu security updates [10:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:10] (03PS1) 10Vgutierrez: varnish: X-Requestctl is now being handled by HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1125986 [10:56:03] (03CR) 10Marostegui: [C:03+1] dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:57:17] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd1002.eqiad.wmnet with OS 
bookworm [10:57:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10617818 (10phaultfinder) [10:59:13] (03PS3) 10Jcrespo: dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) [10:59:16] (03PS1) 10Brouberol: mediawiki-dumps-legacy: deploy the cronjob template in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125988 (https://phabricator.wikimedia.org/T388378) [10:59:23] (03PS1) 10Brouberol: airflow-analyics-test: grant permissions to read Jobs in the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125989 (https://phabricator.wikimedia.org/T388378) [11:00:48] (03CR) 10Vgutierrez: "given that X-Requestctl is set to ` ` (a white space) in haproxy this could potentially mess with analytics validation to include requestc" [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [11:00:51] (03CR) 10Clément Goubert: [C:03+2] periodic_jobs: Remove last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125467 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [11:01:44] (03PS1) 10Filippo Giunchedi: prometheus: enable mod_proxy to use ssl [puppet] - 10https://gerrit.wikimedia.org/r/1125990 (https://phabricator.wikimedia.org/T383232) [11:02:19] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:03:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:03:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::scholarly@eqiad [11:05:05] (03CR) 10Elukey: [C:03+2] services: Double memory for the 
kartotherian-main container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125950 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [11:06:17] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [11:06:35] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [11:07:39] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [11:07:44] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:08:04] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [11:08:27] ^ marostegui expected maintenance, right? [11:08:38] yeah [11:08:42] cool [11:08:57] I gave them 1 hour downtime, and it expired [11:09:01] Giving 2 more [11:09:03] ah, I just saw it [11:09:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: Cloning [11:11:33] (03CR) 10Clément Goubert: [C:03+2] periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [11:11:41] (03PS4) 10Clément Goubert: periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) [11:11:46] (03CR) 10Jgiannelos: "My only unknown at this point is how we handle page deletes." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [11:13:08] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: enable mod_proxy to use ssl [puppet] - 10https://gerrit.wikimedia.org/r/1125990 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:13:10] (03CR) 10Federico Ceratto: Ask for confirmation before depooling last host in a group (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [11:13:25] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: enable mod_proxy to use ssl [puppet] - 10https://gerrit.wikimedia.org/r/1125990 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:14:10] (03PS1) 10Marostegui: mariadb: Set up ms1 [puppet] - 10https://gerrit.wikimedia.org/r/1125993 (https://phabricator.wikimedia.org/T387332) [11:14:32] (03PS1) 10Elukey: admin_ng: bump memory resourcequota for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125994 (https://phabricator.wikimedia.org/T386926) [11:15:39] (03CR) 10Clément Goubert: [C:03+2] periodic_jobs: Cleanup last wikitech jobs [puppet] - 10https://gerrit.wikimedia.org/r/1125468 (https://phabricator.wikimedia.org/T388249) (owner: 10Clément Goubert) [11:17:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Push ms1 config T387332', diff saved to https://phabricator.wikimedia.org/P74169 and previous config saved to /var/cache/conftool/dbconfig/20250310-111742-marostegui.json [11:17:46] Amir1: ^ [11:17:47] T387332: Set up ms1, ms2, ms3 db clusters - https://phabricator.wikimedia.org/T387332 [11:17:57] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "This looks slightly confusing in isolation but I believe T388301#10617662 describes the situation correctly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125978 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [11:18:02] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:02] marostegui: Thanks [11:18:09] (03CR) 10Marostegui: [C:03+2] mariadb: Set up ms1 [puppet] - 10https://gerrit.wikimedia.org/r/1125993 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:19:17] (03CR) 10Hnowlan: [C:03+1] admin_ng: bump memory resourcequota for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125994 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [11:19:42] marostegui: it looks good but I think it needs to have a weight like PC [11:19:58] Amir1: Fixing [11:20:18] (03PS1) 10Marostegui: site: Add a note about ms1 [puppet] - 10https://gerrit.wikimedia.org/r/1125995 [11:20:31] Thank you <3 [11:20:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add weight to ms1 hosts T387332', diff saved to https://phabricator.wikimedia.org/P74170 and previous config saved to /var/cache/conftool/dbconfig/20250310-112046-marostegui.json [11:20:50] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: deploy the cronjob template in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125988 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [11:20:51] Amir1: done [11:21:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:21:20] Thanks! 
[11:21:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set ms3 weights to 1 instead of 100', diff saved to https://phabricator.wikimedia.org/P74171 and previous config saved to /var/cache/conftool/dbconfig/20250310-112140-marostegui.json [11:21:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:22:01] (03PS2) 10Marostegui: site: Add a note about ms1 [puppet] - 10https://gerrit.wikimedia.org/r/1125995 (https://phabricator.wikimedia.org/T387332) [11:22:15] (03CR) 10Clément Goubert: [C:03+2] mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert) [11:22:31] (03PS1) 10Filippo Giunchedi: prometheus: remove prometheus2005 from k8s-mlstaging [puppet] - 10https://gerrit.wikimedia.org/r/1125997 (https://phabricator.wikimedia.org/T383232) [11:23:39] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: remove prometheus2005 from k8s-mlstaging [puppet] - 10https://gerrit.wikimedia.org/r/1125997 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:23:55] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: remove prometheus2005 from k8s-mlstaging [puppet] - 10https://gerrit.wikimedia.org/r/1125997 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10617954 (10phaultfinder) [11:24:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2144.codfw.wmnet,db[1151-1152].eqiad.wmnet with reason: Setting up [11:25:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Setting up [11:25:03] (03PS1) 10Ladsgroup: Set thumbnail steps to 1% of production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125998 (https://phabricator.wikimedia.org/T360589) [11:25:20] (03CR) 10Elukey: [C:03+2] admin_ng: bump memory resourcequota for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125994 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [11:25:53] (03CR) 10CI reject: [V:04-1] Set thumbnail steps to 1% of production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125998 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:26:19] (03CR) 10Marostegui: [C:03+2] site: Add a note about ms1 [puppet] - 10https://gerrit.wikimedia.org/r/1125995 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [11:27:42] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [11:28:28] (03PS2) 10Ladsgroup: Set thumbnail steps to 1% of production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125998 (https://phabricator.wikimedia.org/T360589) [11:28:59] vgutierrez: Emperor ^ this shouldn't really impact much but will see [11:29:03] jouncebot: nowandnext [11:29:03] No deployments scheduled for the next 2 hour(s) and 30 minute(s) [11:29:03] In 2 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1400) [11:29:10] Amir1: ack [11:29:34] (03PS2) 10Effie Mouzeli: mw-(api-int|parsoid|jobrunner): increase PHP8.1 traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) [11:29:40] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [11:29:55] Amir1: ack, 🍿 [11:29:57] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[11:30:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125998 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:30:28] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [11:31:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:31:12] (03Merged) 10jenkins-bot: Set thumbnail steps to 1% of production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125998 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:31:24] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [11:31:29] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1125998|Set thumbnail steps to 1% of production (T360589)]] [11:31:32] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:31:50] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [11:31:59] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [11:32:26] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [11:32:39] (03PS11) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) [11:34:07] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1125998|Set thumbnail steps to 1% of production (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:34:16] (03CR) 
10Jgiannelos: "@hnowlan@wikimedia.org I forgot one case from the old rules that also applies now. The one for page summary rerendering when page properti" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [11:34:36] (03CR) 10Marostegui: Ask for confirmation before depooling last host in a group (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [11:35:07] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:40:59] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10618084 (10MatthewVernon) I'm sorry, that must be annoying. I'm afraid from a Swift perspective I don't have anything I could... [11:41:37] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10618085 (10MatthewVernon) [11:41:57] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125998|Set thumbnail steps to 1% of production (T360589)]] (duration: 10m 27s) [11:42:01] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:42:16] 😅 [11:42:21] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:42:59] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:43:08] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[11:45:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10618109 (10phaultfinder) [11:46:42] (03PS1) 10Muehlenhoff: os-reports: Drop 1873 port [puppet] - 10https://gerrit.wikimedia.org/r/1126007 [11:46:54] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:47:01] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:47:33] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:47:59] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:48:02] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10618115 (10Peachey88) > Upload and try to publish a larger file >40 MiB What's the total size of the file? [11:48:51] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:49:11] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:49:36] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:50:03] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:50:43] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:51:21] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:51:42] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:52:01] (03CR) 10Fabfur: Fix previous commit (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [11:52:18] 06SRE, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10618138 (10Vgutierrez) ping? it's also worth mentioning here that lists.wm.o right now is just offering RSA certificates and it should be migrated to a dual stack setup in pre... 
[11:52:30] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:53:08] (03PS1) 10Muehlenhoff: Add logstash-access to list of groups to drop on offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1126008 (https://phabricator.wikimedia.org/T376790) [11:53:25] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:55:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [11:55:30] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:55:36] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:55:44] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:55:51] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:56:01] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:56:10] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:56:15] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:57:55] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:58:04] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:58:11] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:58:13] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:58:48] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:58:59] (03PS1) 10Clément Goubert: mw-cron: Use php81 base image [puppet] - 10https://gerrit.wikimedia.org/r/1126011 (https://phabricator.wikimedia.org/T387916) [11:59:00] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[11:59:02] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:59:27] !log installing iputils bugfixes updates [11:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10618177 (10phaultfinder) [11:59:58] (03PS1) 10Muehlenhoff: Update record for aude to track status as contractor [puppet] - 10https://gerrit.wikimedia.org/r/1126013 (https://phabricator.wikimedia.org/T388034) [12:00:05] effe: MediaWiki infrastructure (UTC mid-day #2) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1200). Please do the needful. [12:00:15] (03CR) 10Slyngshede: [C:03+1] "Makes sense to me." [puppet] - 10https://gerrit.wikimedia.org/r/1126008 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [12:00:36] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:00:38] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:01:21] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:01:22] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:01:46] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:02:23] (03CR) 10Muehlenhoff: [C:03+2] Update record for aude to track status as contractor [puppet] - 10https://gerrit.wikimedia.org/r/1126013 (https://phabricator.wikimedia.org/T388034) (owner: 10Muehlenhoff) [12:02:24] (03CR) 10JMeybohm: [C:03+2] admin_ng: Update dependencies between releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:03:28] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10618192 (10MoritzMuehlenhoff) >>! In T388034#10614966, @Seddon wrote: > Hey, yes @aude is currently reporting... [12:03:37] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10618193 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:03:48] (03PS2) 10Clément Goubert: mw-cron: Use php81 base image [puppet] - 10https://gerrit.wikimedia.org/r/1126011 (https://phabricator.wikimedia.org/T387916) [12:04:16] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10618195 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [12:05:46] (03CR) 10Effie Mouzeli: [C:03+2] mw-(api-int|parsoid|jobrunner): increase PHP8.1 traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [12:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:08:11] (03Merged) 10jenkins-bot: admin_ng: Update dependencies between 
releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:11:24] (03Merged) 10jenkins-bot: mw-(api-int|parsoid|jobrunner): increase PHP8.1 traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125418 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [12:13:10] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10Thumbor, 10UploadWizard: "Could not acquire lock" error when publishing larger files - https://phabricator.wikimedia.org/T386640#10618240 (10jijiki) >>! In T386640#10615703, @PantheraLeo1359531 wrote: > {F58691771} > > Still happening, no ma... [12:13:22] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:13:50] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:14:04] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:14:22] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:14:42] (03PS1) 10Tiziano Fogli: cloudgw/icmp check/ip6: disabling [puppet] - 10https://gerrit.wikimedia.org/r/1126023 [12:15:20] (03PS2) 10Tiziano Fogli: cloudgw/icmp check/ip6: disabling [puppet] - 10https://gerrit.wikimedia.org/r/1126023 (https://phabricator.wikimedia.org/T388379) [12:15:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10618253 (10phaultfinder) [12:15:50] (03CR) 10Muehlenhoff: [C:03+2] Add logstash-access to list of groups to drop on offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1126008 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [12:16:07] 07sre-alert-triage, 06serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T388398 (10LSobanski) 03NEW [12:16:26] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
ganeti1028.eqiad.wmnet with reason: remove from cluster for reimage [12:16:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10618267 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9fc8bc6c-fcab-42ee-95e1-ca8c3f853132) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [12:17:10] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1126023 (https://phabricator.wikimedia.org/T388379) (owner: 10Tiziano Fogli) [12:17:10] (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1126007 (owner: 10Muehlenhoff) [12:17:29] (03CR) 10Tiziano Fogli: [C:03+2] cloudgw/icmp check/ip6: disabling [puppet] - 10https://gerrit.wikimedia.org/r/1126023 (https://phabricator.wikimedia.org/T388379) (owner: 10Tiziano Fogli) [12:17:54] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [12:18:00] (03CR) 10Muehlenhoff: [C:03+2] os-reports: Drop 1873 port [puppet] - 10https://gerrit.wikimedia.org/r/1126007 (owner: 10Muehlenhoff) [12:18:13] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [12:18:14] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [12:18:27] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [12:20:30] (03PS3) 10Jelto: deployment_server: add puppetdb rsync to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) [12:21:33] (03CR) 10Jelto: "I removed port 1873 because of I8c4f7e45f9f32d52e845ef21058175f64c2bd233" [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [12:21:42] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:21:52] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:21:54] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:22:14] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:22:27] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:22:54] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:23:26] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:23:30] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:23:41] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:23:43] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:23:55] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:23:57] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:24:19] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:24:20] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:24:51] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:24:53] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:24:59] PROBLEM - MariaDB read only ms1 #page on db1152 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.6.17-MariaDB-log, Uptime 24027388s, event_scheduler: True, 4176.30 QPS, connection latency: 0.025142s, query latency: 0.000701s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:25:09] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:25:09] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:25:16] !incidents [12:25:16] 5717 (UNACKED) db1152 (paged)/MariaDB read only ms1 (paged) [12:25:17] Amir1: ^ [12:25:18] ms1 is not in production right? 
[12:25:21] !ack 5717 [12:25:21] 5717 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [12:25:22] !ack 5717 [12:25:23] 5717 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [12:25:23] he is in a meeting with me [12:25:27] oh, it is [12:25:29] Don't worry I will handle this [12:25:35] marostegui: I win :D what can I do to help? [12:25:38] It shouldn't be expected true [12:25:41] volans: Nah, I will do it [12:25:41] we haven't pushed the change yet [12:25:44] so monitoring is bad? [12:25:44] k [12:25:53] thx [12:26:14] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1028 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1125947 (owner: 10Muehlenhoff) [12:26:22] probably puppet needs a patch then [12:26:26] I have no idea why it is expected true, but that's bad [12:26:37] volans: no user impact, right? [12:26:39] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:26:44] jynus: no [12:26:54] ok, then returning to the meeting [12:26:59] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:27:00] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:27:14] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:29:56] I think I know where it is [12:30:09] the type of host on hiera? 
[12:31:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:31:11] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5042/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [12:31:15] 07sre-alert-triage, 06serviceops: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T388398#10618299 (10JMeybohm) →14Duplicate dup:03T384450 [12:31:25] (03PS1) 10Marostegui: mariadb: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1126028 (https://phabricator.wikimedia.org/T387332) [12:31:28] jynus: ^ [12:31:42] will check it after the meeting, thanks [12:31:46] (03CR) 10Ladsgroup: [C:03+1] mariadb: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1126028 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [12:31:47] about to finish it [12:31:53] that was fast Amir1 ! 
[12:32:06] (03CR) 10Marostegui: [C:03+2] mariadb: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1126028 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [12:32:21] (03PS1) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [12:32:25] FIRING: SystemdUnitFailed: mediawiki_job_mediamoderation-hourlyScan.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:40] marostegui: in the long term, we should do what parsercache does :P [12:33:37] Amir1: I suggested to change the role to parsercache and you said no :-) [12:33:56] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1028.eqiad.wmnet [12:34:00] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [12:34:41] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on gerrit2003.wikimedia.org with reason: testing [12:34:46] (03PS2) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [12:34:59] RECOVERY - MariaDB read only ms1 #page on db1152 is OK: Version 10.6.17-MariaDB-log, Uptime 24027988s, read_only: False, event_scheduler: True, 4928.11 QPS, connection latency: 0.025365s, query latency: 0.000545s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:35:01] -command[check_mariadb_read_only_ms1]=db-check-health --port=3306 --icinga --check_read_only=true --process [12:35:01] +command[check_mariadb_read_only_ms1]=db-check-health --port=3306 --icinga --check_read_only=false --process [12:35:06] Recovery should come soon [12:35:14] There it is [12:35:17] reat [12:35:20] *great 
[12:35:57] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [12:39:25] (03CR) 10Filippo Giunchedi: [C:03+1] Fix previous commit (032 comments) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [12:41:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10618367 (10MoritzMuehlenhoff) [12:43:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1028.eqiad.wmnet [12:43:39] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1028.eqiad.wmnet [12:44:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [12:54:02] (03CR) 10Kamila Součková: [C:03+1] Add pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:55:24] !log imported wmf-laptop 1.0.1 to apt.wikimedia.org [12:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:46] !log filippo@puppetserver1001 conftool action : set/pooled=yes; selector: name=prometheus2007.codfw.wmnet [12:58:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [12:58:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1028.eqiad.wmnet [12:59:45] (03PS3) 10Anzx: mnwwiktionary: add thesaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126033 (https://phabricator.wikimedia.org/T356620) [12:59:56] !log filippo@puppetserver1001 conftool action : set/pooled=no; selector: name=prometheus2006.codfw.wmnet [13:00:04] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126033 (https://phabricator.wikimedia.org/T356620) (owner: 10Anzx) [13:01:46] (03PS1) 10Cathal Mooney: Add cloud IPv6 ranges to Capirca IP block definitions [homer/public] - 10https://gerrit.wikimedia.org/r/1126035 (https://phabricator.wikimedia.org/T379283) [13:02:09] (03CR) 10Jcrespo: [C:03+1] mariadb: Add ms1,ms2 and ms3 [puppet] - 10https://gerrit.wikimedia.org/r/1126028 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [13:03:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS bookworm [13:03:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10618515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bookworm [13:04:37] RESOLVED: SystemdUnitFailed: mediawiki_job_mediamoderation-hourlyScan.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:01] !log test prometheus2007 as the sole host pooled in pybal - T383232 [13:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:04] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [13:07:50] (03PS3) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:09:02] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [13:17:14] (03PS6) 10JMeybohm: Add 
pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [13:17:47] (03PS1) 10Federico Ceratto: Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) [13:19:29] (03CR) 10Federico Ceratto: "A safety check before pooling in a host" [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [13:20:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10618650 (10phaultfinder) [13:22:56] (03PS4) 10Jcrespo: dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) [13:22:56] (03PS1) 10Jcrespo: mariadb: Remove references to tendril & set the section name as 'db_inventory' [puppet] - 10https://gerrit.wikimedia.org/r/1126042 [13:23:33] (03PS4) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:23:34] (03CR) 10CI reject: [V:04-1] mariadb: Remove references to tendril & set the section name as 'db_inventory' [puppet] - 10https://gerrit.wikimedia.org/r/1126042 (owner: 10Jcrespo) [13:24:03] (03CR) 10JMeybohm: [V:03+2 C:03+2] "Thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124393 (owner: 10Volans) [13:24:48] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [13:26:06] (03CR) 10CI reject: [V:04-1] Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [13:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1028.eqiad.wmnet with 
reason: host reimage [13:26:29] (03PS1) 10Federico Ceratto: db1253.yaml, db1254.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126043 [13:28:19] (03PS5) 10Jcrespo: dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) [13:28:19] (03PS2) 10Jcrespo: mariadb: Remove references to tendril & set the section name as 'db_inventory' [puppet] - 10https://gerrit.wikimedia.org/r/1126042 [13:31:13] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10618754 (10MoritzMuehlenhoff) [13:31:26] (03Merged) 10jenkins-bot: sre.k8s: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124393 (owner: 10Volans) [13:32:44] (03PS3) 10Jcrespo: mariadb: Remove references to tendril & set the section name as 'db_inventory' [puppet] - 10https://gerrit.wikimedia.org/r/1126042 [13:33:01] (03PS2) 10Slyngshede: Permissions LDAP group validator [software/bitu] - 10https://gerrit.wikimedia.org/r/1115375 [13:33:11] (03PS5) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:33:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1028.eqiad.wmnet with reason: host reimage [13:34:23] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [13:36:27] (03PS6) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:37:25] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:28] (03PS2) 10Federico Ceratto: Implement Icinga notification check before pooling in a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) [13:37:41] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [13:39:11] (03CR) 10Tiziano Fogli: [C:03+2] blackbox/http: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [13:39:24] (03PS4) 10Tiziano Fogli: blackbox/http: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) [13:39:37] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:16] (03CR) 10Tiziano Fogli: [C:03+2] blackbox/http: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [13:40:40] (03PS4) 10Tiziano Fogli: blackbox/tcp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) [13:41:30] (03CR) 10Tiziano Fogli: [C:03+2] blackbox/tcp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [13:42:04] (03CR) 10Jelto: [C:03+2] Remove profile::kubernetes::* 
from role::ci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: 10Jelto) [13:44:08] (03PS7) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:44:15] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10618786 (10MoritzMuehlenhoff) [13:45:19] (03CR) 10CI reject: [V:04-1] Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 (owner: 10Ayounsi) [13:46:03] (03PS1) 10Gergő Tisza: SUL3: Attach SUL mode to the return URL of local wiki [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126049 (https://phabricator.wikimedia.org/T388067) [13:46:31] (03PS1) 10Gergő Tisza: SpecialCentralAutoLogin: Handle nullable wiki ID [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126050 (https://phabricator.wikimedia.org/T388252) [13:47:03] (03PS1) 10Gergő Tisza: Log and add user IDs that mismatch in the runtime exception [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126051 (https://phabricator.wikimedia.org/T388177) [13:47:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126050 (https://phabricator.wikimedia.org/T388252) (owner: 10Gergő Tisza) [13:47:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126049 (https://phabricator.wikimedia.org/T388067) (owner: 10Gergő Tisza) [13:47:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in 
the [Monday, March 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126051 (https://phabricator.wikimedia.org/T388177) (owner: 10Gergő Tisza) [13:49:20] (03PS8) 10Ayounsi: Add alerting for important BGP sessions status [alerts] - 10https://gerrit.wikimedia.org/r/1126030 [13:49:49] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: deploy the cronjob template in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125988 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:51:25] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: deploy the cronjob template in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125988 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:52:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1028.eqiad.wmnet with OS bookworm [13:52:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10618811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bookworm completed: - ganeti102... 
[13:52:38] (03CR) 10Btullis: [C:03+1] airflow-analyics-test: grant permissions to read Jobs in the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125989 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:53:09] (03CR) 10Brouberol: [C:03+2] airflow-analyics-test: grant permissions to read Jobs in the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125989 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [13:57:59] 06SRE, 10MW-on-K8s, 06serviceops, 07Python3-Porting: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10618827 (10Reedy) [13:58:23] !log installing libpgjava security updates [13:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1400). [14:00:05] Lucas_WMDE, kart_, DreamRimmer, anzx, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] o/ [14:00:15] o/ [14:00:17] here [14:00:35] * TheresNoTime cannot deploy this afternoon [14:00:42] (03CR) 10Slyngshede: [C:03+2] P:firewall absent check_conntrack script. [puppet] - 10https://gerrit.wikimedia.org/r/1087379 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [14:00:47] o/ [14:00:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:00:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:01:10] o/ [14:01:17] ok, I can deploy! [14:01:33] Lucas_WMDE: go ahead. 
3 config + 4 backport patches :) [14:01:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:01:49] tgr_: I’d start with my backport + the config changes and then hand over to you for the SUL3 stuff if that’s alright? [14:01:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:03:05] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Clean up RDF feature flags again [extensions/Wikibase] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:03:42] let’s do the three config changes together, I think they seem harmless and unrelated enough [14:03:58] Lucas_WMDE: sure [14:04:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [14:04:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125978 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [14:04:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126033 (https://phabricator.wikimedia.org/T356620) (owner: 10Anzx) [14:05:05] (03PS1) 10Brouberol: mediawiki-dumps-legacy: fix issues preventing to deploy in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126052 (https://phabricator.wikimedia.org/T388378) [14:05:14] (03Merged) 10jenkins-bot: Enable CX unified dashboard on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [14:05:16] (03Merged) 10jenkins-bot: Disallow editing modules for 
non-autoconfirmed users on the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125978 (https://phabricator.wikimedia.org/T388301) (owner: 10Dreamrimmer) [14:05:18] (03Merged) 10jenkins-bot: mnwwiktionary: add thesaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126033 (https://phabricator.wikimedia.org/T356620) (owner: 10Anzx) [14:05:22] (03PS1) 10Slyngshede: Revert "P:firewall absent check_conntrack script." [puppet] - 10https://gerrit.wikimedia.org/r/1126053 [14:05:39] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1124464|Enable CX unified dashboard on phase 2 wikis (T387820)]], [[gerrit:1125978|Disallow editing modules for non-autoconfirmed users on the English Wikivoyage (T388301)]], [[gerrit:1126033|mnwwiktionary: add thesaurus namespace (T356620)]] [14:05:40] Lucas_WMDE: sounds good! [14:05:45] T387820: Deploy unified dashboard on 10 more wikis (phase 2) - https://phabricator.wikimedia.org/T387820 [14:05:45] T388301: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage - https://phabricator.wikimedia.org/T388301 [14:05:46] T356620: Thesaurus namespace for Mon Wiktionary - https://phabricator.wikimedia.org/T356620 [14:06:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [14:07:31] (03CR) 10CI reject: [V:04-1] Revert "P:firewall absent check_conntrack script." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126053 (owner: 10Slyngshede) [14:07:40] (03Merged) 10jenkins-bot: Clean up RDF feature flags again [extensions/Wikibase] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125408 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:07:54] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: fix issues preventing to deploy in airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126052 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:07:55] (03PS2) 10Slyngshede: Revert "P:firewall absent check_conntrack script." [puppet] - 10https://gerrit.wikimedia.org/r/1126053 [14:08:20] !log lucaswerkmeister-wmde@deploy2002 dreamrimmer, sbisson, anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1124464|Enable CX unified dashboard on phase 2 wikis (T387820)]], [[gerrit:1125978|Disallow editing modules for non-autoconfirmed users on the English Wikivoyage (T388301)]], [[gerrit:1126033|mnwwiktionary: add thesaurus namespace (T356620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:38] kart_, DreamRimmer, anzx: please test :) [14:08:45] checking [14:08:51] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10618921 (10SDunlap) [14:09:39] (03PS1) 10FNegri: aptrepo: fetch toolforge k8s v1.29 packages [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) [14:09:53] Lucas_WMDE: tested. works fine! [14:10:15] yay [14:10:17] Lucas_WMDE: looks good [14:10:19] (03CR) 10Slyngshede: [C:03+2] Revert "P:firewall absent check_conntrack script." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126053 (owner: 10Slyngshede) [14:10:31] looks good [14:10:34] !log lucaswerkmeister-wmde@deploy2002 dreamrimmer, sbisson, anzx, lucaswerkmeister-wmde: Continuing with sync [14:10:38] \o/ thanks folks! [14:14:31] Lucas_WMDE: need to run namespacedupes.php for mnwwiktionary afterwards [14:14:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [14:14:40] ack, thanks for the reminder! [14:14:51] (03PS1) 10Slyngshede: P:firewall Remove conntrack_table_size nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1126056 (https://phabricator.wikimedia.org/T374827) [14:15:26] (scap is still working atm) [14:15:59] (03CR) 10Slyngshede: "I was submitting https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087379 but we actually need to remove the check in Icinga first, th" [puppet] - 10https://gerrit.wikimedia.org/r/1126056 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [14:16:09] (03PS2) 10FNegri: aptrepo: fetch toolforge k8s v1.29 packages [puppet] - 10https://gerrit.wikimedia.org/r/1126054 (https://phabricator.wikimedia.org/T362868) [14:17:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124464|Enable CX unified dashboard on phase 2 wikis (T387820)]], [[gerrit:1125978|Disallow editing modules for non-autoconfirmed users on the English Wikivoyage (T388301)]], [[gerrit:1126033|mnwwiktionary: add thesaurus namespace (T356620)]] (duration: 11m 21s) [14:17:03] (03Abandoned) 10Slyngshede: P:firewall Remove conntrack_table_size nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/1126056 (https://phabricator.wikimedia.org/T374827) (owner: 10Slyngshede) [14:17:06] T387820: Deploy unified dashboard on 10 more wikis (phase 2) - https://phabricator.wikimedia.org/T387820 [14:17:07] T388301: Disallow editing modules for non-confirmed/non-autoconfirmed users on the English Wikivoyage - 
https://phabricator.wikimedia.org/T388301 [14:17:07] T356620: Thesaurus namespace for Mon Wiktionary - https://phabricator.wikimedia.org/T356620 [14:17:20] (03CR) 10Slyngshede: [C:03+2] P:firewall absent conntrack_table_size monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994164 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:17:54] !log lucaswerkmeister-wmde@deploy2002 $ mwscript-k8s --comment=T356620 --follow -- namespaceDupes mnwwiktionary --fix | tee T356620 [14:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:38] (03PS1) 10Slyngshede: Revert^2 "P:firewall absent check_conntrack script." [puppet] - 10https://gerrit.wikimedia.org/r/1126057 [14:18:40] oops, and that file was supposed to go to my home directory, not /srv/mediawiki-staging xD [14:19:44] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10618984 (10MoritzMuehlenhoff) [14:19:48] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1125408|Clean up RDF feature flags again (T384344)]] [14:19:51] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:20:23] Lucas_WMDE: thank you [14:20:28] np :) [14:20:57] (03CR) 10Marostegui: [C:03+1] db1253.yaml, db1254.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126043 (owner: 10Federico Ceratto) [14:21:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1028.eqiad.wmnet to cluster eqiad and group C [14:21:22] (03PS2) 10Stevemunene: Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) [14:21:26] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: fix issues preventing to deploy in airflow-analytics-test [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1126052 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:22:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1028.eqiad.wmnet to cluster eqiad and group C [14:22:23] (03PS1) 10Ssingh: P:durum: reload nginx but don't restart it [puppet] - 10https://gerrit.wikimedia.org/r/1126059 [14:22:28] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1125408|Clean up RDF feature flags again (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:33] testing [14:22:53] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:22:57] lgtm [14:23:09] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5043/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126059 (owner: 10Ssingh) [14:23:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:23:34] (03PS2) 10Elukey: role::maps::{master,replica}: Fix lvs pool config [puppet] - 10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) [14:24:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:24:13] Thanks Lucas_WMDE ! [14:25:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:25:50] (03CR) 10Scott French: [C:03+1] "Thank you! 
This should be sufficient on its own, in contrast to the FPM-based deployments where we also need to update `php.version` in th" [puppet] - 10https://gerrit.wikimedia.org/r/1126011 (https://phabricator.wikimedia.org/T387916) (owner: 10Clément Goubert) [14:26:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:27:21] (03CR) 10Marostegui: Implement Icinga notification check before pooling in a host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126040 (https://phabricator.wikimedia.org/T388384) (owner: 10Federico Ceratto) [14:28:11] (03CR) 10Lucas Werkmeister (WMDE): "ready to deploy at any time now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:29:21] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125408|Clean up RDF feature flags again (T384344)]] (duration: 09m 33s) [14:29:32] tgr_: over to you :) [14:29:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1010* for ban host prior to reimage - bking@cumin2002 - T387904 [14:29:59] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1010* for ban host prior to reimage - bking@cumin2002 - T387904 [14:30:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:31:00] thx [14:31:18] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:32:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126050 
(https://phabricator.wikimedia.org/T388252) (owner: 10Gergő Tisza) [14:32:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:32:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126049 (https://phabricator.wikimedia.org/T388067) (owner: 10Gergő Tisza) [14:32:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126051 (https://phabricator.wikimedia.org/T388177) (owner: 10Gergő Tisza) [14:33:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:34:19] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:35:18] (03PS2) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) [14:35:31] (03PS3) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) [14:36:49] 06SRE, 07SRE-Unowned: The ops-maint-gcal.js script is missing support for some vendors - https://phabricator.wikimedia.org/T381680#10619064 (10Scott_French) 05Resolved→03Open a:05Scott_French→03None Alas, my patch only fixed the Arelion details-too-large issue. Unless someone has done so in the interim... 
[14:37:47] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10619073 (10MoritzMuehlenhoff) [14:38:00] jouncebot: nowandnext [14:38:00] For the next 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1400) [14:38:00] In 0 hour(s) and 51 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1530) [14:38:48] (03CR) 10Elukey: [C:03+2] role::maps::{master,replica}: Fix lvs pool config [puppet] - 10https://gerrit.wikimedia.org/r/1125388 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [14:39:23] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10619078 (10MoritzMuehlenhoff) [14:39:44] 06SRE, 10Phabricator, 07Documentation: Outdated documentation how to request LDAP group membership - https://phabricator.wikimedia.org/T388307#10619079 (10MoritzMuehlenhoff) 05Open→03Resolved All updated. 
[14:41:53] (03Merged) 10jenkins-bot: SpecialCentralAutoLogin: Handle nullable wiki ID [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126050 (https://phabricator.wikimedia.org/T388252) (owner: 10Gergő Tisza) [14:41:55] (03Merged) 10jenkins-bot: SUL3: Attach SUL mode to the return URL of local wiki [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126049 (https://phabricator.wikimedia.org/T388067) (owner: 10Gergő Tisza) [14:41:55] (03PS4) 10Federico Ceratto: Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) [14:42:04] (03Merged) 10jenkins-bot: Log and add user IDs that mismatch in the runtime exception [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126051 (https://phabricator.wikimedia.org/T388177) (owner: 10Gergő Tisza) [14:42:16] (03PS1) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) [14:42:24] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1126050|SpecialCentralAutoLogin: Handle nullable wiki ID (T388252)]], [[gerrit:1126049|SUL3: Attach SUL mode to the return URL of local wiki (T388067)]], [[gerrit:1126051|Log and add user IDs that mismatch in the runtime exception (T388177)]] [14:42:40] (03CR) 10Hnowlan: [C:03+1] pcs: Add missing rules for content pregeneration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [14:43:21] (03PS2) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) [14:44:40] !log restart swift on ms-fe2011 [14:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:24] (03CR) 
10Cathal Mooney: [C:03+1] P:durum: reload nginx but don't restart it [puppet] - 10https://gerrit.wikimedia.org/r/1126059 (owner: 10Ssingh) [14:46:18] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: reload nginx but don't restart it [puppet] - 10https://gerrit.wikimedia.org/r/1126059 (owner: 10Ssingh) [14:46:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [14:47:09] (03PS4) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [14:48:03] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): serve 25% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:48:11] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:48:12] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:48:45] !log sudo cumin 'P:durum' 'run-puppet-agent' [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:49:25] (03PS3) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) [14:49:53] !log tgr@deploy2002 tgr: Backport for [[gerrit:1126050|SpecialCentralAutoLogin: Handle nullable wiki ID (T388252)]], [[gerrit:1126049|SUL3: Attach SUL mode to the return URL of local wiki (T388067)]], [[gerrit:1126051|Log and add user IDs that mismatch in the runtime exception (T388177)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:50:02] T388252: PHP 
Deprecated: str_ends_with(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T388252 [14:50:02] T388067: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - https://phabricator.wikimedia.org/T388067 [14:50:02] T388177: RuntimeException: User ID mismatch - https://phabricator.wikimedia.org/T388177 [14:50:24] !log installing pymysql security updates [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [14:51:15] (03PS1) 10Btullis: Update the version of refinery used for refine_sanitize jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126073 (https://phabricator.wikimedia.org/T388417) [14:51:30] (03CR) 10CI reject: [V:04-1] Ask for confirmation before depooling last host in a group [cookbooks] - 10https://gerrit.wikimedia.org/r/1125421 (https://phabricator.wikimedia.org/T299442) (owner: 10Federico Ceratto) [14:51:51] !log tgr@deploy2002 tgr: Continuing with sync [14:53:01] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002" [14:54:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10619189 (10MoritzMuehlenhoff) [14:54:34] (03PS1) 10Slyngshede: Revert "P:firewall absent conntrack_table_size monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126075 [14:55:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002" [14:55:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2045 [14:55:34] (03PS4) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) [14:56:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [14:56:50] (03PS1) 10Brouberol: airflow-analytics-test: fix typo in rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126076 (https://phabricator.wikimedia.org/T388378) [14:57:58] (03CR) 10Scott French: "Thanks, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [14:58:01] (03CR) 10Scott French: [C:03+2] aptrepo: update pcre2 backport from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [14:58:08] (03CR) 10Slyngshede: [C:03+2] Revert "P:firewall absent conntrack_table_size monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126075 (owner: 10Slyngshede) [14:58:12] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126050|SpecialCentralAutoLogin: Handle nullable wiki ID (T388252)]], [[gerrit:1126049|SUL3: Attach SUL mode to the return URL of local wiki (T388067)]], [[gerrit:1126051|Log and add user IDs that mismatch in the runtime exception (T388177)]] (duration: 15m 48s) [14:58:18] T388252: PHP Deprecated: str_ends_with(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T388252 [14:58:18] T388067: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - https://phabricator.wikimedia.org/T388067 [14:58:19] T388177: RuntimeException: User ID mismatch - https://phabricator.wikimedia.org/T388177 [14:58:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2045 [14:58:46] (03CR) 10JMeybohm: [C:03+2] Add pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [14:58:56] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:00:09] !log UTC afternoon deploys done [15:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:12] (03PS5) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [15:01:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046 [15:01:26] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:01:27] !log elukey@cumin2002 END 
(FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:01:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [15:02:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10619222 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm 45/47/48 are fixed. 49/50 are set properly from the start. [15:03:23] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:04:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:04:18] (03Merged) 10jenkins-bot: Add pod-security.wmf.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [15:05:22] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:06:31] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:30] (03CR) 10Btullis: [C:03+1] airflow-analytics-test: fix typo in rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126076 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:07:32] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:08:00] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:08:16] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:09:23] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:10:42] (03CR) 10Filippo Giunchedi: icinga: route cloudelastic alerts to Data Platform SRE (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:10:57] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:12:31] (03PS5) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) [15:12:53] (03CR) 10Bking: icinga: route cloudelastic alerts to Data Platform SRE (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:16:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:16:51] !log filippo@puppetserver1001 conftool action : set/pooled=yes; selector: name=prometheus2006.codfw.wmnet [15:16:55] !log filippo@puppetserver1001 conftool action : set/pooled=yes; selector: name=prometheus2005.codfw.wmnet [15:17:13] !log repool prometheus200[56] - T383232 [15:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:16] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [15:18:18] (03PS1) 10Filippo Giunchedi: pontoon: fix create_hosts arguments [puppet] - 10https://gerrit.wikimedia.org/r/1126083 [15:19:22] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix create_hosts arguments [puppet] - 10https://gerrit.wikimedia.org/r/1126083 (owner: 10Filippo Giunchedi) [15:22:35] (03PS6) 10Filippo Giunchedi: icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:22:51] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:23:27] (03PS1) 10Jforrester: Stop loading the ActiveAbstract extension for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126084 
(https://phabricator.wikimedia.org/T382069) [15:23:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619361 (10Jhancock.wm) [15:23:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619365 (10Jhancock.wm) a:05Kappakayala→03Jhancock.wm [15:24:47] (03CR) 10Filippo Giunchedi: [C:03+1] "Deployment will happen automatically, icinga-exporter is restarted upon changing configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:25:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10619394 (10phaultfinder) [15:26:30] 06SRE, 06serviceops, 07Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10619396 (10joanna_borun) [15:27:25] 06SRE, 06serviceops, 07Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10619404 (10cmooney) p:05Triage→03Low [15:28:19] (03CR) 10Bking: [C:03+2] icinga: route cloudelastic alerts to Data Platform SRE [puppet] - 10https://gerrit.wikimedia.org/r/1126067 (https://phabricator.wikimedia.org/T388270) (owner: 10Bking) [15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1530). Please do the needful. 
[15:30:24] !log installing systemd bugfix updates from Bookworm point release [15:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1012-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:33:11] 06SRE, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10619441 (10LSobanski) What's the timeline for dropping RSA certs? Just so we know how urgent this is. [15:33:24] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10619442 (10LSobanski) [15:35:14] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10619452 (10Jhancock.wm) @MoritzMuehlenhoff the idrac went down again. unfortunately it's a component on the system board itself. i know this is out of warranty and likely to be replaced next fiscal year. i can try re... 
[15:35:41] (03PS1) 10Jelto: gitlab: add both backups to rsync dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1126087 (https://phabricator.wikimedia.org/T388421) [15:36:05] (03CR) 10Volans: "LGTM, I think that after a bit of testing with some few test hosts we could remove the ask part and automatically set it when there is onl" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [15:36:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [15:37:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10619484 (10Vgutierrez) >>! In T385067#10619441, @LSobanski wrote: > What's the timeline for dropping RSA certs? Just so we know how urgent this is.... [15:39:23] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126088 (https://phabricator.wikimedia.org/T128546) [15:39:25] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5044/co" [puppet] - 10https://gerrit.wikimedia.org/r/1126087 (https://phabricator.wikimedia.org/T388421) (owner: 10Jelto) [15:40:35] (03CR) 10Ladsgroup: [C:03+1] "Thanks <3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126084 (https://phabricator.wikimedia.org/T382069) (owner: 10Jforrester) [15:40:55] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126088 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:42:00] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126088 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:43:46] (03CR) 10Majavah: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1125453 (owner: 10JMeybohm) [15:50:30] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10619542 (10Jhancock.wm) a:03Jhancock.wm @Marostegui ordered a new disk with dell. should be here tomorrow. does this disk swap need to be coordinated at all? Dell Service Request: 206749006 [15:50:30] jouncebot: nowandnext [15:50:30] For the next 0 hour(s) and 9 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1530) [15:50:30] In 1 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [15:50:30] In 1 hour(s) and 9 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [15:51:23] (03CR) 10JMeybohm: [C:03+2] Rename TILLER_NAMESPACE to K8S_NAMESPACE [puppet] - 10https://gerrit.wikimedia.org/r/1125453 (owner: 10JMeybohm) [15:51:35] (03CR) 10Dreamy Jazz: [C:03+1] CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel) [15:52:10] Going to deploy now if that's all good. [15:52:15] (03CR) 10Fabfur: "I would delete it entirely if equals to a blank space." 
[puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [15:53:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Preparing db1253 T385141', diff saved to https://phabricator.wikimedia.org/P74174 and previous config saved to /var/cache/conftool/dbconfig/20250310-155332-fceratto.json [15:53:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel) [15:53:36] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [15:53:40] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1126088| Bumping portals to master (T128546)]] (duration: 08m 38s) [15:53:43] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:53:58] (03CR) 10Vgutierrez: "that would require additional refactoring since that `-` isn't considered a valid value for X-requestctl anywhere else" [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [15:54:04] !log reprepro update pcre2_10.42-1~wmf11+1 in component/pcre2 from apt-staging - T386006 [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:07] T386006: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006 [15:54:15] (03Merged) 10jenkins-bot: CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125497 (https://phabricator.wikimedia.org/T380527) (owner: 10Amdrel) [15:55:30] jan_drewniak: Could you ping me when you are done? [15:55:51] Dreamy_Jazz: yup, won't take too long. [15:55:58] Thanks! 
[15:56:02] (03PS6) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [15:56:07] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1126088| Bumping portals to master (T128546)]] (duration: 02m 25s) [15:56:33] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:56:35] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:57:05] Dreamy_Jazz: ok all done. [15:57:29] (03CR) 10Fabfur: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [15:57:31] (03CR) 10Ssingh: "The service IPs need to be set in Netbox. See https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only) before we can proceed. (" [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:57:58] (03CR) 10Ssingh: "Same as in the related patch in the chain: The service IPs need to be set in Netbox. See https://wikitech.wikimedia.org/wiki/LVS#DNS_chang" [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [15:58:01] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2211 - https://phabricator.wikimedia.org/T388295#10619574 (10Marostegui) >>! In T388295#10619542, @Jhancock.wm wrote: > @Marostegui ordered a new disk with dell. should be here tomorrow. does this disk swap need to be coordinated at all? > > Dell Ser... 
[15:58:45] (03PS7) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [15:58:57] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1010.eqiad.wmnet with OS bullseye [15:59:01] (03CR) 10Arnaudb: [C:03+1] gitlab: add both backups to rsync dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1126087 (https://phabricator.wikimedia.org/T388421) (owner: 10Jelto) [15:59:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [15:59:27] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1010 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125227 (owner: 10Bking) [15:59:48] (03CR) 10Muehlenhoff: [C:03+2] keepalived: Install keepalived from the "main" component [puppet] - 10https://gerrit.wikimedia.org/r/1125413 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [15:59:54] (03CR) 10Fabfur: [C:03+2] acme_chief: add parameter for destination path [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [16:00:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373#10619599 (10Jhancock.wm) a:03Jhancock.wm @MatthewVernon drive has been replaced. all alerts have cleared that i can see. let me know if it all looks good on your end. it was out o... 
[16:00:40] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:00:40] !log imported keepalived 1:2.2.7-1~bpo11+1 to main component of bullseye-wikimedia T383557 [16:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:43] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557 [16:00:57] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:01:13] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker2.*,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:01:22] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker1.*,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:02:22] fabfur: I'll merge your acme_chief patch along? [16:04:18] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:04:26] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:04:57] Thanks! [16:05:28] (03CR) 10Jgiannelos: [C:03+2] pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [16:05:47] Looks like deployments are blocked because a security patch isn't applying any more [16:06:02] (03PS3) 10Scott French: php8.1: Install PCRE2 backport from component/php81 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) [16:06:05] Dreamy_Jazz: hm… which branch is it complaining about? 
[16:06:06] !log herron@cumin1002 START - Cookbook sre.dns.netbox [16:06:16] wmf.19 [16:06:21] huh [16:06:25] (03CR) 10Scott French: "Built and verified locally:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:06:30] Patch for T387691 [16:06:40] Which is odd considering that wmf.19 is the current wiki version [16:06:54] that one has known conflicts with wmf.20 (rebased version already available on the task) [16:07:01] (03Merged) 10jenkins-bot: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T388214) (owner: 10Jgiannelos) [16:07:01] but wmf.19 shouldn’t be an issue… [16:07:05] cc sbassett ^ [16:09:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:09:12] * Lucas_WMDE takes a peek at the patches repo [16:09:30] * Dreamy_Jazz is doing that too [16:09:51] oh, I think sbassett accidentally updated the patch in the wmf.19 directory instead of putting the new version in a new wmf.20 directory? 
[16:10:14] (or maybe I’m confused about how it’s supposed to work) [16:10:16] Yeah the date for the patch is too new for applying to wmf.19 [16:10:25] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: enabling aux-k8s codfw vips - herron@cumin1002" [16:10:54] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: enabling aux-k8s codfw vips - herron@cumin1002" [16:10:54] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:00] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl.svc.codfw.wmnet on all recursors [16:12:03] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl.svc.codfw.wmnet on all recursors [16:12:53] I think I know how to fix it (assuming I understand the issue correctly) but if the deployment isn’t particularly urgent I’d rather wait a bit for sbassett to chime in, if that’s okay with you Dreamy_Jazz (and jan_drewniak maybe) [16:13:30] It's not particularly urgent, but the next deployment window is soon [16:13:38] So would affect that too [16:13:44] (03CR) 10Elukey: WIP - sre.hosts.provision: add logic to set PXE for Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [16:14:24] Actually it's 4 hours away, so not that soon [16:14:25] yeah, I meant maybe half an hour or so [16:15:03] (03PS2) 10Vgutierrez: varnish: X-Requestctl is now being handled by HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1125986 [16:15:12] Given that, I'm not going to deplot [16:15:16] *deploy now [16:15:44] I can't make the next window, but if it's fixed in the interim I might deploy then [16:16:23] I should probably revert the config patch, though it is technically a no-op until wmf.20 [16:17:27] (03PS1) 10Dreamy 
Jazz: Revert "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126092 [16:17:38] (03CR) 10Dreamy Jazz: [C:03+2] Revert "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126092 (owner: 10Dreamy Jazz) [16:17:48] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:18:25] (03Merged) 10jenkins-bot: Revert "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126092 (owner: 10Dreamy Jazz) [16:18:31] (03PS1) 10Herron: dns: add aux-k8s ingress/ctrl vips [dns] - 10https://gerrit.wikimedia.org/r/1126093 (https://phabricator.wikimedia.org/T381417) [16:18:33] eh, if it’s causing that much pain maybe I should just try it after all [16:19:19] (03PS2) 10Herron: dns: add aux-k8s ingress/ctrl vips [dns] - 10https://gerrit.wikimedia.org/r/1126093 (https://phabricator.wikimedia.org/T381417) [16:21:22] The revert has been merged, so I'll wait till later. 
[16:21:42] (03PS1) 10Dreamy Jazz: Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126095 [16:21:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126095 (owner: 10Dreamy Jazz) [16:24:20] (03CR) 10Federico Ceratto: [C:03+2] db1253.yaml, db1254.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1126043 (owner: 10Federico Ceratto) [16:27:11] (03CR) 10Ssingh: [C:03+1] dns: add aux-k8s ingress/ctrl vips [dns] - 10https://gerrit.wikimedia.org/r/1126093 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:27:38] (03CR) 10Herron: [C:03+2] dns: add aux-k8s ingress/ctrl vips [dns] - 10https://gerrit.wikimedia.org/r/1126093 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:28:03] !log herron@dns1004 START - running authdns-update [16:29:08] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: add both backups to rsync dependencies [puppet] - 10https://gerrit.wikimedia.org/r/1126087 (https://phabricator.wikimedia.org/T388421) (owner: 10Jelto) [16:29:33] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:29:46] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:30:11] !log herron@dns1004 END - running authdns-update [16:30:55] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl.svc.codfw.wmnet on all recursors [16:30:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl.svc.codfw.wmnet on all recursors [16:31:05] (03PS1) 10Jgiannelos: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 [16:31:17] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache k8s-ingress-aux.svc.codfw.wmnet on all recursors [16:31:20] 
!log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) k8s-ingress-aux.svc.codfw.wmnet on all recursors [16:32:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:32:30] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdj) failed on ms-be2069 - https://phabricator.wikimedia.org/T388373#10619766 (10MatthewVernon) 05Open→03Resolved Yes, it seems good now, thank you for the prompt fix! [16:33:44] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:37:19] (03CR) 10Subramanya Sastry: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 (owner: 10Jgiannelos) [16:37:23] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 (owner: 10Jgiannelos) [16:37:35] (03CR) 10Hnowlan: [C:03+1] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 (owner: 10Jgiannelos) [16:39:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [16:39:16] (03PS1) 10Cathal Mooney: Add new Wikikube staging POD IP ranges to router/switch BGP filter [homer/public] - 10https://gerrit.wikimedia.org/r/1126102 (https://phabricator.wikimedia.org/T386232) [16:39:23] (03CR) 10Jgiannelos: [C:03+2] changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 (owner: 10Jgiannelos) [16:41:03] (03Merged) 10jenkins-bot: changeprop: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126100 (owner: 10Jgiannelos) [16:42:03] (03PS3) 10Volans: query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 [16:42:03] (03PS3) 10Volans: docs: 
removed deprecated call to sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 [16:42:06] (03PS2) 10Volans: puppetdb: add support for structured facts [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) [16:42:19] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:42:27] (03PS1) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) [16:42:30] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:42:50] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:43:11] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:43:22] (03CR) 10Volans: "Updated as agree offline" [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 (owner: 10Volans) [16:43:25] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:43:30] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:43:35] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:44:17] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [16:44:19] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-be2088 - https://phabricator.wikimedia.org/T387257#10619866 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:44:22] ok, I’ll go ahead and try to fix the patches repository (cc sbassett, Dreamy_Jazz) [16:44:50] (03PS1) 10Muehlenhoff: pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104 [16:44:59] (03CR) 10Ssingh: [C:03+1] "Notes: 1) Run Puppet on the aux-k8s-ctrl hosts in codfw after merging this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:45:23] (03PS2) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) [16:45:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10619876 (10Jhancock.wm) @elukey another disk has been pulled! (all i good. i have the easy part) [16:47:09] (03PS6) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [16:47:17] !log sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding aux-k8s-ctrl codfw"' [16:47:18] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-ctrl2002.codfw.wmnet [16:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:32] okay, done, I think [16:47:43] (03PS2) 10Fabfur: sslcert: minor refactoring to use consistent key path [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) [16:47:55] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-ctrl2003.codfw.wmnet [16:48:01] fyi jeena, hashar as train deployers this week: I just created a partially populated wmf.20 directory in /srv/patches [16:48:14] under the assumption that scap prep will later copy over the rest of the patches that didn’t have rebase conflicts [16:48:24] (03CR) 10Herron: [C:03+2] aux-k8s-ctrl codfw: enable lvs [puppet] - 10https://gerrit.wikimedia.org/r/1123426 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [16:48:31] just so you know what’s up if it starts to throw up errors instead… [16:48:40] Scap prep will not update an existing /srv/patches/ 
directory [16:48:45] ok [16:48:51] should I manually copy over the other patches then? [16:48:56] Yes please. [16:48:59] ok, will do [16:49:11] (03CR) 10Elukey: [C:03+1] query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 (owner: 10Volans) [16:49:13] When you're done I can trigger the job that tests patches. [16:50:05] (03PS1) 10JMeybohm: Update wikikube-staging codfw pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1126105 (https://phabricator.wikimedia.org/T386232) [16:50:25] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:50:31] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:50:44] dancy: great, thanks. committed [16:51:09] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:52:29] (03PS1) 10Cathal Mooney: Delegate reverse zones for newly assigned K8s POD IP ranges staging [dns] - 10https://gerrit.wikimedia.org/r/1126108 (https://phabricator.wikimedia.org/T386232) [16:52:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [16:52:58] (03CR) 10JHathaway: [C:03+1] pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104 (owner: 10Muehlenhoff) [16:53:03] (03CR) 10CI reject: [V:04-1] Delegate reverse zones for newly assigned K8s POD IP ranges staging [dns] - 10https://gerrit.wikimedia.org/r/1126108 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [16:53:21] Lucas_WMDE: Success! [16:53:29] \o/ [16:53:32] thanks! [16:53:39] Thanks for cleaning that up. 
[16:53:51] (03PS3) 10JMeybohm: admin_ng: Change staging-codfw pod ip range to 10.192.64.0/21 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126103 (https://phabricator.wikimedia.org/T386232) [16:53:57] Dreamy_Jazz: you should be good to deploy now… [16:53:59] jouncebot: nowandnext [16:53:59] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [16:53:59] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [16:53:59] In 0 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [16:54:03] …unless that’s too soon, I guess [16:54:30] That might be too soon [16:55:06] Given the next window has a task [16:55:48] !log dancy@deploy2002 Installing scap version "4.140.0" for 204 host(s) [16:57:07] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619941 (10Jhancock.wm) @Clement_Goubert could you update the site.pp file to include the wikikube-workker servers? thank you! [16:57:34] (03CR) 10JMeybohm: [C:03+1] "Thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1126102 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [16:57:40] (03PS1) 10Jgiannelos: Revert "pcs: Invalidate summaries on resource change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126109 [16:58:05] (03CR) 10DCausse: [C:03+1] "lgtm, ccing Trey to double check the dict we're pulling" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1125533 (https://phabricator.wikimedia.org/T386868) (owner: 10Ebernhardson) [16:58:05] !log restart pybal on lvs1020 [16:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:59] (03PS2) 10Jgiannelos: Revert "pcs: Invalidate summaries on resource change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126109 [16:59:27] !log enable puppet on lvs2014 [16:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:49] (03CR) 10Hnowlan: [C:03+1] Revert "pcs: Invalidate summaries on resource change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126109 (owner: 10Jgiannelos) [17:00:05] swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700). [17:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700). 
[17:00:10] (03CR) 10Hnowlan: [C:04-1] Revert "pcs: Invalidate summaries on resource change" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126109 (owner: 10Jgiannelos) [17:00:19] !log restart pybal on lvs2014 [17:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:41] !log dancy@deploy2002 Installation of scap version "4.140.0" completed for 204 hosts [17:01:17] o/ [17:01:37] * Lucas_WMDE chants “PHP 8.1! PHP 8.1!” [17:01:39] dancy: FYI, my work during this window won't touch scap (all one-off helmfile) [17:01:42] :) [17:01:55] ok [17:02:15] _but_ if you'd like me to kick the tires on scap before I merge my change, I'm happy to do that too [17:02:39] Perhaps Dreamy_Jazz could do his deployment as that exercise. [17:02:50] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2002.codfw.wmnet [17:02:53] (03PS2) 10Cathal Mooney: Delegate reverse zones for newly assigned K8s POD IP ranges staging [dns] - 10https://gerrit.wikimedia.org/r/1126108 (https://phabricator.wikimedia.org/T386232) [17:02:55] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2003.codfw.wmnet [17:02:56] Sure I could do that. [17:02:58] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2004.codfw.wmnet [17:03:01] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2005.codfw.wmnet [17:03:10] ah, I missed that! [17:03:13] Great!
let 'er rip [17:03:20] (03CR) 10Dreamy Jazz: [C:03+2] Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126095 (owner: 10Dreamy Jazz) [17:03:27] \o/ [17:03:31] great [17:03:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126095 (owner: 10Dreamy Jazz) [17:04:05] (03Merged) 10jenkins-bot: Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126095 (owner: 10Dreamy Jazz) [17:04:36] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1126095|Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki"]] [17:04:40] My config change is a no-op until wmf.20, so should be all good. [17:05:13] (03PS1) 10Jgiannelos: changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 [17:05:15] Security patches no longer merge conflicting. Thanks Lucas_WMDE!
[17:05:22] (03Abandoned) 10Jgiannelos: Revert "pcs: Invalidate summaries on resource change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126109 (owner: 10Jgiannelos) [17:06:33] !log lvs2013: restart pybal [17:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:42] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1010.eqiad.wmnet with OS bullseye [17:06:48] (03PS1) 10Clément Goubert: site.pp: Add wikikube-worker2248-2331, wikikube-ctrl2004-2005 [puppet] - 10https://gerrit.wikimedia.org/r/1126113 (https://phabricator.wikimedia.org/T384970) [17:06:48] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [17:07:32] (03PS2) 10Clément Goubert: site.pp: Add wikikube-worker2248-2331, wikikube-ctrl2004-2005 [puppet] - 10https://gerrit.wikimedia.org/r/1126113 (https://phabricator.wikimedia.org/T384970) [17:08:20] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1126095|Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:29] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [17:08:42] !log sudo cumin 'A:lvs and A:codfw' 'run-puppet-agent --enable "adding aux-k8s-ctrl codfw"' [17:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:57] The test servers check did fail the first time round, but seemed to be temporary network connection issues [17:09:22] (03PS2) 10Jgiannelos: changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 [17:09:26] The error was "Max retries exceeded with url" [17:09:39] Along with "Errno 113 No route to host" [17:09:48] But was fine for the second attempt [17:10:03] !log sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding aux-k8s-ctrl codfw"' 
[17:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:08] Dreamy_Jazz: interesting, did you happen to catch which check it was? [17:10:33] To usability.wikimedia.org/wiki/Main_Page for check_testservers_k8s-2_of_2 [17:11:09] Also chair.wikimedia.org/wiki/Index for check_testservers_k8s-1_of_2 with "Connection reset by peer" [17:11:52] The first one was mwdebug-next and the second was mwdebug [17:12:18] got it, thanks! well, as long as it self-resolved, then I think you're good to go [17:12:18] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudelastic1010.eqiad.wmnet'] [17:12:47] Yeah, it presumably worked the second time round (as I chose "Retry testserver checks") when prompted [17:13:00] sounds good, thanks! [17:13:33] (03CR) 10Cathal Mooney: [C:03+2] Add new Wikikube staging POD IP ranges to router/switch BGP filter [homer/public] - 10https://gerrit.wikimedia.org/r/1126102 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [17:13:39] Nearly done with the deploy. 
About 75% of the way through K8s deployment [17:13:42] (03CR) 10Hnowlan: changeprop: Fix list of endpoints to be pregenerated in PCS level (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 (owner: 10Jgiannelos) [17:13:54] (03PS3) 10Jgiannelos: changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 [17:14:11] (03CR) 10Elukey: "I have some concerns on the complexity that the ACLs will bring in, but I'll follow up in the task 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [17:14:56] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126095|Revert^2 "CommonSettings.php: Add $wgCentralAuthAutomaticVanishWiki"]] (duration: 10m 20s) [17:15:03] (03Merged) 10jenkins-bot: Add new Wikikube staging POD IP ranges to router/switch BGP filter [homer/public] - 10https://gerrit.wikimedia.org/r/1126102 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [17:15:06] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10620048 (10fnegri) The temperature remains very close to the threshold, and the alert has been firing intermittently since my previous comment.... [17:15:12] swfrench-wmf: Done with the deployment [17:15:25] Dreamy_Jazz: thank you! also for testing scap for us :) [17:15:36] dancy: any objections if I move forward? [17:15:42] Nope! [17:15:43] (03PS4) 10Jgiannelos: changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 [17:15:53] great, off we go [17:16:06] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:16:07] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): serve 25% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:16:19] (03CR) 10Jgiannelos: changeprop: Fix list of endpoints to be pregenerated in PCS level (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 (owner: 10Jgiannelos) [17:17:36] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Use php81 base image [puppet] - 10https://gerrit.wikimedia.org/r/1126011 (https://phabricator.wikimedia.org/T387916) (owner: 10Clément Goubert) [17:17:39] (03Merged) 10jenkins-bot: mw-(api-ext|web): serve 25% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125503 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:17:42] (03CR) 10Kamila Součková: [C:03+1] site.pp: Add wikikube-worker2248-2331, wikikube-ctrl2004-2005 [puppet] - 10https://gerrit.wikimedia.org/r/1126113 (https://phabricator.wikimedia.org/T384970) (owner: 10Clément Goubert) [17:21:36] FIRING: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [17:21:56] !incidents [17:21:56] 5719 (UNACKED) [2x] GatewayBackendErrorsHigh sre (api-gateway eqiad) [17:21:56] 5717 (RESOLVED) db1152 (paged)/MariaDB read only ms1 (paged) [17:22:00] pausing work [17:22:08] (not touched anything yet) [17:22:12] (03CR) 10Clément Goubert: [C:03+2] site.pp: Add wikikube-worker2248-2331, wikikube-ctrl2004-2005 [puppet] 
- 10https://gerrit.wikimedia.org/r/1126113 (https://phabricator.wikimedia.org/T384970) (owner: 10Clément Goubert) [17:22:36] !ack 5719 [17:22:37] 5719 (ACKED) [2x] GatewayBackendErrorsHigh sre (api-gateway eqiad) [17:22:45] looking [17:23:40] \o [17:24:58] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10620089 (10LSobanski) @jhathaway any thoughts on this? [17:25:10] jouncebot: nowandnext [17:25:10] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [17:25:10] For the next 0 hour(s) and 4 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T1700) [17:25:10] In 2 hour(s) and 34 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T2000) [17:25:19] looks like most liftwing backends are having issues [17:25:28] I can help checking if needed [17:25:31] eqiad right? 
[17:25:38] yeah [17:25:56] just reference_need and reference_risk it seems [17:26:07] (03CR) 10BCornwall: [C:03+1] Delegate reverse zones for newly assigned K8s POD IP ranges staging [dns] - 10https://gerrit.wikimedia.org/r/1126108 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [17:26:45] (03PS3) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) [17:26:57] TIL those two, they must be new [17:27:29] (03CR) 10Vgutierrez: [C:03+1] sslcert: minor refactoring to use consistent key path [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [17:27:32] yeah, added a few weeks back [17:27:37] ah the revision-models ns okok [17:29:16] I am writing in the ml channel, I think it is a re-occurrence of a known issue [17:29:41] preprocess needs to run some cpubound code and that stalls the ioloop [17:30:58] thank you! [17:31:06] This is probably no reason to block swfrench-wmf right? [17:31:34] with a deployment? Nono I think it is not a blocker [17:31:37] TIL there was a liftwing api via the gateway [17:32:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10620139 (10Clement_Goubert) >>! In T384970#10619941, @Jhancock.wm wrote: > @Clement_Goubert could you update the site.pp file to include the wikiku... 
[17:32:33] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:25] hnowlan: elukey: ack, yeah if adding a deployment into the mix would not cause confusion or make noise that makes troubleshooting harder, I'll go ahead [17:34:47] swfrench-wmf: nono I think it is external traffic from WME [17:35:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:35:23] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:35:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:36:00] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:37:38] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:37:47] (03CR) 10Hnowlan: [C:03+1] changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 (owner: 10Jgiannelos) [17:37:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:38:11] (03CR) 10Ottomata: [C:03+1] Update the version of refinery used for refine_sanitize jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126073 (https://phabricator.wikimedia.org/T388417) (owner: 10Btullis) [17:38:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:38:27] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:40:10] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4044.ulsfo.wmnet [17:40:27] ms-be2075 going down was a downtime expiring? 
[17:40:28] !log Upgrading cp4044 to Varnish 7 (T378737) [17:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:31] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [17:40:45] (03Merged) 10jenkins-bot: changeprop: Fix list of endpoints to be pregenerated in PCS level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126110 (owner: 10Jgiannelos) [17:41:30] herron, urandom - o/ need to go but you can sync with isaranto. TL;DR is that some Enterprise requests are taking a heavy toll on some model servers, in turn causing 50x via API gateway. The fix is not easy.. [17:41:42] (03CR) 10Fabfur: [C:03+2] sslcert: minor refactoring to use consistent key path [puppet] - 10https://gerrit.wikimedia.org/r/1125415 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [17:43:06] (03CR) 10Ssingh: [C:03+1] "Same notes as previous: please run Puppet on the backend immediately for the role switch." [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:43:57] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:44:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:44:40] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:44:51] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:45:22] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:45:33] (03PS2) 10Herron: aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) [17:45:37] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:45:48] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:45:55] !log sudo cumin 'A:lvs-codfw' 'disable-puppet "adding 
k8s-ingress-aux codfw"'T381417 [17:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:59] T381417: aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417 [17:46:01] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:46:37] (03PS1) 10BCornwall: upgrade cp4044 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1126119 (https://phabricator.wikimedia.org/T378737) [17:46:48] (03CR) 10Herron: [C:03+2] aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:47:01] !log mw-(api-ext|web): migrated 25% of residual PHP 7.4 traffic to 8.1 - T383845 [17:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:05] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [17:47:23] swfrench-wmf: \o/ [17:47:29] nice [17:47:35] :) [17:47:36] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [17:47:37] (03CR) 10CDobbins: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [17:48:01] (03CR) 10CDobbins: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [17:48:21] nemo-yiannis: I'll have a scap to run afterwards, can you ping me when done? 
[17:48:25] having the 8.1 deployments be large enough to actually have the images cached locally on the k8s node is so nice :) [17:48:35] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1126119 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:48:40] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [17:48:41] swfrench-wmf: hahahah yeah [17:48:46] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [17:48:50] critical mass [17:49:01] precisely, yeah [17:49:18] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [17:49:23] (03CR) 10Ssingh: [C:03+1] "Looks good, good luck on the first release!" [puppet] - 10https://gerrit.wikimedia.org/r/1126119 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:49:33] claime: done [17:49:37] cool thanks [17:49:42] (03CR) 10BCornwall: [V:03+1 C:03+2] upgrade cp4044 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1126119 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:49:47] (03PS1) 10Dzahn: mariadb: grant RT GRANTs for m1 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126121 (https://phabricator.wikimedia.org/T388437) [17:50:18] claime: you will scap to pick up the mediawiki-deployments.yaml change, or shall I? 
[17:50:23] swfrench-wmf: doing it [17:50:44] awesome, thank you :) [17:51:55] (03PS1) 10Ilias Sarantopoulos: ml-services: revert uvicorn multiple workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126122 (https://phabricator.wikimedia.org/T387019) [17:52:01] swfrench-wmf: great it shows no change [17:52:06] but I know it'll pick it up [17:52:15] it's because it's a change to the templated stuff x) [17:52:19] !log cgoubert@deploy2002 Started scap sync-world: mw-cron to php 8.1 - T387916 [17:52:22] T387916: Migrate mw-cron to PHP 8.1 - https://phabricator.wikimedia.org/T387916 [17:52:38] claime: heh, also the diff runs _before_ the release files are updated :) [17:52:45] ooooh yeah [17:52:47] that too [17:52:54] :shrug" [17:53:11] (03CR) 10Dzahn: "re: the different IPs in here: 10.64.32.10 is dbproxy1024 10.64.0.15 is dbproxy1022, 10.192.23.11 is dbproxy2005" [puppet] - 10https://gerrit.wikimedia.org/r/1126121 (https://phabricator.wikimedia.org/T388437) (owner: 10Dzahn) [17:54:14] !log cgoubert@deploy2002 Finished scap sync-world: mw-cron to php 8.1 - T387916 (duration: 02m 49s) [17:54:35] (03PS2) 10Ilias Sarantopoulos: ml-services: revert uvicorn multiple workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126122 (https://phabricator.wikimedia.org/T387019) [17:55:03] !log restart pybal on lvs2014 [17:55:04] (03PS7) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [17:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:30] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: revert uvicorn multiple workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126122 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [17:56:31] swfrench-wmf: kubectl describe cronjobs.batch mediawiki-main-serviceops-version | grep -A2 'mediawiki-main-app' | grep 'Image' [17:56:33] 
Image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-03-10-170449-publish-81 [17:56:35] \o/ [17:56:44] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2003.codfw.wmnet [17:56:49] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2005.codfw.wmnet [17:56:59] claime: nice! and thank you :) [17:57:06] (03Merged) 10jenkins-bot: ml-services: revert uvicorn multiple workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126122 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [17:57:13] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-aux_30443: Servers aux-k8s-worker2002.codfw.wmnet, aux-k8s-worker2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:57:21] ok [17:57:22] Now we wait ~5 minutes to make sure the very important "Special:Version" cronjob works [17:57:26] herron: ^ [17:57:39] we are looking [17:58:15] PROBLEM - statsv Varnishkafka log producer on cp4044 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:58:16] PROBLEM - Webrequests Varnishkafka log producer on cp4044 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:58:34] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[17:58:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [17:58:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125491 (https://phabricator.wikimedia.org/T388218) (owner: 10D3r1ck01) [17:59:09] (03CR) 10Federico Ceratto: "Given that we have to testbed running the command without command/script works as a canary test of sort." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [17:59:15] RECOVERY - statsv Varnishkafka log producer on cp4044 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:59:16] RECOVERY - Webrequests Varnishkafka log producer on cp4044 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:00:00] ^ expected, brett working on it [18:00:05] pybal should be happy soon [18:00:26] oh shoot, thanks for that [18:02:56] (03PS1) 10Scott French: Profile::Mediawiki_deployment: add 'deploy' field to release config [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) [18:02:56] (03PS1) 10Scott French: hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) [18:02:58] (03PS1) 10Scott French: deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) [18:03:02] !log herron@puppetserver1001 conftool action : set/pooled=no; selector: name=aux-k8s-worker2002.codfw.wmnet [18:03:05] !log 
herron@puppetserver1001 conftool action : set/pooled=no; selector: name=aux-k8s-worker2004.codfw.wmnet [18:04:05] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10620286 (10VRiley-WMF) Understood, I'm currently investigating this [18:04:09] (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [18:06:28] (03CR) 10Scott French: "This is the start of a 3 patchset series that would decouple mwscript-k8s from mw-web in terms of mediawiki image version, making use of [" [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [18:07:57] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:09:40] sukhe is that you? [18:09:51] (03PS1) 10Herron: Revert "aux-k8s codfw: enable worker ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1126124 [18:10:14] vgutierrez: yeah, me and herron, reverting it. 
backend issues [18:10:29] (03CR) 10Ssingh: [C:03+1] Revert "aux-k8s codfw: enable worker ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1126124 (owner: 10Herron) [18:10:31] (03CR) 10Herron: [C:03+2] Revert "aux-k8s codfw: enable worker ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1126124 (owner: 10Herron) [18:12:18] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [18:12:20] (03PS8) 10Ssingh: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [18:13:49] (03CR) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [18:14:28] !log restart pybal on lvs2014 for reverted aux-k8s change [18:14:29] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:27] !log sudo cumin 'A:lvs-codfw' 'run-puppet-agent --enable "adding k8s-ingress-aux codfw"' [18:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:36] herron, urandom: I've upgraded Varnish from 6.0 to 7.1 on cp4044 (ulsfo). Things seem to be okay but just be aware. 
[18:17:02] brett: ack ok [18:17:55] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:17:58] !log restart pybal on lvs2013: not required but to clear up possible no restart alerts [18:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:52] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device lsw1-e8-eqiad.mgmt.eqiad.wmnet [18:21:54] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:25:35] (03CR) 10Cathal Mooney: [C:03+2] Delegate reverse zones for newly assigned K8s POD IP ranges staging [dns] - 10https://gerrit.wikimedia.org/r/1126108 (https://phabricator.wikimedia.org/T386232) (owner: 10Cathal Mooney) [18:25:50] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e8-eqiad - cmooney@cumin1002" [18:26:03] !log cmooney@dns2005 START - running authdns-update [18:27:49] !log cmooney@dns2005 END - running authdns-update [18:32:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e8-eqiad - cmooney@cumin1002" [18:32:44] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:33:06] (03PS2) 10Scott French: mw-(api-ext|web): serve 50% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) [18:33:07] (03CR) 10Scott French: "This is planned for tomorrow, 11th of March. Thanks in advance for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:34:08] (03CR) 10Fabfur: [C:04-2] haproxy: use TLS tmpfiles and add certificate check script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [18:34:22] !log cmooney@cumin1002 START - Cookbook sre.network.provision for device lsw1-f8-eqiad.mgmt.eqiad.wmnet [18:34:24] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:39:33] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1010.eqiad.wmnet with OS bullseye [18:39:41] (03CR) 10RLazarus: [C:03+1] Profile::Mediawiki_deployment: add 'deploy' field to release config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [18:40:40] (03CR) 10RLazarus: [C:03+1] hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [18:42:05] (03CR) 10Fabfur: [C:03+1] varnish: X-Requestctl is now being handled by HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1125986 (owner: 10Vgutierrez) [18:44:04] (03PS1) 10Ladsgroup: FileModule: Normalize file paths for deps tracked from CSSMin [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126129 (https://phabricator.wikimedia.org/T388323) [18:44:47] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f8-eqiad - cmooney@cumin1002" [18:44:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-f8-eqiad - cmooney@cumin1002" [18:44:52] !log cmooney@cumin1002 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [18:46:25] (03CR) 10RLazarus: [C:03+1] deployment_server: Use mw-script release values file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [18:49:56] jouncebot: nowandnext [18:49:56] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [18:49:56] In 1 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T2000) [18:50:03] (03CR) 10Ladsgroup: [C:03+2] FileModule: Normalize file paths for deps tracked from CSSMin [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126129 (https://phabricator.wikimedia.org/T388323) (owner: 10Ladsgroup) [18:52:26] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10620517 (10VRiley-WMF) Is there a timeframe for us to take this server down? [18:52:56] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10620518 (10AStein-WMF) @MoritzMuehlenhoff here's the ssh public key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIF+hqg33Lh8JNLmqz3T... [18:54:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126129 (https://phabricator.wikimedia.org/T388323) (owner: 10Ladsgroup) [18:55:59] (03PS3) 10Scott French: mw-(api-ext|web): serve 100% of residual traffic on 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) [18:55:59] (03CR) 10Scott French: "This is the start of a 6 patchset series that moves us to 100% PHP 8.1 for mw-api-ext and mw-web. 
It can be thought of as two phases:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125505 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:56:04] (03PS1) 10Scott French: hieradata: switch all releases of mw-(apt-ext|web) to 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1125501 (https://phabricator.wikimedia.org/T383845) [18:56:06] (03PS3) 10Scott French: mw-(api-ext|web): direct residual traffic back to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125506 (https://phabricator.wikimedia.org/T383845) [18:56:08] (03PS4) 10Scott French: mw-(api-ext|web): scale main up to normal multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125507 (https://phabricator.wikimedia.org/T383845) [18:56:10] (03PS1) 10Scott French: trafficserver: revert cookie-enrolled traffic to main [puppet] - 10https://gerrit.wikimedia.org/r/1125502 (https://phabricator.wikimedia.org/T383845) [18:56:12] (03PS5) 10Scott French: mw-(api-ext|web): scale next down to 1 replica [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125508 (https://phabricator.wikimedia.org/T383845) [18:58:34] (03PS1) 10Gergő Tisza: Enable SUL3 signup for all of group 1 and 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126131 (https://phabricator.wikimedia.org/T384007) [18:58:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126131 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [18:59:49] (03CR) 10Ssingh: haproxy: use TLS tmpfiles and add certificate check script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) (owner: 10Fabfur) [19:01:32] (03PS9) 10Fabfur: haproxy: use TLS tmpfiles and add certificate check script [puppet] - 
10https://gerrit.wikimedia.org/r/1125541 (https://phabricator.wikimedia.org/T388147) [19:02:26] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10620585 (10VRiley-WMF) 05Open→03Resolved [19:02:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-e8-eqiad.mgmt.eqiad.wmnet [19:03:25] (03Merged) 10jenkins-bot: FileModule: Normalize file paths for deps tracked from CSSMin [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126129 (https://phabricator.wikimedia.org/T388323) (owner: 10Ladsgroup) [19:03:42] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.eqiad.wmnet with OS bullseye [19:03:46] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1126129|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] [19:03:49] T388323: ResourceLoaderModule-dependencies writes the exact same value to database multiple times every second - https://phabricator.wikimedia.org/T388323 [19:06:30] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1126129|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:04] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [19:09:18] (03PS1) 10Reedy: Drop TemplateData EventStreams/EventLogging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126134 (https://phabricator.wikimedia.org/T258917) [19:09:37] (03CR) 10Reedy: [C:04-2] "Minus 2 until the owners have spoken up..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126134 (https://phabricator.wikimedia.org/T258917) (owner: 10Reedy) [19:11:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f8-eqiad.mgmt.eqiad.wmnet [19:11:55] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Set up dual-stack ECDSA/RSA certificate support for Exim - https://phabricator.wikimedia.org/T385067#10620624 (10jhathaway) I think this should be fine, looking through the logs it appears that almost all clients are 1.2 or 1.3. The handful of 1.1 T... [19:14:39] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126129|FileModule: Normalize file paths for deps tracked from CSSMin (T388323)]] (duration: 10m 53s) [19:14:42] T388323: ResourceLoaderModule-dependencies writes the exact same value to database multiple times every second - https://phabricator.wikimedia.org/T388323 [19:15:08] (03PS6) 10Andrea Denisse: alert: Remove stale vops-bot-sync-db* service [puppet] - 10https://gerrit.wikimedia.org/r/1126128 (https://phabricator.wikimedia.org/T388444) [19:15:08] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1126128/5049/" [puppet] - 10https://gerrit.wikimedia.org/r/1126128 (https://phabricator.wikimedia.org/T388444) (owner: 10Andrea Denisse) [19:17:29] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10620665 (10fnegri) I think this one is tricky to depool, there are some notes at https://wikitech.wikimedia.org/wiki/Dumps/Dump_servers#Mainten... 
[19:18:37] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T388454 (10phaultfinder) 03NEW [19:18:46] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10620679 (10fnegri) Routing the traffic to the other host would also clarify if the high temperature is somehow related to the user load on this... [19:19:52] (03PS3) 10Kamila Součková: benthos-mw-accesslog-metrics: create deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 [19:20:00] (03PS2) 10Cathal Mooney: Add new Juniper leaf switches eqiad E8/F8 to IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1125488 (https://phabricator.wikimedia.org/T382017) [19:21:03] (03Abandoned) 10Cathal Mooney: Add new Juniper leaf switches eqiad E8/F8 to IBGP cluster [homer/public] - 10https://gerrit.wikimedia.org/r/1125488 (https://phabricator.wikimedia.org/T382017) (owner: 10Cathal Mooney) [19:22:06] (03PS1) 10Cathal Mooney: Add new switches eqiad racks E8/F8 [homer/public] - 10https://gerrit.wikimedia.org/r/1126136 (https://phabricator.wikimedia.org/T382017) [19:22:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [19:25:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10620728 (10phaultfinder) [19:25:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1010.eqiad.wmnet with reason: host reimage [19:30:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1012-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:42:01] (03PS1) 10Sbisson: CX3 Build 1.0.0+20250310 [extensions/ContentTranslation] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126139 (https://phabricator.wikimedia.org/T284422) [19:43:21] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [19:43:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/ContentTranslation] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126139 (https://phabricator.wikimedia.org/T284422) (owner: 10Sbisson) [19:45:05] (03PS1) 10Kimberly Sarabia: Deploy to test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T388438) [19:48:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10620837 (10Jhancock.wm) the backplanes have been replaced. it was more difficult than i anticipated. When you have a chance, please let me know if the errors have ceased. Not su... [19:50:24] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T388454#10620858 (10Jhancock.wm) this is a side effect of automation testing on a new supermicro server. disregard until the 17th to see if it resolves. 
[19:50:58] (03PS8) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [19:52:25] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [19:52:30] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [19:58:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1010.eqiad.wmnet with OS bullseye [19:59:04] (03PS2) 10Jdlrobson: Deploy to test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [19:59:31] (03CR) 10Jdlrobson: Deploy to test wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T2000) [20:00:05] lucaswerkmeister, tgr, and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:40] o/ [20:01:07] o/ [20:03:25] (03CR) 10Reedy: Deploy to test wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [20:03:49] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4038.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [20:03:54] any deployers around? [20:05:33] I can deploy [20:05:50] thanks! 
[20:06:38] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4038.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [20:07:07] (03PS1) 10RLazarus: httpbb: Add a test case for www.wikipedia.org/ [puppet] - 10https://gerrit.wikimedia.org/r/1126144 (https://phabricator.wikimedia.org/T387549) [20:07:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [20:07:49] the easiest way to test my change is probably just to check with grep -r that there are no references to that variable left in wmf.19 ;) [20:07:54] but I can also test it on mwdebug later [20:08:13] Please do. [20:08:14] (03Merged) 10jenkins-bot: Remove $wgAllowAuthenticatedCrossOrigin again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [20:08:32] tgr: Are you lurking? 
[20:08:33] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1123741|Remove $wgAllowAuthenticatedCrossOrigin again (T322944)]] [20:08:37] T322944: Allow authenticated requests via OAuth to the Action API from any origin - https://phabricator.wikimedia.org/T322944 [20:11:13] !log dancy@deploy2002 lucaswerkmeister, dancy: Backport for [[gerrit:1123741|Remove $wgAllowAuthenticatedCrossOrigin again (T322944)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:38] checking, one sec [20:12:00] (03PS2) 10Scott French: Profile::Mediawiki_deployment: add 'deploy' field to release config [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) [20:12:00] (03PS2) 10Scott French: hieradata: add mw-script non-deploy releases to mw_releases [puppet] - 10https://gerrit.wikimedia.org/r/1125474 (https://phabricator.wikimedia.org/T387917) [20:12:00] (03PS2) 10Scott French: deployment_server: Use mw-script release values file [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) [20:13:11] dancy: looks good to me! [20:13:18] ok! Proceeding [20:13:21] !log dancy@deploy2002 lucaswerkmeister, dancy: Continuing with sync [20:13:57] (03CR) 10Ahmon Dancy: [C:03+2] CX3 Build 1.0.0+20250310 [extensions/ContentTranslation] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126139 (https://phabricator.wikimedia.org/T284422) (owner: 10Sbisson) [20:15:24] (03CR) 10Scott French: "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [20:16:17] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250310 [extensions/ContentTranslation] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1126139 (https://phabricator.wikimedia.org/T284422) (owner: 10Sbisson) [20:18:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10620969 (10Jclark-ctr) [20:19:52] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123741|Remove $wgAllowAuthenticatedCrossOrigin again (T322944)]] (duration: 11m 18s) [20:19:55] T322944: Allow authenticated requests via OAuth to the Action API from any origin - https://phabricator.wikimedia.org/T322944 [20:20:07] stephanebisson: Ready to go? [20:20:14] yep [20:20:31] thanks dancy! \o/ [20:20:43] !log dancy@deploy2002 Started scap sync-world: Backport for [[gerrit:1126139|CX3 Build 1.0.0+20250310 (T284422 T387036)]] [20:20:47] T284422: New translation: Get mobile friendly image - https://phabricator.wikimedia.org/T284422 [20:20:48] T387036: Duplicated bookmarks on desktop dashboard - https://phabricator.wikimedia.org/T387036 [20:23:20] !log dancy@deploy2002 sbisson, dancy: Backport for [[gerrit:1126139|CX3 Build 1.0.0+20250310 (T284422 T387036)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:24:10] stephanebisson: Please test out the change and let me know how it goes. [20:24:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install db1257 - https://phabricator.wikimedia.org/T384979#10621002 (10Jclark-ctr) [20:24:53] dancy working as expected [20:25:03] Awesome. continuing. 
[20:25:06] !log dancy@deploy2002 sbisson, dancy: Continuing with sync [20:27:02] (03PS2) 10RLazarus: httpbb: Add test cases for wikipedia.org/ and www.wikipedia.org/ [puppet] - 10https://gerrit.wikimedia.org/r/1126144 (https://phabricator.wikimedia.org/T387549) [20:27:49] (03CR) 10Scott French: "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [20:30:57] (03CR) 10RLazarus: [C:03+1] Profile::Mediawiki_deployment: add 'deploy' field to release config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125473 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [20:31:30] !log dancy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126139|CX3 Build 1.0.0+20250310 (T284422 T387036)]] (duration: 10m 46s) [20:31:38] T284422: New translation: Get mobile friendly image - https://phabricator.wikimedia.org/T284422 [20:31:40] T387036: Duplicated bookmarks on desktop dashboard - https://phabricator.wikimedia.org/T387036 [20:32:06] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1126144 (https://phabricator.wikimedia.org/T387549) (owner: 10RLazarus) [20:32:39] OK. I'm done with deployments. I did not deploy tgr's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1126131 I'm taking a break for a bit and will check back later. [20:33:02] thanks dancy! 
[20:36:28] (03CR) 10RLazarus: deployment_server: Use mw-script release values file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125475 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [20:42:48] (03CR) 10Btullis: [C:03+2] Update the version of refinery used for refine_sanitize jobs [puppet] - 10https://gerrit.wikimedia.org/r/1126073 (https://phabricator.wikimedia.org/T388417) (owner: 10Btullis) [20:46:34] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): serve 50% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125504 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [20:47:56] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10621036 (10Jclark-ctr) Netbox offline script not run [20:48:02] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10621037 (10Jclark-ctr) 05Resolved→03Open [20:48:17] dancy: sorry, forgot about the daylight saving weirdness. I can self-deploy. [20:49:27] I'll wait to see if the Security team wants to do something in their window first. [20:49:32] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1023.eqiad.wmnet - https://phabricator.wikimedia.org/T384679#10621039 (10Jclark-ctr) 05Open→03Resolved a:05Papaul→03Jclark-ctr Ran offline script 2025-03-10T20:48:43.084872+00:00 — Successfully offlined device es1023 (WM... [20:52:01] (03PS1) 10BCornwall: upgrade-varnish: Remove vmods/varnish explicitly [cookbooks] - 10https://gerrit.wikimedia.org/r/1126151 [21:00:05] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T2100). 
[21:02:05] (03PS3) 10Scott French: P:mediawiki::php: install PCRE2 backport from component/php81 [puppet] - 10https://gerrit.wikimedia.org/r/1125529 (https://phabricator.wikimedia.org/T386006) [21:02:05] (03CR) 10Scott French: "There are likely a couple of ways we could go about this, particularly because `component/php81` is already used for the 8.1 packages dire" [puppet] - 10https://gerrit.wikimedia.org/r/1125529 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [21:05:15] (03CR) 10RLazarus: [C:03+2] httpbb: Add test cases for wikipedia.org/ and www.wikipedia.org/ [puppet] - 10https://gerrit.wikimedia.org/r/1126144 (https://phabricator.wikimedia.org/T387549) (owner: 10RLazarus) [21:05:46] (03CR) 10Ssingh: [C:03+1] "makes sense! explicit better than implicit." [cookbooks] - 10https://gerrit.wikimedia.org/r/1126151 (owner: 10BCornwall) [21:07:55] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:10:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621110 (10phaultfinder) [21:11:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for restbase - jclark@cumin1002" [21:11:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for restbase - jclark@cumin1002" [21:11:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:11:54] (03CR) 10CDobbins: "Sukhbir suggested making them two commits that could be rolled out separately, one commit for the countries that currently have non-defaul" [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [21:13:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:23] o/ looks like 
the security window is not being used, is it OK to deploy one more config change? [21:14:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:42] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:43] !log installed new benthos version (4.27.0-2 over 4.27.0-1) on cp4037 for testing [21:14:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:16:01] (03PS4) 10Fabfur: Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) [21:17:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:17:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:20:22] (03CR) 10Fabfur: [C:03+2] Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [21:21:09] (03CR) 10Volans: upgrade-varnish: Remove vmods/varnish explicitly (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1126151 (owner: 10BCornwall) [21:21:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:21:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy
FORCE_RESTART [21:21:36] FIRING: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [21:22:21] (03PS3) 10Kimberly Sarabia: Deploy donate banner to test wiki for event logging testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) [21:22:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:22:51] (03CR) 10Effie Mouzeli: [C:03+1] php8.1: Install PCRE2 backport from component/php81 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1125536 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [21:23:29] I'll go forward then [21:23:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:24:10] tgr_: I think there's an ongoing page about `api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad ` [21:24:15] !incidents [21:24:16] 5719 (ACKED) [2x] GatewayBackendErrorsHigh sre (api-gateway eqiad) [21:24:16] 5717 (RESOLVED) db1152 (paged)/MariaDB read only ms1 (paged) [21:24:26] is it safe now? [21:25:46] should I wait until it gets resolved? the change is not API related [21:27:05] dunno, wdyt herron? 
[21:27:45] afaik it is ongoing liftwing 5xx issue being monitored and safe to proceed [21:28:01] (03CR) 10Jdlrobson: Deploy donate banner to test wiki for event logging testing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [21:28:11] thx [21:28:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1257.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:29:40] 👍 [21:31:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126131 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [21:32:38] (03Merged) 10jenkins-bot: Enable SUL3 signup for all of group 1 and 1% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126131 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [21:32:54] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1126131|Enable SUL3 signup for all of group 1 and 1% of group 2 users (T384007 T384218)]] [21:32:59] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [21:33:00] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [21:35:44] !log tgr@deploy2002 tgr: Backport for [[gerrit:1126131|Enable SUL3 signup for all of group 1 and 1% of group 2 users (T384007 T384218)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:39:03] (03PS1) 10Subramanya Sastry: CommonSettings.php: Remove reference to scandium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126156 [21:41:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:41:55] !log tgr@deploy2002 tgr: Continuing with sync [21:42:36] 07Puppet, 06SRE, 
06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10621259 (10Jdlrobson-WMF) 05Open→03In progress [21:42:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:43:36] (03CR) 10JHathaway: [C:03+1] "looks good, I did some poking around and I think it is okay to rely on this API. I found a handful of bug reports about corner cases, e.g." [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [21:47:31] (03PS4) 10Kimberly Sarabia: Deploy donate banner to test wiki for event logging testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) [21:48:15] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1126131|Enable SUL3 signup for all of group 1 and 1% of group 2 users (T384007 T384218)]] (duration: 15m 21s) [21:48:20] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [21:48:20] T384218: SUL3 Phase 2: Staged rollout for all new account creation - https://phabricator.wikimedia.org/T384218 [21:48:39] tgr: No problem. 'tis the season. [21:48:58] !log UTC late deploys done [21:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:29] (03CR) 10Volans: "Lol this was actually one of the test I did and it does work. You have to either set 536 as integer in the JSON (because it's an integer i" [software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [21:54:15] (03CR) 10JHathaway: [C:03+1] "ah nice, yup just tested with 536 unquoted, and it worked, great." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/1125974 (https://phabricator.wikimedia.org/T372666) (owner: 10Volans) [21:54:17] (03PS5) 10Kimberly Sarabia: Deploy donate banner to test wiki for event logging testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) [21:54:18] (03PS2) 10Dzahn: mariadb: remove RT GRANTs for m1 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1126121 (https://phabricator.wikimedia.org/T388437) [21:54:32] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Trivial string changes [puppet] - 10https://gerrit.wikimedia.org/r/1125935 (owner: 10Aklapper) [21:55:27] (03CR) 10Kimberly Sarabia: Deploy donate banner to test wiki for event logging testing (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [22:03:14] (03PS1) 10Dzahn: deployment_server/k8s: set kubeconfig files for codesearch [puppet] - 10https://gerrit.wikimedia.org/r/1126170 (https://phabricator.wikimedia.org/T268199) [22:06:49] (03CR) 10Jdlrobson: [C:03+1] "Thanks Kim!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126140 (https://phabricator.wikimedia.org/T387768) (owner: 10Kimberly Sarabia) [22:13:46] (03PS1) 10Dzahn: create a namespace for codesearch on k8s-aux cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) [22:15:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1012-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:16:20] (03PS1) 10Dzahn: create codesearch.wikimedia.org, point to standard DYNA [dns] - 10https://gerrit.wikimedia.org/r/1126176 (https://phabricator.wikimedia.org/T268199) [22:16:30] (03PS2) 10Dzahn: create codesearch.wikimedia.org, point to standard DYNA [dns] - 10https://gerrit.wikimedia.org/r/1126176 (https://phabricator.wikimedia.org/T268199) [22:16:55] (03CR) 10Dzahn: create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:20:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621429 (10phaultfinder) [22:21:54] (03PS1) 10Dzahn: add ingress service alias for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) [22:22:10] (03CR) 10Dzahn: create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:31:47] (03PS2) 10Dzahn: add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) [22:34:43] (03PS3) 
10Dzahn: add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) [22:37:40] (03PS1) 10Dzahn: add k8s ingress service aliases for jaeger in codfw [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) [22:50:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10621483 (10phaultfinder) [22:58:00] (03PS1) 10Dzahn: create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) [22:59:36] (03CR) 10Dzahn: "looks like k8s-ingress-aux-* records are missing for the entire aux cluster." [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250310T2300) [23:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:13:23] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:25:48] (03PS1) 10MusikAnimal: InitialiseSettings-labs: enable multiblocks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126184 (https://phabricator.wikimedia.org/T377121) [23:27:06] RESOLVED: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [23:30:03] (03PS9) 10Pppery: Fix some wrong descriptions of old dumps [puppet] - 10https://gerrit.wikimedia.org/r/1112123 (https://phabricator.wikimedia.org/T388472) [23:31:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126184 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal) [23:31:17] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2089 [23:31:54] (03Merged) 10jenkins-bot: InitialiseSettings-labs: enable multiblocks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126184 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal) [23:31:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2089 [23:34:26] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [23:38:27] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2089 to codfw - jhancock@cumin2002" [23:38:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-be2089 to codfw - jhancock@cumin2002" [23:38:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:38:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2089 [23:38:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2089 [23:40:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host 
ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:43:23] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:43:42] (03PS1) 10Fabfur: cache: enable benthos on A:cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) [23:45:36] (03PS2) 10Fabfur: cache: enable benthos on A:cp-text_ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) [23:46:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126190 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [23:47:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be2089.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:59:17] (03PS3) 10Aaron Schulz: Update Docker images of staging changeprop services to ones using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588)