[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0000) [00:05:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:10:29] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [00:18:49] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:19:41] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:19:48] (03CR) 10Btullis: "There is an interesting discussion about this here: 
https://wikimedia.slack.com/archives/C055QGPTC69/p1736582228252779" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109705 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [00:25:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456548 (10phaultfinder) [00:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1110886 [00:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1110886 (owner: 10TrainBranchBot) [00:42:49] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:43:45] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 103, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:46:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:50:36] (03CR) 10Eevans: [C:03+2] cassandra: rotate target_version 'dev' to '4.x' [puppet] - 10https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [00:54:41] (03PS2) 10Eevans: cassandra: set target_dev to 4.x (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) [00:54:43] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE, AS13030/IPv6: Idle - Init7, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:54:49] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:55:13] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [00:57:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1110886 (owner: 10TrainBranchBot) [01:08:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1110888 [01:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1110888 (owner: 10TrainBranchBot) [01:10:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456602 (10phaultfinder) [01:27:39] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1110888 (owner: 10TrainBranchBot) [01:32:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:42:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:43:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:08:00] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.12 [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1110894 
(https://phabricator.wikimedia.org/T382363) [02:08:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.12 [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1110894 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [02:28:30] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.12 [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1110894 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [02:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456703 (10phaultfinder) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0300) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456733 (10phaultfinder) [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0400) [04:01:47] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.12 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110900 (https://phabricator.wikimedia.org/T382363) [04:01:49] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110900 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [04:02:37] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110900 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [04:03:02] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.12 refs T382363 [04:03:06] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [04:10:29] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456773 (10phaultfinder) [04:54:00] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.12 refs T382363 (duration: 50m 57s) [04:54:03] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [04:54:04] (03PS2) 10Jdlrobson: Stop expanding sections by default on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0500) [05:03:08] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.6 (duration: 03m 06s) [05:05:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:32:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:09:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:10:23] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:11:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:12:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:15:04] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM. Let me know when you want to merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/993010 (owner: 10Reedy) [06:22:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456848 (10phaultfinder) [06:25:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:35:31] (03PS1) 10TChin: Eventstreams: Bump image, use service-utils [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) [06:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456854 (10phaultfinder) [06:57:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0700) [07:00:05] marostegui, Amir1, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0700). 
[07:02:16] (03PS2) 10Anzx: knwiki, knwikisource, knwikitionary, knwikiquote: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) [07:02:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:03:11] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) (owner: 10Anzx) [07:03:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) (owner: 10Anzx) [07:07:38] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:21:42] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2412-2415].codfw.wmnet [07:24:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2412-2415].codfw.wmnet [07:24:58] (03PS4) 10Anzx: hiwikisource: logo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) [07:25:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 
(https://phabricator.wikimedia.org/T310961) (owner: 10Anzx) [07:25:29] (03CR) 10Jelto: [C:03+2] Rename mw241[2-5] to wikikube-worker22[12-15] [puppet] - 10https://gerrit.wikimedia.org/r/1110822 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [07:28:40] (03PS3) 10Anzx: knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) [07:29:28] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2412 to wikikube-worker2212 [07:29:31] (03PS5) 10Anzx: hiwikisource: logo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) [07:29:49] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:30:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [07:30:21] status [07:32:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [07:32:39] status [07:33:10] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2412 to wikikube-worker2212 - jelto@cumin1002" [07:33:53] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2412 to wikikube-worker2212 - jelto@cumin1002" [07:33:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:33:54] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2212 [07:34:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2212 [07:34:28] (03PS4) 10Anzx: knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) [07:34:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2412 to wikikube-worker2212 [07:35:41] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2413 to wikikube-worker2213 [07:35:48] (03PS6) 10Anzx: hiwikisource: logo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) [07:36:02] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:39:26] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2413 to wikikube-worker2213 - jelto@cumin1002" [07:39:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw2414:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:39:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2413 to wikikube-worker2213 - jelto@cumin1002" [07:39:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:39:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker2213 [07:39:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2213 [07:40:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2413 to wikikube-worker2213 [07:43:25] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2414 to wikikube-worker2214 [07:43:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:43:46] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:47:17] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2414 to wikikube-worker2214 - jelto@cumin1002" [07:47:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2414 to wikikube-worker2214 - jelto@cumin1002" [07:47:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:47:43] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2214 [07:48:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2214 [07:48:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2414 to wikikube-worker2214 [07:49:08] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2415 to wikikube-worker2215 [07:49:29] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:52:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming mw2415 to wikikube-worker2215 - jelto@cumin1002" [07:53:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2415 to wikikube-worker2215 - jelto@cumin1002" [07:53:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:53:10] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2215 [07:53:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2215 [07:54:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2415 to wikikube-worker2215 [07:54:19] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2212.codfw.wmnet wikikube-worker2213.codfw.wmnet wikikube-worker2214.codfw.wmnet wikikube-worker2215.codfw.wmnet on all recursors [07:54:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2212.codfw.wmnet wikikube-worker2213.codfw.wmnet wikikube-worker2214.codfw.wmnet wikikube-worker2215.codfw.wmnet on all recursors [07:56:23] (03CR) 10Filippo Giunchedi: "LGTM, though I'll let Clement vote" [puppet] - 10https://gerrit.wikimedia.org/r/1110872 (https://phabricator.wikimedia.org/T370527) (owner: 10Andrea Denisse) [07:57:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72020 and previous config saved to /var/cache/conftool/dbconfig/20250114-075741-root.json [08:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T0800). [08:00:05] ottomata, gmodena and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:16] o/ [08:00:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10456880 (10Marostegui) 05Open→03Resolved Looks good! thank you! ` root@es1043:~# mii-tool eno8303 eno8303: negotiated 1000baseT-FD flow-control, link ok ` [08:01:25] o/ [08:02:46] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2212.codfw.wmnet with OS bookworm [08:02:56] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2212 [08:05:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:07:04] (03PS2) 10Marostegui: orchestrator.conf.json.erb: Update whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1110819 [08:07:04] (03PS1) 10Marostegui: es1044: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111160 [08:08:37] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2212 - jelto@cumin1002" [08:08:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2212 - jelto@cumin1002" [08:08:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:08:47] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2212.codfw.wmnet 59.32.192.10.in-addr.arpa 9.5.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:08:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2212.codfw.wmnet 59.32.192.10.in-addr.arpa 9.5.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:08:51] !log jelto@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2212 [08:09:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2212 [08:09:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2212 [08:10:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2213.codfw.wmnet with OS bookworm [08:10:29] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:34] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2213 [08:10:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:12:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72021 and previous config saved to /var/cache/conftool/dbconfig/20250114-081246-root.json [08:14:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2213 - jelto@cumin1002" [08:14:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2213 - jelto@cumin1002" [08:14:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:14:19] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2213.codfw.wmnet 60.32.192.10.in-addr.arpa 0.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:14:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker2213.codfw.wmnet 60.32.192.10.in-addr.arpa 0.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:14:22] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2213 [08:14:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2213 [08:14:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2213 [08:15:08] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2214.codfw.wmnet with OS bookworm [08:15:19] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2214 [08:15:45] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:17:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [08:17:14] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: migrate ops instance to prometheus::instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [08:17:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) (owner: 10Stevemunene) [08:18:36] (03CR) 10Muehlenhoff: "That won't work :-) Couple of comments inline how to clean that out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [08:19:25] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2214 - jelto@cumin1002" [08:19:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2214 - jelto@cumin1002" [08:19:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:19:30] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2214.codfw.wmnet 61.32.192.10.in-addr.arpa 1.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:19:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2214.codfw.wmnet 61.32.192.10.in-addr.arpa 1.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:19:33] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2214 [08:19:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2214 [08:19:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2214 [08:21:48] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2215.codfw.wmnet with OS bookworm [08:21:55] !log installing perl security updates [08:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:59] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2215 [08:22:06] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:25:34] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2215 - jelto@cumin1002" [08:25:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2215 - jelto@cumin1002" [08:25:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:39] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2215.codfw.wmnet 62.32.192.10.in-addr.arpa 2.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:25:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2215.codfw.wmnet 62.32.192.10.in-addr.arpa 2.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:25:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2215 [08:25:43] anzx: gmodena: did you get your patch deployed? [08:25:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2215 [08:25:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2215 [08:26:05] damn bot [08:26:10] that is so verbose [08:26:23] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2212.codfw.wmnet with reason: host reimage [08:26:25] (03CR) 10Gmodena: [C:03+1] Revert^2 "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110777 (owner: 10Ottomata) [08:27:20] gmodena: is there anything specific to do for your patch? I recognize that is a config change I revert solely because it got merged while I was deploying the train. But beside that I don't know what it does [08:27:32] looks like that is just clean up? [08:27:39] hashar not deployed yet. 
Normally I'd do myself, but this is a patch you previously reverted (it landed during a deployment train) and wanted to ask for an ack [08:27:46] hashar correct, it's a cleanup [08:27:47] ah yeah [08:27:49] please do! [08:27:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72022 and previous config saved to /var/cache/conftool/dbconfig/20250114-082751-root.json [08:28:06] hashar ack. I'll deploy [08:28:17] this way I can review the other two patches. Thank you! [08:28:30] hashar np. thanks for checking in! [08:28:53] (03CR) 10Brouberol: [C:03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [08:29:38] (03CR) 10Hashar: [C:03+1] "I will deploy it. Thank you for the patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) (owner: 10Anzx) [08:29:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gmodena@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110777 (owner: 10Ottomata) [08:30:43] (03Merged) 10jenkins-bot: Revert^2 "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110777 (owner: 10Ottomata) [08:31:13] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2213.codfw.wmnet with reason: host reimage [08:31:31] !log gmodena@deploy2002 Started scap sync-world: Backport for [[gerrit:1110777|Revert^2 "config: remove eventbus instrumentation setting"]] [08:32:32] (03CR) 10Hashar: [C:03+1] "I will deploy it. Thank you for the patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) (owner: 10Anzx) [08:32:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2212.codfw.wmnet with reason: host reimage [08:34:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2214.codfw.wmnet with reason: host reimage [08:36:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2213.codfw.wmnet with reason: host reimage [08:37:07] (03PS9) 10Filippo Giunchedi: prometheus: k8s instances migration to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) [08:37:07] (03PS2) 10Filippo Giunchedi: prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) [08:38:53] !log gmodena@deploy2002 otto, gmodena: Backport for [[gerrit:1110777|Revert^2 "config: remove eventbus instrumentation setting"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:40:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2214.codfw.wmnet with reason: host reimage [08:40:33] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4791/co" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [08:42:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72023 and previous config saved to /var/cache/conftool/dbconfig/20250114-084256-root.json [08:43:47] !log gmodena@deploy2002 otto, gmodena: Continuing with sync [08:45:55] RECOVERY - Router 
interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2212.codfw.wmnet with OS bookworm [08:53:24] !log gmodena@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110777|Revert^2 "config: remove eventbus instrumentation setting"]] (duration: 21m 52s) [08:55:02] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbprov2006.codfw.wmnet with reason: os upgrade [08:55:17] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbprov2006.codfw.wmnet with reason: os upgrade [08:55:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2213.codfw.wmnet with OS bookworm [08:58:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72024 and previous config saved to /var/cache/conftool/dbconfig/20250114-085802-root.json [08:58:11] (03PS1) 10Brouberol: airflow: replace the scheduler liveness check by a tcpSocket probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111162 (https://phabricator.wikimedia.org/T383651) [08:58:59] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2215.codfw.wmnet with OS bookworm [08:59:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2215.codfw.wmnet with OS bookworm [08:59:25] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2215 [08:59:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2215 [08:59:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2214.codfw.wmnet with 
OS bookworm [08:59:50] (03CR) 10Btullis: [C:03+1] airflow: replace the scheduler liveness check by a tcpSocket probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111162 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [09:01:27] (03CR) 10Brouberol: [C:03+2] airflow: replace the scheduler liveness check by a tcpSocket probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111162 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [09:05:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:06:07] hashar: if you are going to deploy , i am here [09:06:24] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [09:06:25] yeah I will but the other one is still going on :/ [09:06:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:06:33] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:41] ok, i thought other one was finished [09:08:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [09:09:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [09:09:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [09:09:34] (03PS1) 10Hashar: gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) [09:09:40] let me check [09:09:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [09:09:51] gmodena: your deployment is still going on isn't
it? [09:09:53] (03CR) 10CI reject: [V:04-1] gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [09:10:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [09:10:52] (03PS2) 10Hashar: gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) [09:10:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [09:11:11] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [09:11:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [09:12:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [09:13:45] (03PS2) 10Brouberol: airflow: Allow specific task pods to access the kube-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [09:15:51] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2215.codfw.wmnet with reason: host reimage [09:19:39] (03CR) 10Tiziano Fogli: [C:03+1] thanos-query: write active queries to file [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [09:19:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2215.codfw.wmnet with reason: host reimage [09:21:34] (03PS1) 10Fabfur: Added new stream config for haproxy_requestctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) [09:22:24] (03CR) 10CI reject: [V:04-1] Added new stream config for 
haproxy_requestctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [09:22:36] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:21] Finished scap sync-world: Backport for [[gerrit:1110777|Revert^2 "config: remove eventbus instrumentation setting"]] (duration: 21m 52s) [09:24:22] from the logs [09:24:27] at 8:53:24 UTC [09:24:34] (03CR) 10Marostegui: [C:03+2] es1044: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111160 (owner: 10Marostegui) [09:24:42] (03CR) 10Marostegui: [C:03+2] orchestrator.conf.json.erb: Update whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1110819 (owner: 10Marostegui) [09:24:44] which of course I have missed in the above log spam [09:25:08] anzx: I am doing your patches [09:25:12] ok [09:25:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) (owner: 10Anzx) [09:25:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) (owner: 10Anzx) [09:26:20] (03Merged) 10jenkins-bot: knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111106 (https://phabricator.wikimedia.org/T382802) (owner: 10Anzx) [09:26:22] (03Merged) 10jenkins-bot: hiwikisource: logo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111109 (https://phabricator.wikimedia.org/T310961) (owner: 10Anzx) [09:26:49] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1111106|knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark (T382802)]], 
[[gerrit:1111109|hiwikisource: logo fix (T310961)]] [09:26:55] T382802: Update wordmark and logo width for knwiki , knwikisource , knwikiquote , knwiktionary - https://phabricator.wikimedia.org/T382802 [09:26:55] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [09:27:08] (03PS2) 10Fabfur: Added new stream config for haproxy_requestctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) [09:29:14] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:29:27] (03PS1) 10Marostegui: instances.yaml: Add es1044 [puppet] - 10https://gerrit.wikimedia.org/r/1111168 (https://phabricator.wikimedia.org/T382569) [09:29:56] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1044 [puppet] - 10https://gerrit.wikimedia.org/r/1111168 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [09:31:32] !log hashar@deploy2002 anzx, hashar: Backport for [[gerrit:1111106|knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark (T382802)]], [[gerrit:1111109|hiwikisource: logo fix (T310961)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:31:35] hashar: checking [09:31:36] anzx: changes are on the test servers if you wanna test them [09:31:39] \o/ [09:31:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1044 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P72025 and previous config saved to /var/cache/conftool/dbconfig/20250114-093147-marostegui.json [09:31:51] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [09:32:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72026 and 
previous config saved to /var/cache/conftool/dbconfig/20250114-093216-root.json [09:33:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1022 T382569', diff saved to https://phabricator.wikimedia.org/P72027 and previous config saved to /var/cache/conftool/dbconfig/20250114-093315-marostegui.json [09:33:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es[1022,1043].eqiad.wmnet with reason: cloning [09:33:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es[1022,1043].eqiad.wmnet with reason: cloning [09:34:00] hashar: look good [09:35:31] !log hashar@deploy2002 anzx, hashar: Continuing with sync [09:36:14] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:17] (03PS1) 10Marostegui: es1044: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1111172 [09:36:36] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 106, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:36:39] (03CR) 10Marostegui: [C:03+2] es1044: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1111172 (owner: 10Marostegui) [09:38:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2215.codfw.wmnet with OS bookworm [09:38:24] (03CR) 10Muehlenhoff: [C:03+2] Switch Presto access to nftables-compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1109411 (owner: 10Muehlenhoff) [09:40:40] (03CR) 10Volans: [C:03+2] enum: remove type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778 (owner: 10Volans) [09:42:30] (03CR) 10Marostegui: [C:03+1] P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 
10Scott French) [09:43:01] (03CR) 10Jelto: "looks mostly good. But I don't understand the difference between `kubectl$(K8S_VERSION)' and `kubectl-$(K8S_VERSION)'. Why do we need both" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:43:10] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111106|knwiki, knwikisource, knwiktionary, knwikiquote: update logo, wordmark (T382802)]], [[gerrit:1111109|hiwikisource: logo fix (T310961)]] (duration: 16m 21s) [09:43:14] T382802: Update wordmark and logo width for knwiki , knwikisource , knwikiquote , knwiktionary - https://phabricator.wikimedia.org/T382802 [09:43:15] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961 [09:43:17] (03PS3) 10Hashar: gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) [09:43:22] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [09:43:23] hashar: could run https://www.irccloud.com/pastebin/e9jWrN4j/ [09:43:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc4 T383398', diff saved to https://phabricator.wikimedia.org/P72028 and previous config saved to /var/cache/conftool/dbconfig/20250114-094350-marostegui.json [09:43:54] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [09:43:55] !log homer 'lsw1-c3-codfw*' commit 'T377877' [09:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:59] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [09:44:18] (03CR) 10Hashar: "PCC failed with:" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [09:44:24] 06SRE, 10Ceph, 
06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10457073 (10cmooney) >>! In T371501#10453986, @dcaro wrote: > We still have to restart all the osd daemon processes to pick up the config chan... [09:44:29] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [09:44:40] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov2006.codfw.wmnet: Renew puppet certificate - root@cumin1002 [09:44:44] anzx: yes I will [09:44:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc[2014-2016].codfw.wmnet,pc1016.eqiad.wmnet with reason: reorganizing pc4 [09:44:50] (03PS2) 10Muehlenhoff: Presto: Remove ferm support [puppet] - 10https://gerrit.wikimedia.org/r/1109412 [09:44:57] I thought the logo got purged automagically [09:45:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc[2014-2016].codfw.wmnet,pc1016.eqiad.wmnet with reason: reorganizing pc4 [09:45:12] (03CR) 10Gmodena: [C:03+1] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [09:45:47] anzx: done [09:46:53] hashar: thank you [09:47:10] \o/ [09:47:10] !log homer 'cr*codfw*' commit 'T377877' [09:47:12] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov2006.codfw.wmnet: Renew puppet certificate - root@cumin1002 [09:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72030 and previous config saved to /var/cache/conftool/dbconfig/20250114-094722-root.json [09:47:54] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 120, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109412 (owner: 10Muehlenhoff) [09:48:38] (03PS3) 10Volans: api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) [09:48:43] (03PS4) 10Hashar: gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) [09:48:51] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [09:50:45] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2212-2215].codfw.wmnet [09:50:46] (03PS1) 10Marostegui: mariadb: Reorganize pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1111174 (https://phabricator.wikimedia.org/T383398) [09:50:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2212-2215].codfw.wmnet [09:51:53] (03Merged) 10jenkins-bot: 
enum: remove type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778 (owner: 10Volans) [09:52:18] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383595#10457092 (10Jelto) [09:53:10] (03CR) 10Btullis: [C:03+1] "Super, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1109412 (owner: 10Muehlenhoff) [09:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote pc2014 to codfw pc4 master dbmaint T383398', diff saved to https://phabricator.wikimedia.org/P72031 and previous config saved to /var/cache/conftool/dbconfig/20250114-095320-marostegui.json [09:53:25] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [09:53:28] (03CR) 10Btullis: [C:03+2] airflow: Allow specific task pods to access the kube-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [09:54:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc4 T383398', diff saved to https://phabricator.wikimedia.org/P72032 and previous config saved to /var/cache/conftool/dbconfig/20250114-095404-marostegui.json [09:54:17] (03CR) 10Jelto: Support multiple kubernetes-client versions (031 comment) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:54:58] (03Merged) 10jenkins-bot: airflow: Allow specific task pods to access the kube-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [09:56:22] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1109672 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:58:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 
(https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:58:54] (03CR) 10Marostegui: [C:03+2] mariadb: Reorganize pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1111174 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [09:59:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:59:58] (03CR) 10CI reject: [V:04-1] api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) (owner: 10Volans) [10:00:15] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:01:47] (03PS1) 10Marostegui: mariadb: Reorganize pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111179 (https://phabricator.wikimedia.org/T383398) [10:01:48] (03PS4) 10Volans: api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) [10:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72033 and previous config saved to /var/cache/conftool/dbconfig/20250114-100227-root.json [10:02:37] (03CR) 10Muehlenhoff: [C:03+2] Presto: Remove ferm support [puppet] - 10https://gerrit.wikimedia.org/r/1109412 (owner: 10Muehlenhoff) [10:02:53] (03CR) 10Marostegui: [C:03+2] mariadb: Reorganize pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111179 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [10:03:02] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [10:03:09] moritzm: ok to merge? 
[10:03:23] yes, please [10:03:30] moritzm: merging [10:03:33] thx [10:03:37] :* [10:05:22] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [10:11:05] (03PS1) 10Muehlenhoff: Switch an-test-presto1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1111180 [10:12:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff) [10:14:48] (03CR) 10Jelto: [C:03+1] "lgtm, I tested the checksum part locally and it works as expected" [debs/calico] (v3.29) - 10https://gerrit.wikimedia.org/r/1109671 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:15:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:49] (03CR) 10Hashar: [C:03+1] "The PCC shows contint being updated which is updating `/var/lib/zuul/.ssh/known_hosts`" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [10:16:56] (03PS1) 10Jcrespo: dbbackups: Review and update grants for dump user on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111182 [10:17:01] (03PS1) 10Marostegui: site.pp: Reorganize pc sections [puppet] - 10https://gerrit.wikimedia.org/r/1111183 (https://phabricator.wikimedia.org/T383398) [10:17:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72034 and previous config saved to /var/cache/conftool/dbconfig/20250114-101732-root.json [10:18:04] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize pc sections [puppet] - 10https://gerrit.wikimedia.org/r/1111183 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [10:19:03] (03CR) 10CI reject: [V:04-1] dbbackups: Review and update grants for dump user on codfw [puppet] -
10https://gerrit.wikimedia.org/r/1111182 (owner: 10Jcrespo) [10:20:24] (03CR) 10JMeybohm: Support multiple kubernetes-client versions (031 comment) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:20:30] (03PS2) 10Jcrespo: dbbackups: Review and update grants for m1 dump user on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111182 (https://phabricator.wikimedia.org/T373579) [10:21:52] (03CR) 10Clément Goubert: [C:03+1] profile::mediawiki::common: Remove obsolete DSH group check [puppet] - 10https://gerrit.wikimedia.org/r/1110872 (https://phabricator.wikimedia.org/T370527) (owner: 10Andrea Denisse) [10:22:50] (03PS1) 10Marostegui: valid_sections.pp: Add pc6 and pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1111184 (https://phabricator.wikimedia.org/T383234) [10:23:09] (03PS3) 10Jcrespo: dbbackups: Review and update grants for m1 dump user on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111182 (https://phabricator.wikimedia.org/T373579) [10:26:54] (03PS1) 10Marostegui: dbproxy2006: Change m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1111185 (https://phabricator.wikimedia.org/T373579) [10:27:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [10:27:43] (03CR) 10Marostegui: "root@cumin1002:~# host 10.192.28.6" [puppet] - 10https://gerrit.wikimedia.org/r/1111185 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:27:59] (03PS1) 10Jelto: Rename mw237[3-6] to wikikube-worker22[16-19] [puppet] - 10https://gerrit.wikimedia.org/r/1111187 (https://phabricator.wikimedia.org/T377877) [10:30:30] (03CR) 10Jcrespo: [C:03+1] dbproxy2006: Change m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1111185 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:30:42] (03CR) 10Marostegui: [C:03+2] dbproxy2006: Change m2 master [puppet] - 
10https://gerrit.wikimedia.org/r/1111185 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [10:32:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72035 and previous config saved to /var/cache/conftool/dbconfig/20250114-103238-root.json [10:33:37] (03CR) 10Cathal Mooney: "One comment in-line on the separate NAT source. But happy for this to proceed overall." [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T383261) (owner: 10FNegri) [10:34:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2235.codfw.wmnet with reason: upgrade [10:34:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2235.codfw.wmnet with reason: upgrade [10:35:23] !log Reboot db2235 m5 codfw master [10:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:10] (03CR) 10Cathal Mooney: [C:03+2] Validators: Allow an interface to be called just "irb" on a device [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:37:30] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [10:37:58] (03PS1) 10Marostegui: wmnet: Update pc4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1111189 (https://phabricator.wikimedia.org/T383398) [10:38:50] PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2235.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2235.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:29] (03Merged) 10jenkins-bot: Validators: 
Allow an interface to be called just "irb" on a device [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [10:39:30] ^ expected (I downtimed it) [10:39:50] Ah, I made a typo for the hostname and that's why it alerted [10:39:54] anyway, will recover soon [10:41:21] (03CR) 10Marostegui: [C:03+2] wmnet: Update pc4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1111189 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [10:41:26] !log marostegui@dns1006 START - running authdns-update [10:41:48] RECOVERY - MariaDB Replica IO: m5 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:06] !log marostegui@dns1006 END - running authdns-update [10:43:39] !log marostegui@dns1006 START - running authdns-update [10:45:23] !log marostegui@dns1006 END - running authdns-update [10:46:47] (03CR) 10JMeybohm: "Not sure what you're referring to but let me try to explain my idea:" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:47:29] (03PS2) 10JMeybohm: Support multiple kubernetes-client versions [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) [10:47:30] (03Abandoned) 10Vgutierrez: liberica: liberica got renamed to libericad [puppet] - 10https://gerrit.wikimedia.org/r/1099216 (owner: 10Vgutierrez) [10:47:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72036 and previous config saved to /var/cache/conftool/dbconfig/20250114-104743-root.json [10:51:27] (03CR) 10Vgutierrez: [C:04-1] "you got some syntax error per https://integration.wikimedia.org/ci/job/alerts-pipeline-test/2216/console" [alerts] - 
10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [10:52:50] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302#10457281 (10cmooney) a:03cmooney I fired off a mail to Planters Telecom Collective asking if they still needed the sessions. I'll remove if they don't come b... [10:57:57] (03PS1) 10Fabfur: hiera: add haproxykafka to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111193 (https://phabricator.wikimedia.org/T378578) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1100) [11:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72037 and previous config saved to /var/cache/conftool/dbconfig/20250114-110248-root.json [11:05:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [11:06:54] (03PS3) 10JMeybohm: Update to kubernetes v1.31.4 [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1109672 (https://phabricator.wikimedia.org/T341984) [11:09:22] 06SRE, 06Traffic, 13Patch-For-Review: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879#10457334 (10fgiunchedi) I'm untagging o11y here since things seem stable and there's no action ATM, please reach out if things change! 
[11:10:26] (03PS1) 10Michael Große: fix(tracking): TimingMetric:observe records milliseconds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111196 (https://phabricator.wikimedia.org/T383208) [11:10:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111196 (https://phabricator.wikimedia.org/T383208) (owner: 10Michael Große) [11:13:25] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1003.eqiad.wmnet with reason: os upgrade [11:13:39] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1003.eqiad.wmnet with reason: os upgrade [11:14:20] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2239.codfw.wmnet with reason: reboot [11:14:46] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: reboot [11:14:56] (03PS1) 10Marostegui: instances.yaml: Remove es1020 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1111198 (https://phabricator.wikimedia.org/T383578) [11:15:37] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1020 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1111198 (https://phabricator.wikimedia.org/T383578) (owner: 10Marostegui) [11:16:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1020 from dbctl for decommission T383578', diff saved to https://phabricator.wikimedia.org/P72038 and previous config saved to /var/cache/conftool/dbconfig/20250114-111647-marostegui.json [11:16:52] T383578: decommission es1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383578 [11:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 50%: Pooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P72039 and previous config saved to /var/cache/conftool/dbconfig/20250114-111754-root.json [11:25:44] (03PS1) 10Marostegui: mariadb: Remove es1020 [puppet] - 10https://gerrit.wikimedia.org/r/1111199 (https://phabricator.wikimedia.org/T383578) [11:27:12] (03PS1) 10Muehlenhoff: presto::server: Specify ports as integers, not strings [puppet] - 10https://gerrit.wikimedia.org/r/1111200 [11:27:27] (03PS2) 10Muehlenhoff: presto::server: Specify ports as integers, not strings [puppet] - 10https://gerrit.wikimedia.org/r/1111200 [11:28:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es1020.eqiad.wmnet [11:30:14] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es1020 [puppet] - 10https://gerrit.wikimedia.org/r/1111199 (https://phabricator.wikimedia.org/T383578) (owner: 10Marostegui) [11:33:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72040 and previous config saved to /var/cache/conftool/dbconfig/20250114-113259-root.json [11:33:15] (03PS1) 10Vgutierrez: Add missing includes for private1-d8-codfw reverse zones [dns] - 10https://gerrit.wikimedia.org/r/1111201 [11:34:33] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [11:37:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111200 (owner: 10Muehlenhoff) [11:37:57] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [11:38:24] (03PS3) 10Muehlenhoff: presto::server: Specify ports as integers, not strings [puppet] - 10https://gerrit.wikimedia.org/r/1111200 [11:38:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: es1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [11:38:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:38:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1020.eqiad.wmnet [11:39:16] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383578#10457409 (10Marostegui) a:05Marostegui→03None [11:39:37] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383578#10457415 (10Marostegui) This is ready for #dc-ops [11:40:47] (03CR) 10JMeybohm: [C:03+1] Rename mw237[3-6] to wikikube-worker22[16-19] [puppet] - 10https://gerrit.wikimedia.org/r/1111187 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [11:41:22] (03CR) 10Volans: [C:03+1] "Include LGTM and the files are already generated." 
[dns] - 10https://gerrit.wikimedia.org/r/1111201 (owner: 10Vgutierrez) [11:42:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111193 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [11:42:12] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update to calico v3.29.1 [debs/calico] (v3.29) - 10https://gerrit.wikimedia.org/r/1109671 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:42:29] (03CR) 10Mvolz: [C:04-1] rest-gateway: add params to config, rework citoid path matching (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [11:43:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:45:07] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271) (owner: 10JHathaway) [11:46:10] 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q3-Q4), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10457419 (10joanna_borun) Approved [11:48:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1044 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P72041 and previous config saved to /var/cache/conftool/dbconfig/20250114-114804-root.json [11:50:59] (03CR) 10Ladsgroup: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1111184 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [11:51:54] (03CR) 10Vgutierrez: [C:03+2] Add missing includes for private1-d8-codfw reverse zones [dns] - 
10https://gerrit.wikimedia.org/r/1111201 (owner: 10Vgutierrez) [11:52:51] !log vgutierrez@dns1004 START - running authdns-update [11:53:25] (03CR) 10Marostegui: [C:03+2] valid_sections.pp: Add pc6 and pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1111184 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [11:54:00] PROBLEM - Host dbproxy1025 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111200 (owner: 10Muehlenhoff) [11:54:44] !log vgutierrez@dns1004 END - running authdns-update [11:55:02] RECOVERY - Host dbproxy1025 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:55:44] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2212 gradually with 4 steps - Maint over [11:57:58] (03PS4) 10Muehlenhoff: presto::server: Specify ports as integers, not strings [puppet] - 10https://gerrit.wikimedia.org/r/1111200 [11:58:19] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10457455 (10cmooney) [11:58:57] (03CR) 10Volans: [C:03+2] "Merging, last PS just fixed a typo in a docstring." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) (owner: 10Volans) [11:59:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111200 (owner: 10Muehlenhoff) [12:01:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:01:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2015,2017].codfw.wmnet,pc[1014-1015,1017].eqiad.wmnet with reason: maintenance [12:02:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2015,2017].codfw.wmnet,pc[1014-1015,1017].eqiad.wmnet with reason: maintenance [12:02:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc5 eqiad codfw dbmaint T383398', diff saved to https://phabricator.wikimedia.org/P72043 and previous config saved to /var/cache/conftool/dbconfig/20250114-120234-marostegui.json [12:02:38] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [12:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc5 T383398', diff saved to https://phabricator.wikimedia.org/P72044 and previous config saved to /var/cache/conftool/dbconfig/20250114-120804-marostegui.json [12:08:09] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [12:10:07] (03Merged) 10jenkins-bot: api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) (owner: 10Volans) [12:10:29] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[12:11:53] (03PS1) 10Marostegui: pc1014: Move it to pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1111205 (https://phabricator.wikimedia.org/T383398) [12:12:58] (03PS3) 10Giuseppe Lavagetto: ClusterConfig: add support for dumps trait [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) [12:12:58] (03PS3) 10Giuseppe Lavagetto: Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) [12:13:00] (03CR) 10Marostegui: [C:03+2] pc1014: Move it to pc4 [puppet] - 10https://gerrit.wikimedia.org/r/1111205 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [12:16:06] (03PS3) 10Volans: api: allow to skip the START log to SAL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) [12:16:20] (03PS1) 10Btullis: airflow: Use the existing labels for kubernetes and spark operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111206 (https://phabricator.wikimedia.org/T383430) [12:16:41] (03PS1) 10Marostegui: site.pp: Reorganize pc4 and pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111207 (https://phabricator.wikimedia.org/T383398) [12:17:33] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize pc4 and pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111207 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [12:18:01] 06SRE, 10Scap, 06serviceops-radar: Introduce state to Scap - https://phabricator.wikimedia.org/T209881#10457507 (10jijiki) 05Open→03Invalid With #mw-on-k8s, this task is invalid. 
[12:18:36] (03PS2) 10Btullis: airflow: Use the existing labels for kubernetes and spark operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111206 (https://phabricator.wikimedia.org/T383430) [12:18:47] jouncebot: now and next [12:18:47] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [12:19:13] jouncebot: bro :( [12:20:42] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: k8s instances migration to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [12:23:19] 06SRE, 10Scap, 06serviceops-radar: Introduce state to Scap - https://phabricator.wikimedia.org/T209881#10457532 (10jijiki) [12:24:43] 06SRE, 06serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143#10457535 (10jijiki) 05Open→03Resolved a:03jijiki I think this is resolved. [12:26:40] 06SRE, 10Scap, 06serviceops, 05Goal: SRE FY2019 Q3:TEC6: First steps towards Canary Deployments - https://phabricator.wikimedia.org/T213156#10457542 (10jijiki) 05Open→03Resolved a:03jijiki Main goal was T282148. 
[12:30:04] (03CR) 10Muehlenhoff: [C:03+2] Switch magru01 to managed /var/lib/ganeti/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109092 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff) [12:32:47] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [12:34:08] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [12:34:15] (03PS3) 10Filippo Giunchedi: prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) [12:37:57] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [12:38:19] (03PS1) 10Marostegui: installserver: Do not format es1042 [puppet] - 10https://gerrit.wikimedia.org/r/1111210 (https://phabricator.wikimedia.org/T382569) [12:41:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2212 gradually with 4 steps - Maint over [12:41:45] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1042 [puppet] - 10https://gerrit.wikimedia.org/r/1111210 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [12:41:45] (03CR) 10Jelto: [C:03+1] "ack thanks for the clarification! Makes sense now." 
[debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:42:51] (03CR) 10Ladsgroup: [C:03+1] P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [12:44:06] (03PS1) 10Marostegui: mariadb: Remove db2128 [puppet] - 10https://gerrit.wikimedia.org/r/1111211 (https://phabricator.wikimedia.org/T383572) [12:44:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2128.codfw.wmnet [12:44:45] (03CR) 10Marostegui: [C:03+2] mariadb: Remove db2128 [puppet] - 10https://gerrit.wikimedia.org/r/1111211 (https://phabricator.wikimedia.org/T383572) (owner: 10Marostegui) [12:49:04] (03PS1) 10Muehlenhoff: Fix Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1111213 [12:49:07] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [12:50:50] (03CR) 10JMeybohm: [C:03+2] Support multiple kubernetes-client versions [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:52:34] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2128.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [12:52:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2128.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [12:52:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2128.codfw.wmnet [12:53:02] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: 
decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572#10457620 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `db2128.codfw.wmnet` - db2128.codfw.wmnet (**... [12:53:04] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572#10457621 (10Marostegui) a:05Marostegui→03None [12:53:20] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572#10457626 (10Marostegui) This is ready for #dc-ops [12:54:47] (03CR) 10Jelto: "I left a comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:55:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:16] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:58:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-ml-staging_31443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:58:32] (03PS1) 10Awight: Switch to explicit numbering for Parsoid footnote markers [extensions/Cite] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) [12:59:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:59:10] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2373-2376].codfw.wmnet [12:59:14] (03CR) 10Awight: [C:04-2] "Cherry-pick to wmf/1.44.0-wmf.12 is scheduled for 20 January" [extensions/Cite] (wmf/1.44.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/1111215 (https://phabricator.wikimedia.org/T382310) (owner: 10Awight) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1300) [13:01:06] (03CR) 10Ladsgroup: Add new file tables to WMCS views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [13:01:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2373-2376].codfw.wmnet [13:03:51] (03PS3) 10NMW03: Add azwiki to mobile-anon-talk dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) [13:04:06] (03CR) 10Jelto: [C:03+2] Rename mw237[3-6] to wikikube-worker22[16-19] [puppet] - 10https://gerrit.wikimedia.org/r/1111187 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [13:04:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) (owner: 10NMW03) [13:04:46] @jouncebot: next [13:04:52] jouncebot: next [13:04:52] In 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1400) [13:06:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2373 to wikikube-worker2216 [13:06:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:06:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [13:06:51] status [13:09:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [13:09:07] status [13:09:46] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2373 to wikikube-worker2216 - jelto@cumin1002" [13:10:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2373 to wikikube-worker2216 - jelto@cumin1002" [13:10:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:06] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2216 [13:10:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2216 [13:11:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2373 to wikikube-worker2216 [13:11:18] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2374 to wikikube-worker2217 [13:11:39] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:15:40] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2374 to wikikube-worker2217 - jelto@cumin1002" [13:15:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2374 to 
wikikube-worker2217 - jelto@cumin1002" [13:15:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2217 [13:16:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2217 [13:16:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2375:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2375 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:16:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2374 to wikikube-worker2217 [13:17:31] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2375 to wikikube-worker2218 [13:17:42] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:21:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2376:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2376 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:22:03] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2375 to wikikube-worker2218 - jelto@cumin1002" [13:22:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2375 to wikikube-worker2218 - jelto@cumin1002" [13:22:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:26] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2218 [13:22:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2218 [13:23:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2375 to wikikube-worker2218 [13:24:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:24:28] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2376 to wikikube-worker2219 [13:24:49] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:28:18] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2376 to wikikube-worker2219 - jelto@cumin1002" [13:28:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2376 to wikikube-worker2219 - jelto@cumin1002" [13:28:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:34] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2219 [13:28:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2219 [13:29:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2376 to wikikube-worker2219 [13:29:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2216.codfw.wmnet wikikube-worker2217.codfw.wmnet wikikube-worker2218.codfw.wmnet wikikube-worker2219.codfw.wmnet on all recursors [13:29:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2216.codfw.wmnet wikikube-worker2217.codfw.wmnet wikikube-worker2218.codfw.wmnet wikikube-worker2219.codfw.wmnet on all recursors [13:29:49] (03CR) 10Btullis: [C:03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111200 (owner: 10Muehlenhoff) [13:30:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:47] (03CR) 10Btullis: [C:03+1] airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109714 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:31:08] (03CR) 10Btullis: [C:03+1] "Didn't we already do this?" [puppet] - 10https://gerrit.wikimedia.org/r/1109714 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:32:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2216.codfw.wmnet with OS bookworm [13:32:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2216 [13:34:03] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:34:35] (03CR) 10Brouberol: "Turns out we did it for search, but not research" [puppet] - 10https://gerrit.wikimedia.org/r/1109714 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:34:42] (03CR) 10Brouberol: [C:03+2] airflow-research: disable the airflow systemd services [puppet] - 10https://gerrit.wikimedia.org/r/1109714 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:35:55] (03CR) 10Brouberol: [C:03+1] airflow: Use the existing labels for kubernetes and spark operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111206 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [13:37:26] (03CR) 10Btullis: [C:03+2] airflow: Use the existing labels for kubernetes and spark operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111206 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [13:37:45] !log jelto@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2216 - jelto@cumin1002" [13:37:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2216 - jelto@cumin1002" [13:37:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:37:50] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2216.codfw.wmnet 145.48.192.10.in-addr.arpa 5.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:37:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2216.codfw.wmnet 145.48.192.10.in-addr.arpa 5.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:37:54] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2216 [13:38:13] (03PS3) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) [13:38:13] (03PS1) 10David Caro: toolforge::prometheus: remove frontproxy-redis [puppet] - 10https://gerrit.wikimedia.org/r/1111221 [13:38:13] (03PS1) 10David Caro: toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 [13:38:52] (03PS2) 10David Caro: toolforge::prometheus: remove frontproxy-redis [puppet] - 10https://gerrit.wikimedia.org/r/1111221 [13:38:52] (03PS2) 10David Caro: toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 [13:39:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2216 [13:39:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2216 
[13:39:36] (03Merged) 10jenkins-bot: airflow: Use the existing labels for kubernetes and spark operators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111206 (https://phabricator.wikimedia.org/T383430) (owner: 10Btullis) [13:39:55] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2217.codfw.wmnet with OS bookworm [13:40:06] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2217 [13:40:14] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:40:34] (03PS3) 10David Caro: toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 (https://phabricator.wikimedia.org/T314664) [13:40:57] (03PS1) 10Brouberol: airflow-research: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1111223 (https://phabricator.wikimedia.org/T380620) [13:40:59] (03PS3) 10David Caro: toolforge::prometheus: remove frontproxy-redis [puppet] - 10https://gerrit.wikimedia.org/r/1111221 (https://phabricator.wikimedia.org/T314664) [13:41:07] (03PS4) 10David Caro: toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 (https://phabricator.wikimedia.org/T314664) [13:41:39] (03CR) 10Brouberol: [C:03+2] airflow-research: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1111223 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [13:42:49] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2218.codfw.wmnet with OS bookworm [13:43:35] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2217 - jelto@cumin1002" [13:43:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2217 - jelto@cumin1002" [13:43:40] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [13:43:40] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2217.codfw.wmnet 146.48.192.10.in-addr.arpa 6.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2217.codfw.wmnet 146.48.192.10.in-addr.arpa 6.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:43:43] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2217 [13:43:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2217 [13:43:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2217 [13:44:09] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:44:15] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2218 [13:44:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:44:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:44:41] !log imported kubernetes 1.23.14-5 to bullseye/bookworm-wikimedia - T341984 [13:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:44] T341984: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 [13:46:16] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673 (10RobH) 03NEW [13:46:37] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10457774 (10RobH) [13:47:08] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - 
https://phabricator.wikimedia.org/T383673#10457777 (10RobH) a:03Eevans Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving... [13:47:50] (03PS1) 10Kamila Součková: kubernetes: rename mw141[4-6,9] -> kubernetes-worker10[99-01] [puppet] - 10https://gerrit.wikimedia.org/r/1111225 (https://phabricator.wikimedia.org/T365571) [13:47:58] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2218 - jelto@cumin1002" [13:48:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2218 - jelto@cumin1002" [13:48:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:02] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2218.codfw.wmnet 147.48.192.10.in-addr.arpa 7.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:48:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2218.codfw.wmnet 147.48.192.10.in-addr.arpa 7.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:48:06] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2218 [13:48:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2218 [13:48:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2218 [13:48:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2219.codfw.wmnet with OS bookworm [13:48:33] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2219 [13:48:48] !log 
jelto@cumin1002 START - Cookbook sre.dns.netbox [13:50:39] (03PS1) 10Daimona Eaytoy: test(2)wiki: Explicitly assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) [13:50:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:50:48] !log imported calico 3.29.1-1 to bookworm-wikimedia - T341984 [13:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] T341984: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 [13:51:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [13:51:19] (03CR) 10CI reject: [V:04-1] test(2)wiki: Explicitly assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:22] (03CR) 10JMeybohm: [C:03+2] Update to kubernetes v1.31.4 [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1109672 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:53:23] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
wikikube-worker2219 - jelto@cumin1002" [13:53:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2219 - jelto@cumin1002" [13:53:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2219.codfw.wmnet 148.48.192.10.in-addr.arpa 8.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:53:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2219.codfw.wmnet 148.48.192.10.in-addr.arpa 8.4.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:53:31] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2219 [13:53:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2219 [13:53:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2219 [13:54:21] (03PS2) 10Daimona Eaytoy: test(2)wiki: Explicitly assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) [13:54:39] (03PS1) 10Btullis: airflow: Allow the scheduler to patch existing pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111230 (https://phabricator.wikimedia.org/T380621) [13:55:26] (03CR) 10Brouberol: [C:03+1] "Spot on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111230 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [13:56:12] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2216.codfw.wmnet with reason: host reimage [13:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:48] (03CR) 10Btullis: [C:03+2] airflow: Allow the scheduler to patch existing pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111230 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [13:57:44] !log imported kubernetes 1.31.4-1 to bookworm-wikimedia - T341984 [13:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] T341984: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 [13:58:15] (03Merged) 10jenkins-bot: airflow: Allow the scheduler to patch existing pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111230 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [13:58:27] (03CR) 10CDanis: [C:03+1] P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1400). Please do the needful. [14:00:05] Daimona, steve_munene, MichaelG_WMF, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] o/ [14:00:15] o/ [14:00:16] * MichaelG_WMF is here :) [14:01:05] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2217.codfw.wmnet with reason: host reimage [14:01:11] My change only adjusts how timing metrics are being tracked. There is probably nothing to test there. 
[14:01:16] Hi Lucas_WMDE checking whether https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1105878 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1105879 are in plan for today's window [14:01:20] o/ [14:01:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:01:39] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:01:45] (03CR) 10FNegri: [C:03+2] Add komla to wmcs-roots [puppet] - 10https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) (owner: 10FNegri) [14:02:20] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [14:02:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2216.codfw.wmnet with reason: host reimage [14:02:43] stevemunene: yes, they’re in the deployment calendar [14:02:45] I can deploy! 
[14:03:04] 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q3-Q4), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10457879 (10fnegri) 05Open→03Resolved [14:03:29] Nice thanks Lucas_WMDE cc dcausse [14:03:35] o/ [14:03:43] let’s start with Daimona [14:03:52] and I think I’d like to deploy those changes separately [14:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109842 (https://phabricator.wikimedia.org/T383154) (owner: 10Daimona Eaytoy) [14:04:24] ^ this one looks a bit larger than I’d be comfortable with deploying together with the other one ^^ [14:04:38] (fortunately the large CampaignEvents config changes will soon be history anyway) [14:04:48] (03Merged) 10jenkins-bot: Enable CampaignEvents extension on idwiki, itwiki, mswiki, and plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109842 (https://phabricator.wikimedia.org/T383154) (owner: 10Daimona Eaytoy) [14:05:17] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1109842|Enable CampaignEvents extension on idwiki, itwiki, mswiki, and plwiki (T383154)]] [14:05:21] T383154: Release CampaignEvents extension to Indonesian, Italian, Malay, and Polish Wikipedia - https://phabricator.wikimedia.org/T383154 [14:05:24] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2218.codfw.wmnet with reason: host reimage [14:05:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2217.codfw.wmnet with reason: host reimage [14:05:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:13] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring 
on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10457893 (10elukey) Some tests to see if JBOD could be forced directly from the OS without rebooting into BIOS: `... [14:08:43] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:08:48] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2218.codfw.wmnet with reason: host reimage [14:09:41] (03PS2) 10Ottomata: admin - remove and deprecate unused eventlogging groups [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) [14:10:58] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:11:01] (03CR) 10Ottomata: "Okay! I updated the README too to fix the docs then." 
[puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:11:43] (03PS2) 10TChin: mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) [14:11:48] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 6.382 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:16] (03PS1) 10Hashar: scap: do not show logo when cleaning old versions [puppet] - 10https://gerrit.wikimedia.org/r/1111233 (https://phabricator.wikimedia.org/T303828) [14:12:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:06] (03PS3) 10TChin: mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) [14:15:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1109842|Enable CampaignEvents extension on idwiki, itwiki, mswiki, and plwiki (T383154)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:15:16] T383154: Release CampaignEvents extension to Indonesian, Italian, Malay, and Polish Wikipedia - https://phabricator.wikimedia.org/T383154 [14:15:32] (03CR) 10Hashar: "Ref: https://phabricator.wikimedia.org/T303828#10456908" [puppet] - 10https://gerrit.wikimedia.org/r/1111233 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [14:15:50] (03CR) 10TChin: mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [14:16:00] diff to 
api.php?action=query&meta=siteinfo&siprop=usergroups|restrictions&format=json&formatversion=2 on the four described wikis looks good to me FWIW [14:16:24] (using https://github.com/lucaswerkmeister/home/blob/main/.bashrc.d/wikimedia-debug-diff) [14:17:58] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2219.codfw.wmnet with OS bookworm [14:18:13] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2219.codfw.wmnet with OS bookworm [14:18:16] Daimona: can you test the change on mwdebug? [14:18:17] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2219 [14:18:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2219 [14:19:42] (03PS4) 10Jcrespo: dbbackups: Review and update grants for m1 dump user on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111182 (https://phabricator.wikimedia.org/T373579) [14:19:54] (03PS1) 10Ssingh: sre.dns.admin: update show to use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 [14:21:10] :eyes: [14:21:35] Lucas_WMDE: looks good to me, thanks! [14:21:44] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:21:46] ok! 
[14:21:55] (03PS2) 10Ssingh: sre.dns.admin: update show to use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 [14:22:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2216.codfw.wmnet with OS bookworm [14:23:00] (03CR) 10Andrew Bogott: [C:03+1] "it sure has" [puppet] - 10https://gerrit.wikimedia.org/r/1111222 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [14:23:47] (03CR) 10Andrew Bogott: [C:03+1] toolforge::prometheus: remove frontproxy-redis [puppet] - 10https://gerrit.wikimedia.org/r/1111221 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [14:24:26] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:24:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [14:25:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2217.codfw.wmnet with OS bookworm [14:26:27] stevemunene: I’m guessing it’s probably okay to deploy your two changes together? 
(once we get to them) [14:26:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [14:27:04] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov1003.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [14:28:12] Yes it is Lucas_WMDE cc dcausse [14:28:31] ok [14:28:34] (03CR) 10CI reject: [V:04-1] sre.dns.admin: update show to use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [14:28:38] +1 [14:28:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2218.codfw.wmnet with OS bookworm [14:29:17] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109842|Enable CampaignEvents extension on idwiki, itwiki, mswiki, and plwiki (T383154)]] (duration: 23m 59s) [14:29:20] T383154: Release CampaignEvents extension to Indonesian, Italian, Malay, and Polish Wikipedia - https://phabricator.wikimedia.org/T383154 [14:29:25] (03CR) 10Ssingh: "Failure is to be expected but I will rebase and ask for review when the Spicerack change is deployed." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [14:29:51] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov1003.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [14:30:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:30:44] (03CR) 10Volans: [C:03+2] api: allow to skip the START log to SAL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) (owner: 10Volans) [14:30:52] (03Merged) 10jenkins-bot: test(2)wiki: Explicitly assign event organizer rights to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111227 (https://phabricator.wikimedia.org/T376822) (owner: 10Daimona Eaytoy) [14:31:05] (03PS1) 10Btullis: airflow: revert the change to the kube-api networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111237 (https://phabricator.wikimedia.org/T380621) [14:31:21] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111227|test(2)wiki: Explicitly assign event organizer rights to all users (T376822)]] [14:31:24] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:31:32] (03CR) 10Jelto: "lgtm, nit: I used T377876 as the task for eqiad renames and reimages" [puppet] - 10https://gerrit.wikimedia.org/r/1111225 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [14:31:40] (03CR) 10Jelto: [C:03+1] kubernetes: rename mw141[4-6,9] -> kubernetes-worker10[99-01] [puppet] - 10https://gerrit.wikimedia.org/r/1111225 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [14:32:34] (03CR) 10Ottomata: [C:03+1] Eventstreams: Bump image, use service-utils [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1111105 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [14:33:52] (03CR) 10FNegri: Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T383261) (owner: 10FNegri) [14:34:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:35:19] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2219.codfw.wmnet with reason: host reimage [14:35:44] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:53] (03CR) 10Ottomata: [C:03+2] admin - remove and deprecate unused eventlogging groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:37:00] (03CR) 10Btullis: [C:03+2] airflow: revert the change to the kube-api networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111237 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [14:37:17] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10458049 (10fnegri) [14:37:47] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Backport for [[gerrit:1111227|test(2)wiki: Explicitly assign event organizer rights to all users (T376822)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:50] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:38:00] Lucas_WMDE: how many minutes will it take to reach my patch in the deployment list? I have to leave in ~10 minutes [14:38:06] (03PS1) 10Marostegui: production-parsercache.sql.erb: Add new sections [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) [14:38:26] (03CR) 10Marostegui: "This is a noop,no grants are changing" [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:38:44] (03CR) 10Brouberol: [C:03+1] airflow: revert the change to the kube-api networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111237 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [14:38:44] (03Merged) 10jenkins-bot: airflow: revert the change to the kube-api networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111237 (https://phabricator.wikimedia.org/T380621) (owner: 10Btullis) [14:39:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2219.codfw.wmnet with reason: host reimage [14:39:22] Nemoralis: we definitely won’t have time for it then, sorry :( [14:39:24] (03CR) 10Ladsgroup: production-parsercache.sql.erb: Add new sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:39:34] 10 minutes isn’t enough to finish this deployment and start yours even if we jump the rest of the queue [14:39:42] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [14:39:57] Daimona: AFAICT the only difference is that the rights get reordered, i.e. 
effectively a no-op ^^ [14:39:57] (03CR) 10Marostegui: production-parsercache.sql.erb: Add new sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:39:59] can you confirm? [14:40:08] (03PS2) 10Marostegui: production-parsercache.sql.erb: Add new sections [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) [14:40:15] Lucas_WMDE: ok, I will reschedule my patch to late backport window, thanks :D [14:40:17] (03CR) 10Marostegui: production-parsercache.sql.erb: Add new sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:41:16] I’ll start gate-and-submit for the backport already [14:41:18] jouncebot: next [14:41:18] In 1 hour(s) and 18 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1600) [14:41:25] (03Merged) 10jenkins-bot: api: allow to skip the START log to SAL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105666 (https://phabricator.wikimedia.org/T324655) (owner: 10Volans) [14:41:27] (03PS1) 10Ottomata: configcluster.yaml - remove eventlogging from profile::etcd::tlsproxy::acls [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) [14:41:28] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111196 (https://phabricator.wikimedia.org/T383208) (owner: 10Michael Große) [14:41:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [14:42:15] (03CR) 10Ottomata: "Moritz, I'm not sure if this is the right thing to do. We can abandon if we should just leave this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:43:06] ok, great :) [14:43:31] is it okay to deploy then? [14:46:02] Lucas_WMDE: mine (GrowthExperiments) is okay to deploy, but not sure if you were asking me or Daimona [14:46:09] I’m asking Daimona [14:46:16] Yup, okay to deploy, sorry! [14:46:17] still sitting at the “continue with sync?” prompt [14:46:19] !log lucaswerkmeister-wmde@deploy2002 daimona, lucaswerkmeister-wmde: Continuing with sync [14:46:21] ok, thanks! [14:46:48] Amir1: happy with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1111238 ? [14:46:49] (03CR) 10JMeybohm: [C:03+1] shellbox-syntaxhighlight: 1 eqiad replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087579 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [14:47:19] (03CR) 10Ladsgroup: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:47:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [14:47:27] (03CR) 10Marostegui: [C:03+2] production-parsercache.sql.erb: Add new sections [puppet] - 10https://gerrit.wikimedia.org/r/1111238 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:47:58] Thanks marostegui. 
If you see it somewhere, let's just change it to pcX or remove the list altogether (depending on the case) [14:48:44] jouncebot: next [14:48:45] In 1 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1600) [14:48:49] (03CR) 10Filippo Giunchedi: [C:03+2] thanos-query: write active queries to file [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [14:48:58] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1111241 [14:50:00] godog: I’m currently deploying, and if it’s okay I’d probably like to overrun the window [14:50:07] (as there are still some changes pending) [14:50:47] Lucas_WMDE: ack thank you that's fine [14:50:53] ok :) [14:51:41] (03PS1) 10Brouberol: airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) [14:52:40] (03PS5) 10Jcrespo: dbbackups: Review and update grants for m1 dump user on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1111182 (https://phabricator.wikimedia.org/T373579) [14:52:48] (03CR) 10Btullis: [C:03+1] airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:53:35] (03PS2) 10Brouberol: airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) [14:53:47] (03CR) 10Stevemunene: "looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:53:47] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111227|test(2)wiki: Explicitly assign event organizer rights to all users (T376822)]] (duration: 22m 26s) [14:53:51] T376822: Configure the CampaignEvents extension to use the event-organizer group by default - https://phabricator.wikimedia.org/T376822 [14:54:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [14:54:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) (owner: 10Stevemunene) [14:54:37] (03CR) 10CI reject: [V:04-1] airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:54:50] (03Merged) 10jenkins-bot: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [14:54:53] (03Merged) 10jenkins-bot: Make WikimediaCampaignEvents use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105878 (https://phabricator.wikimedia.org/T377956) (owner: 10Stevemunene) [14:55:19] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1105879|Make WikibaseQualityConstraints use split-graph query service (T374021)]], [[gerrit:1105878|Make WikimediaCampaignEvents use split-graph query service (T377956)]] [14:55:24] T374021: Make WikibaseQualityConstraints use split-graph query service - 
https://phabricator.wikimedia.org/T374021 [14:55:24] T377956: Make WikimediaCampaignEvents use split-graph query service - https://phabricator.wikimedia.org/T377956 [14:57:06] (03PS3) 10Brouberol: airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) [14:58:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [14:58:33] Daimona: o/ is there a special or API we could use to test a change to the sparql endpoints used by the WikimediaCampaignEvents extension? [14:58:50] s/special/special page/ [14:59:12] Lucas_WMDE: please LMK once you are done and I'll finish the rollout of my change in eqiad [14:59:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2219.codfw.wmnet with OS bookworm [14:59:29] can do [14:59:41] I can also take a break in between if you want [14:59:42] (03CR) 10Brouberol: [C:03+2] airflow: revert to having the scheduling using an http check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111245 (https://phabricator.wikimedia.org/T380620) (owner: 10Brouberol) [14:59:45] (though idk how long the rollout takes ^^) [14:59:53] (03CR) 10Majavah: [C:03+1] "whoops" [puppet] - 10https://gerrit.wikimedia.org/r/1111221 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [14:59:56] (03CR) 10Majavah: [C:03+1] toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [15:00:18] dcausse: hi! Yep, you can test it here: https://meta.wikimedia.org/wiki/Special:AllEvents?tab=form-tabs-1 (also on other wikis with the CampaignEvents extension enabled) [15:00:30] thx! 
[15:01:26] !log lucaswerkmeister-wmde@deploy2002 stevemunene, lucaswerkmeister-wmde: Backport for [[gerrit:1105879|Make WikibaseQualityConstraints use split-graph query service (T374021)]], [[gerrit:1105878|Make WikimediaCampaignEvents use split-graph query service (T377956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:01:31] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [15:01:31] T377956: Make WikimediaCampaignEvents use split-graph query service - https://phabricator.wikimedia.org/T377956 [15:01:47] I can test the WikibaseQualityConstraints part [15:01:50] Lucas_WMDE: please go ahead, thank you though [15:01:55] Lucas_WMDE: thanks! [15:02:05] rollout on my end is quick but it can impact monitoring queries [15:03:05] (03Merged) 10jenkins-bot: fix(tracking): TimingMetric:observe records milliseconds [extensions/GrowthExperiments] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111196 (https://phabricator.wikimedia.org/T383208) (owner: 10Michael Große) [15:03:18] BTW, apologies but I can't test the CampaignEvents stuff because I'm overwhelmed with meetings today :) [15:03:35] (in the context of split-graph migration) [15:03:57] But do ping me if anything looks wrong and I'll reserve some time to take a look [15:04:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:04:16] hm, https://www.wikidata.org/wiki/Special:ConstraintReport/Q4115189 stops showing the distinct-values constraint violation when I turn on WikimediaDebug :/ [15:04:28] dcausse: do you know if the split query service is lagging behind more, perhaps? [15:04:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:04:48] * Lucas_WMDE tries to query the split services manually [15:04:59] Lucas_WMDE: no it should not... 
[15:05:14] you’re right, https://w.wiki/CigH finds it just fine [15:05:21] hm [15:05:28] * Lucas_WMDE peeks at logstash [15:05:30] constraints are checked via a job? [15:05:41] so perhaps not easily testable? [15:05:46] no, they’re checked live [15:05:50] ok [15:05:53] and the special page also bypasses the cache that’s normally there [15:06:07] (they used to be checked via jobs too but that’s currently disabled I believe) [15:06:28] ack [15:06:36] no messages from server:www.wikidata.org in logstash mwdebug at all o_O [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:12] so for the campaigns I can navigate just fine through Special:AllEvents on meta hitting a debug [15:07:13] https://www.wikidata.org/wiki/Special:ConstraintReport/Q35017419 doesn’t show the duplicate on the sandbox either [15:07:17] so it’s broken in both directions, it seems [15:07:23] :/ [15:07:48] I think the WBQC part needs a revert (and then further investigation) [15:07:52] wondering what the better way to do this is [15:08:00] roll out both changes now and then revert the WBQC part [15:08:07] ack [15:08:17] or abort the current deployment and then deploy the WBQC revert (which would include syncing the second change) [15:08:26] that might be faster, actually. 
only one sync-world instead of two [15:09:04] (I was first thinking, it’s fine to still deploy the WBQC change and only revert it afterwards because the breakage isn’t critical, but it would be slower that way anyways ^^) [15:09:42] Lucas_WMDE: I think the duplication cannot work cross graph [15:10:06] wait perhaps it can [15:10:20] * dcausse needs to look at the codebase again [15:10:48] * Lucas_WMDE looks at the code [15:10:58] hm [15:11:02] I think you might be right [15:11:25] we’re not just looking for entities with value X, we’re looking for entities with the same value as the base entity [15:11:28] so they need to be in the same graph [15:11:53] >.< [15:12:01] we need to properly serialize the value we’re looking for into the query [15:12:23] (and use Wikibase’s RdfBuilder stuff for that, instead of the ridiculous home-grown getRdfLiteral() method that I slapped together in what I think was my first month or two at WMDE lol) [15:12:29] !log lucaswerkmeister-wmde@deploy2002 Sync cancelled. 
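[editor's note] The point made above (15:11:25 to 15:12:23) is that the distinct-values check looks for entities with the same value as the base entity, not for a value already written into the query. A minimal Python sketch of the two query shapes follows. It is an illustration under stated assumptions, not the WikibaseQualityConstraints code: the function names, the simplified triple patterns and the string-only escaping are all hypothetical, and the escaping is a stand-in for neither Wikibase's RdfBuilder nor getRdfLiteral().

```python
def escape_sparql_string(value: str) -> str:
    """Escape a plain string for a double-quoted SPARQL literal (strings only)."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def join_based_query(base_entity: str, prop: str) -> str:
    """Hypothetical query shape: the value is looked up from the base entity.

    The value never appears in the query text, so the base entity and every
    candidate must be answerable by the same endpoint (the same graph).
    """
    return (
        f"SELECT DISTINCT ?other WHERE {{\n"
        f"  wd:{base_entity} wdt:{prop} ?value .\n"
        f"  ?other wdt:{prop} ?value .\n"
        f"  FILTER(?other != wd:{base_entity})\n"
        f"}}"
    )


def literal_based_query(base_entity: str, prop: str, value: str) -> str:
    """Hypothetical query shape: the value is serialized into the query text.

    The endpoint no longer needs the base entity's own triples, only the
    candidates' triples.
    """
    literal = '"' + escape_sparql_string(value) + '"'
    return (
        f"SELECT DISTINCT ?other WHERE {{\n"
        f"  ?other wdt:{prop} {literal} .\n"
        f"  FILTER(?other != wd:{base_entity})\n"
        f"}}"
    )


if __name__ == "__main__":
    # Q4115189 is the sandbox item from the log; "P1" is a placeholder property.
    print(join_based_query("Q4115189", "P1"))
    print(literal_based_query("Q4115189", "P1", 'example "value"'))
```

In the first shape the endpoint has to hold the base entity's statement to bind ?value, which is why it stops working when the base entity and the candidates live in different graphs. The second shape removes that dependency but only handles plain strings here; other value types would need proper RDF serialization.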
[15:12:31] so it relies on the query service to be updated to work properly [15:12:54] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Make WikibaseQualityConstraints use split-graph query service" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111253 (https://phabricator.wikimedia.org/T374021) [15:13:10] (03CR) 10Hnowlan: [C:03+1] shellbox-syntaxhighlight: all eqiad replicas on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087580 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:13:16] (03PS4) 10Herron: thanos-rule: manage retention setting [puppet] - 10https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) [15:13:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111253 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister (WMDE)) [15:13:59] (03Merged) 10jenkins-bot: Revert "Make WikibaseQualityConstraints use split-graph query service" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111253 (https://phabricator.wikimedia.org/T374021) (owner: 10Lucas Werkmeister (WMDE)) [15:14:58] ah, and the backport merged in the meantime [15:15:03] so MichaelG_WMF this deployment will include that :) [15:15:15] YaY 😊 [15:15:38] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111253|Revert "Make WikibaseQualityConstraints use split-graph query service" (T374021)]], [[gerrit:1105878|Make WikimediaCampaignEvents use split-graph query service (T377956)]] [15:15:43] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [15:15:43] T377956: Make WikimediaCampaignEvents use split-graph query service - https://phabricator.wikimedia.org/T377956 [15:17:54] (03PS1) 10Filippo Giunchedi: site: add prometheus200[78] [puppet] - 10https://gerrit.wikimedia.org/r/1111256 
(https://phabricator.wikimedia.org/T383232) [15:18:20] dcausse: I left a comment in T374021 [15:18:38] (not sure if that should actually be in that task or a separate new task, we’ll see) [15:19:17] Lucas_WMDE: thanks! sure, I'll followup in phan [15:19:23] phab* [15:19:48] phan 🤝 phab: often need followups [15:21:16] (03CR) 10Dzahn: [C:03+2] scap: do not show logo when cleaning old versions [puppet] - 10https://gerrit.wikimedia.org/r/1111233 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [15:21:31] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4795/co" [puppet] - 10https://gerrit.wikimedia.org/r/1111256 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [15:21:44] !log lucaswerkmeister-wmde@deploy2002 stevemunene, lucaswerkmeister-wmde: Backport for [[gerrit:1111253|Revert "Make WikibaseQualityConstraints use split-graph query service" (T374021)]], [[gerrit:1105878|Make WikimediaCampaignEvents use split-graph query service (T377956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:21:49] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [15:21:49] T377956: Make WikimediaCampaignEvents use split-graph query service - https://phabricator.wikimedia.org/T377956 [15:23:04] now https://www.wikidata.org/wiki/Special:ConstraintReport/Q4115189 *only* shows the constraint violation on mwdebug o_O [15:23:16] but I guess that’s better than the other way around ^^ [15:23:20] MichaelG_WMF: can you test your change? [15:23:37] (I probably should’ve aborted the scap backport and instead added your change URL to the arguments so it would be included in all the messages, meh) [15:23:47] (though at that point the SAL would probably start to get truncated by IRC anyway) [15:24:05] Lucas_WMDE: no, not really. 
It is just a change to how we record metrics with the new system [15:24:11] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572#10458286 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:24:12] ok now https://www.wikidata.org/wiki/Special:ConstraintReport/Q4115189 is working both with and without WikimediaDebug which is expected [15:24:14] oh right [15:24:17] !log lucaswerkmeister-wmde@deploy2002 stevemunene, lucaswerkmeister-wmde: Continuing with sync [15:24:19] syncing then [15:25:28] 10ops-codfw, 06SRE, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10458299 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [15:25:44] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10458301 (10Jhancock.wm) a:03Jhancock.wm [15:25:57] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Degraded RAID due to failed sdy on ms-be2075 - https://phabricator.wikimedia.org/T383530#10458302 (10Jhancock.wm) a:03Jhancock.wm [15:26:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2133,2160,2233].codfw.wmnet with reason: cloning [15:26:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2133,2160,2233].codfw.wmnet with reason: cloning [15:27:38] !log Stop in sync db2133 db2233 m2 codfw dbmaint T373579 [15:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:43] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [15:28:09] (03CR) 10Hnowlan: [C:03+1] mediawiki: enable mesh telemetry in mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 (owner: 10Scott French) [15:29:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - 
https://phabricator.wikimedia.org/T381788#10458314 (10Jhancock.wm) [15:31:41] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111253|Revert "Make WikibaseQualityConstraints use split-graph query service" (T374021)]], [[gerrit:1105878|Make WikimediaCampaignEvents use split-graph query service (T377956)]] (duration: 16m 03s) [15:31:52] T374021: Make WikibaseQualityConstraints use split-graph query service - https://phabricator.wikimedia.org/T374021 [15:31:53] T377956: Make WikimediaCampaignEvents use split-graph query service - https://phabricator.wikimedia.org/T377956 [15:32:03] (03CR) 10Muehlenhoff: [C:03+2] Fix Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1111213 (owner: 10Muehlenhoff) [15:33:25] (03PS1) 10Marostegui: db2233.yaml: Make it master [puppet] - 10https://gerrit.wikimedia.org/r/1111258 (https://phabricator.wikimedia.org/T373579) [15:33:41] !log previous deployment also included [[gerrit:1111196|fix(tracking): TimingMetric:observe records milliseconds]] (T383208) [15:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:44] T383208: StatsLib timings MUST be recorded as milliseconds - https://phabricator.wikimedia.org/T383208 [15:33:49] (03CR) 10Marostegui: [C:03+2] db2233.yaml: Make it master [puppet] - 10https://gerrit.wikimedia.org/r/1111258 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [15:34:01] okay, I think that’s the deployment window done [15:34:09] !log UTC afternoon backport+config window done [15:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:26] godog: fyi ^ [15:34:51] hm, logspam watch has some “PHP Warning: Stats: Cannot add labels to a metric containing samples for 'update_mentee_data_seconds'” [15:34:56] MichaelG_WMF: could that be related to your change? [15:35:00] (but it looks like it stopped again) [15:35:04] * Lucas_WMDE looks at logstash [15:35:21] Mh, is that _new_? 
[15:35:35] maybe not [15:35:36] This is a known issue with our existing code, [15:35:41] last occurrence 15:23 [15:35:58] looks like it started Jan 7 [15:36:06] (03PS1) 10Brouberol: airflow: ensure the pooler URI uses a terninated FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111259 (https://phabricator.wikimedia.org/T383651) [15:36:07] (03PS1) 10Phuedx: Beta Cluster: Update MetricsPlatform extension config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) [15:36:09] (03PS1) 10Phuedx: Enable MetricsPlatform extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) [15:36:11] (03PS1) 10Phuedx: testwiki: Enable MetricsPlatform stream config fetching and merging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) [15:36:11] and really ramped up Jan 10 [15:36:31] so probably unrelated, and just showed up at the top of logspam-watch coincidentally [15:36:40] yes, this is something recently introduced and that part is also touched by my change, but my change should not affect that in particular [15:36:53] (we're working on a fix) [15:36:56] ok [15:37:22] and it’s all on mwmaint2002, so I guess the spike in the logs is just from whenever the maintenance script runs [15:37:30] (every 3 hours, judging by “last 24 hours” in logstash) [15:37:36] (03PS1) 10Marostegui: db2133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111263 (https://phabricator.wikimedia.org/T373579) [15:38:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10458388 (10JMeybohm) Hi @Jhancock.wm - I could do next Monday (20th January) 15:30Z, would that work for you? 
[15:38:27] (03CR) 10Marostegui: [C:03+2] db2133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111263 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [15:40:04] (03CR) 10Andrea Denisse: [C:03+2] profile::mediawiki::common: Remove obsolete DSH group check [puppet] - 10https://gerrit.wikimedia.org/r/1110872 (https://phabricator.wikimedia.org/T370527) (owner: 10Andrea Denisse) [15:40:29] (03CR) 10Btullis: [C:03+1] airflow: ensure the pooler URI uses a terninated FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111259 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [15:40:57] Lucas_WMDE: ack thx [15:42:07] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10458416 (10elukey) I fear that this SAS controller doesn't support JBOD unless it is configured via BIOS, so real... [15:42:33] (03CR) 10Brouberol: [C:03+2] airflow: ensure the pooler URI uses a terninated FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111259 (https://phabricator.wikimedia.org/T383651) (owner: 10Brouberol) [15:43:19] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1111264 (https://phabricator.wikimedia.org/T383689) [15:43:30] (03PS1) 10CDanis: urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) [15:43:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:43:45] (03PS1) 10Gerrit maintenance bot: 
mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1111266 (https://phabricator.wikimedia.org/T383690) [15:43:50] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1111267 (https://phabricator.wikimedia.org/T383690) [15:44:48] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383595#10458451 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:44:50] !log homer 'lsw1-d3-codfw*' commit 'T377877' [15:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:54] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:45:33] !log homer 'cr*codfw*' commit 'T377877' [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:46:29] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 112, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:51] (03PS2) 10Kamila Součková: kubernetes: rename mw141[4-6,9] -> kubernetes-worker10[99-01] [puppet] - 10https://gerrit.wikimedia.org/r/1111225 (https://phabricator.wikimedia.org/T377876) [15:46:58] (03CR) 10David Caro: [C:03+2] toolforge::proxy: remove absenting statement [puppet] - 10https://gerrit.wikimedia.org/r/1111222 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [15:47:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:47:01] (03CR) 10David Caro: [C:03+2] toolforge::prometheus: remove frontproxy-redis [puppet] - 10https://gerrit.wikimedia.org/r/1111221 (https://phabricator.wikimedia.org/T314664) (owner: 10David Caro) [15:47:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker[2216-2219].codfw.wmnet [15:47:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2216-2219].codfw.wmnet [15:48:06] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383691 (10Jelto) 03NEW [15:48:42] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1414-1416,1419].eqiad.wmnet [15:49:55] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw141[4-6,9] -> kubernetes-worker10[99-01] [puppet] - 10https://gerrit.wikimedia.org/r/1111225 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [15:50:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1414-1416,1419].eqiad.wmnet [15:52:13] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1414 to wikikube-worker1098 [15:52:33] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:53:20] !log import prometheus-mysqld-exporter 0.13.0-1~bpo11+1 to the main component of bullseye-wikimedia (import from bullseye-backports which is going away) T383557 [15:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:23] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557 [15:53:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitorin [15:53:59] status [15:53:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - 
kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitorin [15:53:59] status [15:54:53] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1415 to wikikube-worker1099 [15:54:57] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1416 to wikikube-worker1100 [15:55:05] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1419 to wikikube-worker1101 [15:55:24] (03PS1) 10Muehlenhoff: No longer import prometheus-mysqld-exporter from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) [15:56:17] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1414 to wikikube-worker1098 - kamila@cumin1002" [15:56:25] !log kamila@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1414 to wikikube-worker1098 - kamila@cumin1002" [15:56:25] !log kamila@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:56:29] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:56:30] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1414 to wikikube-worker1098 [15:57:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111269 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [15:58:35] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1414 to wikikube-worker1098 [15:59:26] (03PS1) 10Jelto: Rename mw23[69-72] to wikikube-worker222[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1111271 (https://phabricator.wikimedia.org/T377877) [15:59:42] 
07Puppet, 10MW-on-K8s, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q3): Clean up "git repo needs merge" checks - https://phabricator.wikimedia.org/T370530#10458592 (10lmata) [15:59:44] 10SRE-swift-storage, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q3): Remove load_average check for ms-be/thanos-be - https://phabricator.wikimedia.org/T370526#10458593 (10lmata) [16:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1600). [16:00:13] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10458603 (10lmata) [16:00:51] 06SRE, 10observability, 10Observability-Logging, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10458606 (10lmata) [16:00:53] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1415 to wikikube-worker1099 - kamila@cumin1002" [16:01:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1415 to wikikube-worker1099 - kamila@cumin1002" [16:01:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:19] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1099 [16:01:45] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:02:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1099 [16:03:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from 
mw1415 to wikikube-worker1099 [16:03:29] (03CR) 10Giuseppe Lavagetto: ClusterConfig: add support for dumps trait (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [16:04:34] (03PS4) 10Giuseppe Lavagetto: ClusterConfig: add support for dumps trait [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) [16:04:34] (03PS4) 10Giuseppe Lavagetto: Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) [16:05:30] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1414 to wikikube-worker1098 - kamila@cumin1002" [16:05:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1414 to wikikube-worker1098 - kamila@cumin1002" [16:05:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:05:35] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1098 [16:05:39] (03CR) 10BCornwall: [C:03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1111267 (https://phabricator.wikimedia.org/T383690) (owner: 10Gerrit maintenance bot) [16:06:37] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:07:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1098 [16:07:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1414 to wikikube-worker1098 [16:09:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:06] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker1100 [16:09:41] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:10:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1100 [16:11:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1416 to wikikube-worker1100 [16:12:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:04] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1101 [16:13:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1101 [16:13:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1419 to wikikube-worker1101 [16:14:03] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1098.eqiad.wmnet wikikube-worker1099.eqiad.wmnet wikikube-worker1100.eqiad.wmnet wikikube-worker1101.eqiad.wmnet on all recursors [16:14:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1098.eqiad.wmnet wikikube-worker1099.eqiad.wmnet wikikube-worker1100.eqiad.wmnet wikikube-worker1101.eqiad.wmnet on all recursors [16:15:53] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1099.eqiad.wmnet with OS bookworm [16:15:57] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1099 [16:15:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1099 [16:16:04] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1100.eqiad.wmnet with OS bookworm [16:16:08] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1100 [16:16:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1100 [16:16:17] !log kamila@cumin1002 START - Cookbook 
sre.hosts.reimage for host wikikube-worker1101.eqiad.wmnet with OS bookworm [16:16:20] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1101 [16:16:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1101 [16:16:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1098.eqiad.wmnet with OS bookworm [16:16:30] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1098 [16:16:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1098 [16:17:17] (03CR) 10Giuseppe Lavagetto: "The change LGTM. I'd like to see also the addition of some httpbb tests, though." [puppet] - 10https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) (owner: 10Gergő Tisza) [16:20:32] (03CR) 10Gergő Tisza: "The tests are in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1099339. Would you prefer them squashed in one commit?" [puppet] - 10https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) (owner: 10Gergő Tisza) [16:21:08] 06SRE: contint1002 - puppet failure - https://phabricator.wikimedia.org/T383699 (10Dzahn) 03NEW [16:21:29] 06SRE: contint1002 - puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458781 (10Dzahn) [16:24:01] 06SRE: contint1002 - puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458790 (10Dzahn) maybe caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1108772 ? 
[16:25:55] (03CR) 10Dzahn: "does it seem possible this caused puppet breakage like "Error while evaluating a Function Call, value returned from k8s::fetch_clusters ha" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [16:27:26] 06SRE: contint1002 - puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458807 (10Dzahn) [16:29:37] 06SRE, 06collaboration-services, 10observability: contint1002 - puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458860 (10Dzahn) [16:31:59] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1099.eqiad.wmnet with reason: host reimage [16:32:09] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1098.eqiad.wmnet with reason: host reimage [16:32:39] (03CR) 10CDanis: [C:03+1] Added new stream config for haproxy_requestctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [16:33:43] 06SRE, 06collaboration-services, 10observability: contint1002 - puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458884 (10Dzahn) The line in question is: ` $kubernetes_clusters = k8s::fetch_clusters() ` which is a function describ... 
[16:35:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1099.eqiad.wmnet with reason: host reimage [16:38:46] 06SRE, 06collaboration-services, 10observability: contint*- puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458922 (10Dzahn) [16:39:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1098.eqiad.wmnet with reason: host reimage [16:40:04] 06SRE, 06collaboration-services, 10observability: contint*- puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10458941 (10Dzahn) https://puppetboard.wikimedia.org/nodes?status=failed [16:40:24] (03CR) 10Dzahn: "https://puppetboard.wikimedia.org/nodes?status=failed" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [16:47:56] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10458970 (10Jhancock.wm) We are off on the 20th in the US. but the rest of the week is good for me. [16:48:38] (03CR) 10Kamila Součková: [C:03+1] Rename mw23[69-72] to wikikube-worker222[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/1111271 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [16:49:15] (03CR) 10Thcipriani: "Is there more context for this? The only relevant task I could find was from a few years ago (https://phabricator.wikimedia.org/T283607)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1110867 (owner: 10CDanis) [16:50:45] (03PS4) 10Scott French: shellbox-syntaxhighlight: 1 eqiad replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087579 (https://phabricator.wikimedia.org/T377038) [16:50:45] (03PS4) 10Scott French: shellbox-syntaxhighlight: all eqiad replicas on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087580 (https://phabricator.wikimedia.org/T377038) [16:50:45] (03PS4) 10Scott French: shellbox-syntaxhighlight: 1 codfw replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087581 (https://phabricator.wikimedia.org/T377038) [16:50:45] (03PS4) 10Scott French: shellbox-syntaxhighlight: all replicas on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087582 (https://phabricator.wikimedia.org/T377038) [16:53:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1099.eqiad.wmnet with OS bookworm [17:58:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10459407 (10MatthewVernon) ` Jan 13 01:04:10 ms-be2075 kernel: [462667.760590] megaraid_sas 0000:18:00.0: 18530 (790045449s/0x0020/DEAD) - Fatal firmware error: Line 977 in ../..... 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1800) [18:00:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1101.eqiad.wmnet with reason: host reimage [18:04:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1100.eqiad.wmnet with OS bookworm [18:04:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1101.eqiad.wmnet with reason: host reimage [18:04:29] (03PS1) 10Majavah: hieradata: Bump striker-tools to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1111292 [18:04:50] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1109188/4797/" [puppet] - 10https://gerrit.wikimedia.org/r/1109188 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [18:06:29] (03CR) 10Majavah: [C:03+2] hieradata: Bump striker-tools to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1111292 (owner: 10Majavah) [18:06:49] !log kamila@deploy2002 Finished scap sync-world: enable auth.wikimedia.org (duration: 17m 55s) [18:07:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:08] (03CR) 10Andrea Denisse: [C:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1111285/4799/" [puppet] - 10https://gerrit.wikimedia.org/r/1111285 (https://phabricator.wikimedia.org/T383699) (owner: 10Filippo Giunchedi) 
[18:12:11] (03CR) 10Andrea Denisse: [C:03+2] ci: remove 'prometheus' section from kubernetes::clusters [puppet] - 10https://gerrit.wikimedia.org/r/1111285 (https://phabricator.wikimedia.org/T383699) (owner: 10Filippo Giunchedi) [18:12:39] ^ that's me, I broke httpbb on metal [18:13:35] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:20] (03PS1) 10Dduvall: ci: Install memcached for MediaWiki success cache [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) [18:18:29] 06SRE, 06collaboration-services, 10observability, 13Patch-For-Review: contint*- puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10459468 (10andrea.denisse) 05Open→03Resolved a:03andrea.denisse I merged and applied patch #1111285... [18:18:31] 06SRE, 06Traffic, 13Patch-For-Review: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10459472 (10nshahquinn-wmf) Viewing and editing this task is not actually restricted. 
[18:20:48] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10459488 (10dcaro) Just finished restarting all the osd daemons, all the traffic should now be tagged correctly 👍 [18:22:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1101.eqiad.wmnet with OS bookworm [18:26:23] !log installing rsync security updates on bookworm [18:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:37] 06SRE, 06collaboration-services, 10observability, 13Patch-For-Review: contint*- puppet failure - value returned from k8s::fetch_clusters has wrong type - https://phabricator.wikimedia.org/T383699#10459497 (10Dzahn) Thanks for the quick response and fix. I confirm puppet works again :) [18:26:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574) (owner: 10Gergő Tisza) [18:30:56] (03PS2) 10Dduvall: ci: Install memcached for MediaWiki success cache [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) [18:31:23] jouncebot: nowandnext [18:31:23] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1800) [18:31:23] In 0 hour(s) and 28 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1900) [18:34:12] (03CR) 10Scott French: [C:03+2] mediawiki: enable mesh telemetry in mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 (owner: 10Scott French) [18:34:45] in the remainder of the infra window, I'm going to deploy some metrics collection fixes 
for mw-videoscaler [18:35:43] (03CR) 10Dduvall: "Tested via a cherry-pick on `integration-puppetserver-01` and puppet run on `integration-castor05`." [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) (owner: 10Dduvall) [18:35:49] (03CR) 10Dduvall: [C:03+1] ci: Install memcached for MediaWiki success cache [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) (owner: 10Dduvall) [18:36:39] (03Merged) 10jenkins-bot: mediawiki: enable mesh telemetry in mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 (owner: 10Scott French) [18:41:51] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [18:41:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [18:45:04] (03PS1) 10DCausse: search: add alerts for weighted_tags indexing throughput [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) [18:46:43] (03CR) 10CI reject: [V:04-1] search: add alerts for weighted_tags indexing throughput [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) (owner: 10DCausse) [18:49:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [18:49:51] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [18:49:53] (03PS2) 10DCausse: search: add alerts for weighted_tags indexing throughput [alerts] - 10https://gerrit.wikimedia.org/r/1111300 (https://phabricator.wikimedia.org/T373459) [18:53:43] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1111303 [18:55:41] !log swfrench@deploy2002 Started scap sync-world: k8s-only deploy to clear noop chart version diffs [18:57:56] !log swfrench@deploy2002 Finished scap sync-world: k8s-only deploy to clear noop chart version diffs (duration: 02m 15s) [19:00:05] thcipriani and thcipriani: Time to do the 
MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T1900). [19:00:14] oh noes [19:00:16] o/ [19:00:36] looks like I didn't beat the automated deployment calendar run [19:00:40] heh [19:00:42] I'll fix that after this meeting [19:00:52] i'm taking a quick walk over to the post office pre-train, will go ahead here in ~5. [19:01:00] <3 [19:05:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:40] (03PS1) 10Scott French: shellbox-syntaxhighlight: revert eqiad to PHP 7.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111309 [19:06:43] ^ FYI, that's a "just in case" patch. no issues encountered so far :) [19:07:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:12:38] !log 1.44.0-wmf.12 train (T382363): no current blockers, rolling to group0 [19:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:42] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [19:14:38] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111311 (https://phabricator.wikimedia.org/T382363) [19:14:40] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111311 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [19:15:29] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111311 (https://phabricator.wikimedia.org/T382363) 
(owner: 10TrainBranchBot) [19:17:14] 07Puppet, 06SRE, 06Data-Engineering-Radar: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10459746 (10Ottomata) [19:17:43] 07Puppet, 06SRE, 06Data-Engineering-Radar: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10459748 (10Ottomata) Data-Engineering no longer operates udp2log. SRE should feel free to decline this task at will. [19:26:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:28:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:06] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.12 refs T382363 [19:29:10] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [19:29:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:30:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:33:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF 
Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:34:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:38:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:42:03] !log installing rsync security updates on bullseye [19:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:53:20] (03CR) 10JHathaway: [C:03+2] postfix: increase message size limit from 10MiB to 50MiB [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271) (owner: 10JHathaway) [19:57:11] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, 05SUL3: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10460143 (10Tgr) [19:59:37] (03PS1) 10Clare Ming: Experiment Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111318 
(https://phabricator.wikimedia.org/T374957) [20:01:22] !log cdanis@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin2002" [20:01:24] !log cdanis@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin2002 [20:01:55] !log cdanis@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: [not really into teleological thinking] - cdanis@cumin2002 [20:01:57] !log cdanis@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "[not really into teleological thinking] - cdanis@cumin2002" [20:03:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi) [20:03:47] (03PS1) 10Clare Ming: Experiment Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111321 (https://phabricator.wikimedia.org/T374957) [20:04:29] 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10460171 (10jhathaway) a:03jhathaway [20:04:37] (03PS2) 10Clare Ming: Experiment Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111318 (https://phabricator.wikimedia.org/T374957) [20:08:55] (03CR) 10Santiago Faci: [C:03+2] Experiment Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111318 (https://phabricator.wikimedia.org/T374957) (owner: 10Clare Ming) [20:08:56] 06SRE, 
06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10460184 (10jhathaway) 05Open→03Resolved @DSeyfert_WMF this appears to be a regression in our mail servers when migrating from Exim to Postfix. Exim had a defa... [20:08:56] (03CR) 10Santiago Faci: [C:03+2] Experiment Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111321 (https://phabricator.wikimedia.org/T374957) (owner: 10Clare Ming) [20:09:52] (03Merged) 10jenkins-bot: Experiment Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111318 (https://phabricator.wikimedia.org/T374957) (owner: 10Clare Ming) [20:10:03] (03Merged) 10jenkins-bot: Experiment Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111321 (https://phabricator.wikimedia.org/T374957) (owner: 10Clare Ming) [20:11:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:11:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:11:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:12:46] (03PS2) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:13:00] (03CR) 10Scott French: "Thanks for the reviews, all!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [20:13:06] (03CR) 10Scott French: [C:03+2] P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [20:13:57] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:14:44] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:15:06] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:15:32] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [20:15:58] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [20:18:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) (owner: 10NMW03) [20:20:51] (03PS3) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:22:02] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:22:21] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:22:37] (03CR) 10Herron: [V:03+1] "Thanks for having a look! 
Yes, in fact I looked into this route initially but min-time/max-time supports different formats from tsdb.rete" [puppet] - 10https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) (owner: 10Herron) [20:24:29] (03CR) 10Bking: [C:03+2] cloudelastic: remove cloudelastic100[56] from conftool, add 101[12] [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [20:25:29] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:26:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:28:21] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:28:28] (03PS4) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:29:39] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10460247 (10phaultfinder) [20:30:49] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:30:49] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:31:43] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, 05SUL3: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10460255 (10Tgr) 05Open→03Resolved Working as expected: * https://auth.wikimedia.org/enwiki/wiki/Special:UserLogin, https://auth.wikimedia.org/dewiki/wiki/Special:Use... [20:32:11] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cloudelastic1011.eqiad.wmnet [20:32:16] (03PS5) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:32:47] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cloudelastic1012.eqiad.wmnet [20:33:28] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:35:34] (03PS1) 10Subramanya Sastry: Turn on Parsoid Read Views on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) [20:35:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10460273 (10phaultfinder) [20:38:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:40:07] PROBLEM - ElasticSearch 
health check for shards on 9600 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:40:25] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:40:25] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:40:31] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:41:47] (03PS1) 10Bking: cloudelastic: remove references to cloudelastic hosts before 1007 [puppet] - 10https://gerrit.wikimedia.org/r/1111326 (https://phabricator.wikimedia.org/T380937) [20:42:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111326 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [20:42:33] (03PS2) 10Ryan Kemper: cloudelastic: decom cloudelastic100[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/1111326 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [20:42:38] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: decom cloudelastic100[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/1111326 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [20:43:52] (03PS6) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:44:35] (03CR) 10Urbanecm: "Should now be in production." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [20:44:39] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: 10Michael Große) [20:45:04] (03CR) 10Bking: [C:03+2] cloudelastic: decom cloudelastic100[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/1111326 (https://phabricator.wikimedia.org/T380937) (owner: 10Bking) [20:45:04] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:46:51] (03PS7) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:48:02] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:48:08] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic[1005-1006].eqiad.wmnet [20:48:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:48:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:53:13] (03PS8) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:54:26] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:55:20] (03PS9) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:55:44] (03CR) 10Arlolra: [C:03+1] Turn on Parsoid Read Views on test2wiki (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [20:56:33] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:56:37] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T2100). [21:00:05] fabfur, tgr, tacsipacsi, and Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:45] o/ [21:00:48] i can deploy [21:01:09] o/ I'll add one more config patch in a sec [21:01:18] np! [21:01:52] fabfur: are you around? [21:02:39] tgr: are you a self-deployer? happy to do them for you - up to you [21:03:01] My patch is not urgent, so if you run out of time, it’s okay to delay it. After all, it’s been broken for years without anyone complaining. :) [21:03:08] lol [21:03:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic[1005-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [21:04:10] Well, broken in that people were sent to a soft redirect. But that’s still annoying. 
[21:04:11] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic[1005-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [21:04:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:04:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic[1005-1006].eqiad.wmnet [21:04:23] tgr: shall i start with your first patch? [21:04:48] (03PS3) 10Gergő Tisza: SUL3: Add auth domain to URL tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574) [21:05:39] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1098-1101].eqiad.wmnet [21:05:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10460385 (10phaultfinder) [21:05:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1098-1101].eqiad.wmnet [21:07:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10460390 (10kamila) [21:08:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574) (owner: 10Gergő Tisza) [21:09:18] cjming: thanks! I'll self-deploy, want to do some extended testing. Can wait until all the other patches are deployed. 
[21:09:41] doh - i just started your first patch - sorry [21:09:46] though that one patch doesn't matter much [21:09:56] (03Merged) 10jenkins-bot: SUL3: Add auth domain to URL tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574) (owner: 10Gergő Tisza) [21:10:29] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1099338|SUL3: Add auth domain to URL tests (T380574)]] [21:10:33] T380574: Add SUL3 authentication domain to deploy canary checks - https://phabricator.wikimedia.org/T380574 [21:10:37] in all honesty I'm not sure what it does :) we don't seem to use that file anymore, there is a puppet file that's actually used for URL tests but I figured better to keep this one in sync [21:11:06] tgr: ok - i'll let you handle whatever other patches you add to the queue [21:11:16] thanks! [21:12:00] (03PS1) 10Gergő Tisza: Enable SUL3 on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111330 (https://phabricator.wikimedia.org/T383729) [21:16:33] tgr: for your 1st patch tho -- can it be tested? 
up on mwdebug [21:17:20] !log cjming@deploy2002 cjming, tgr: Backport for [[gerrit:1099338|SUL3: Add auth domain to URL tests (T380574)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:23] T380574: Add SUL3 authentication domain to deploy canary checks - https://phabricator.wikimedia.org/T380574 [21:18:31] cjming: no [21:18:47] i'll just sync then [21:18:55] if it's still in use, scap will run the test automatically, I imagine [21:19:01] !log cjming@deploy2002 cjming, tgr: Continuing with sync [21:19:58] (03PS2) 10Jforrester: Turn on Parsoid Read Views on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [21:20:01] (03CR) 10Jforrester: [C:03+1] Turn on Parsoid Read Views on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [21:20:26] in theory these are URLs pinged during the scap canary check but I think they have been replaced by modules/profile/files/httpbb/appserver/ in puppet [21:20:47] cool - gtk [21:26:54] tacsipacsi: i'll do yours next [21:26:59] (03PS10) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:27:04] Thanks! 
[21:27:07] (03PS2) 10Tacsipacsi: Fix links pointing to m:Help:Export [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 [21:28:10] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:28:29] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099338|SUL3: Add auth domain to URL tests (T380574)]] (duration: 18m 00s) [21:28:33] T380574: Add SUL3 authentication domain to deploy canary checks - https://phabricator.wikimedia.org/T380574 [21:29:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi) [21:30:19] (03Merged) 10jenkins-bot: Fix links pointing to m:Help:Export [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106739 (owner: 10Tacsipacsi) [21:30:45] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1106739|Fix links pointing to m:Help:Export]] [21:35:24] tacsipacsi: up on test servers if you want to verify [21:35:41] !log cjming@deploy2002 tacsipacsi, cjming: Backport for [[gerrit:1106739|Fix links pointing to m:Help:Export]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:36:22] Thanks! Checked the URLs mentioned on Gerrit, and they look good. [21:36:28] cool ! syncing [21:36:31] !log cjming@deploy2002 tacsipacsi, cjming: Continuing with sync [21:37:52] Nemoralis: are you around? [21:38:28] fabfur: are you around? 
[21:39:48] o/ [21:39:53] I had a patch in this window [21:39:58] jouncebot: now [21:39:58] For the next 0 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T2100) [21:40:07] Nemoralis: i'll do your patch next [21:40:36] (03PS4) 10NMW03: Add azwiki to mobile-anon-talk dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) [21:42:14] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10460491 (10wiki_willy) Hi @dcaro - because this was taking so long, I escalated this up to our account team again last week...and they came back tod... [21:43:46] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106739|Fix links pointing to m:Help:Export]] (duration: 13m 00s) [21:44:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) (owner: 10NMW03) [21:44:58] (03Merged) 10jenkins-bot: Add azwiki to mobile-anon-talk dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109694 (https://phabricator.wikimedia.org/T383394) (owner: 10NMW03) [21:45:29] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1109694|Add azwiki to mobile-anon-talk dblist (T383394)]] [21:45:32] T383394: Enable talk for mobile anon users on azwiki - https://phabricator.wikimedia.org/T383394 [21:48:28] !log deployed conftool 4.2.0 fleet-wide as of ~ 20:00 UTC (previously 4.1.0) [21:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:14] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10460515 (10cmooney) >>! 
In T371501#10459488, @dcaro wrote: > Just finished restarting all the osd daemons, all the traffic should now being t... [21:51:56] Nemoralis: on test servers if you can verify - lmk if/when to sync [21:52:33] !log cjming@deploy2002 nmw03, cjming: Backport for [[gerrit:1109694|Add azwiki to mobile-anon-talk dblist (T383394)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:37] T383394: Enable talk for mobile anon users on azwiki - https://phabricator.wikimedia.org/T383394 [21:53:06] sure one sec [21:54:54] cjming: LGTM [21:55:01] great - syncing [21:55:04] !log cjming@deploy2002 nmw03, cjming: Continuing with sync [21:55:45] fabfur: last call [21:55:57] (03PS3) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [21:59:47] (03PS11) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250114T2200) [22:00:59] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:02:03] (03PS1) 10JHathaway: kafka_shipper: when disabled, don't render templates [puppet] - 10https://gerrit.wikimedia.org/r/1111336 [22:02:17] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway) [22:02:24] (03CR) 10CI reject: [V:04-1] kafka_shipper: when disabled, don't render templates [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway) [22:02:32] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109694|Add azwiki to mobile-anon-talk dblist (T383394)]] (duration: 17m 03s) [22:02:36] T383394: Enable talk for mobile anon users on azwiki - 
https://phabricator.wikimedia.org/T383394 [22:02:45] tgr: all yours [22:03:52] thanks cjming! [22:03:59] (03PS12) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:05:10] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:06:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111330 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [22:06:51] (03PS13) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:07:09] (03Merged) 10jenkins-bot: Enable SUL3 on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111330 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [22:07:37] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1111330|Enable SUL3 on test wikis (T383729)]] [22:07:41] T383729: SUL3 Phase 0: Account creation and login on test wikis - https://phabricator.wikimedia.org/T383729 [22:08:04] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:14:01] !log tgr@deploy2002 tgr: Backport for [[gerrit:1111330|Enable SUL3 on test wikis (T383729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:14:01] (03PS2) 10JHathaway: kafka_shipper: when disabled, don't render templates [puppet] - 10https://gerrit.wikimedia.org/r/1111336 [22:14:05] T383729: SUL3 Phase 0: Account creation and login on test wikis - https://phabricator.wikimedia.org/T383729 [22:14:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111336 (owner: 10JHathaway) [22:15:14] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE 
(2025.01.11 - 2025.01.31): decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10460555 (10bking) [22:15:46] (03PS4) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [22:16:29] (03PS1) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [22:18:20] (03PS14) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:18:32] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, 10Observability-Alerting: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350#10460558 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T335350#10456238, @andrea.denisse wrote: > Hi @cmooney,... [22:19:32] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:24:11] !log tgr@deploy2002 Sync cancelled. [22:29:10] (03PS1) 10Andrea Denisse: wmcs: Remove Puppet files for migrated Prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) [22:29:10] (03CR) 10Andrea Denisse: "To be merged once 1111328, and 1111338 are merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [22:31:11] (03PS15) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:31:45] !log removing 8 files for legal compliance [22:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:23] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:33:17] (03PS16) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:34:28] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:37:24] (03PS17) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:38:35] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:43:26] (03PS1) 10TrainBranchBot: Revert "Enable SUL3 on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111341 [22:43:26] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as Ibe92907f535345a3ac1a266a4a86ede4f8d2887f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111330 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [22:44:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111341 (owner: 10TrainBranchBot) [22:44:54] (03Merged) 10jenkins-bot: Revert "Enable SUL3 on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111341 (owner: 10TrainBranchBot) [22:45:21] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1111341|Revert 
"Enable SUL3 on test wikis"]] [22:49:53] !log tgr@deploy2002 tgr, trainbranchbot: Backport for [[gerrit:1111341|Revert "Enable SUL3 on test wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:49:59] !log removing 5 files for legal compliance [22:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:40] !log tgr@deploy2002 tgr, trainbranchbot: Continuing with sync [22:53:44] (03PS1) 10Gergő Tisza: Yet more authentication domain overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111343 (https://phabricator.wikimedia.org/T383729) [22:58:21] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111341|Revert "Enable SUL3 on test wikis"]] (duration: 12m 59s) [23:01:00] !log UTC late deploys done [23:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:32] (03PS1) 10Gergő Tisza: Add entry point names to all entry points under w/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111344 (https://phabricator.wikimedia.org/T383729) [23:14:44] !log removing 2 files for legal compliance [23:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:59] (03PS1) 10Chlod Alejandro: Increase Nuke max age to 90 days (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111350 (https://phabricator.wikimedia.org/T380846) [23:23:58] (03CR) 10Gergő Tisza: "`$wgFavicon` / `$wgAppleTouchIcon` are URLs, the script fetches them and outputs the content. `wmfStaticStreamFile()` expects a disk path." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [23:30:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:54:31] !log removing 2 files for legal compliance [23:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log