[00:00:47] (03CR) 10Krinkle: [C:03+1] MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [00:08:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10392233 (10Jclark-ctr) @elukey the 10g card is copper rj45 and not in use. AOC-ATGC-i2TM. The 10g port is connected... [00:38:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101604 [00:38:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101604 (owner: 10TrainBranchBot) [00:58:12] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1101604 (owner: 10TrainBranchBot) [01:08:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101605 [01:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101605 (owner: 10TrainBranchBot) [01:20:32] 06SRE-OnFire, 10MW-on-K8s, 06serviceops, 13Patch-For-Review, 10Sustainability (Incident Followup): mwscript-k8s creates too many resources - https://phabricator.wikimedia.org/T376795#10392357 (10RLazarus) Yes, naively this would be too many invocations at present. We could easily add the release name to... 
[01:27:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1101605 (owner: 10TrainBranchBot) [01:28:32] (03PS1) 10RLazarus: deployment_server: Add release to mwscript-k8s -ojson output [puppet] - 10https://gerrit.wikimedia.org/r/1101607 (https://phabricator.wikimedia.org/T376795) [01:29:38] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10392384 (10VRiley-WMF) [02:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:05] (03CR) 10Scott French: [C:03+1] deployment_server: Add release to mwscript-k8s -ojson output [puppet] - 10https://gerrit.wikimedia.org/r/1101607 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0300) [03:01:07] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10392465 (10Dzahn) One thing to answer here would be how you would know who actually is WMDE staff. There used to be a public page that lists them but then that stopped... 
[03:06:03] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:09] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0400) [04:16:39] RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [04:40:07] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 12109MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [04:49:29] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T381843 (10phaultfinder) 03NEW 
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0500) [05:01:27] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.4 (duration: 01m 25s) [06:04:48] (03PS1) 10Kevin Bazira: ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) [06:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:15:13] (03PS1) 10Kevin Bazira: ml-services: update article-country deployment in the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101743 (https://phabricator.wikimedia.org/T371897) [06:34:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:34:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:36:11] (03PS2) 10Stevemunene: Enable airflow task pods access to mx server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) [06:38:52] (03CR) 10Stevemunene: Enable airflow task pods access to mx server (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0700) [07:00:05] marostegui, Amir1, 
and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0700). [07:24:09] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [07:24:44] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [07:24:55] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update article-country deployment in the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101743 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [07:26:47] (03PS1) 10Jelto: Rename kubernetes[1051-1054] to wikikube-worker[1076-1079] [puppet] - 10https://gerrit.wikimedia.org/r/1101789 (https://phabricator.wikimedia.org/T377876) [07:32:40] Q: Where is the source code of liveness_probe in the deployment-charts configuration? We want to check how it is functioning. 
[07:38:08] (03CR) 10Muehlenhoff: [C:03+2] netbox::db: Use new helper function [puppet] - 10https://gerrit.wikimedia.org/r/1101497 (owner: 10Muehlenhoff) [07:40:07] (03CR) 10Muehlenhoff: [C:03+2] prometheus/pop: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1100810 (owner: 10Muehlenhoff) [07:44:07] (03PS1) 10Muehlenhoff: profile::analytics::postgresql: Use debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1101791 [07:48:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101791 (owner: 10Muehlenhoff) [07:57:57] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10392591 (10MoritzMuehlenhoff) >>! In T381824#10392465, @Dzahn wrote: > One thing to answer here would be how you would know who actually is WMDE staff. There used to b... [08:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T0800). nyaa~ [08:00:04] gmodena: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:58] * gmodena waves [08:07:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71647 and previous config saved to /var/cache/conftool/dbconfig/20241210-080710-root.json [08:08:45] (03PS1) 10Marostegui: db1159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1101793 (https://phabricator.wikimedia.org/T381550) [08:10:23] (03CR) 10Marostegui: [C:03+2] db1159: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1101793 (https://phabricator.wikimedia.org/T381550) (owner: 10Marostegui) [08:10:40] Amir1 urbanecm anyone around for backport window deployments? [08:10:58] I can deploy 1100417 myself, but I'd like an ack from a responsible adult in case :) [08:13:28] (03CR) 10David Caro: [C:03+2] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) (owner: 10FNegri) [08:14:47] (03PS1) 10Marostegui: instances: Add db1159 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1101794 (https://phabricator.wikimedia.org/T381550) [08:15:07] (03Merged) 10jenkins-bot: WMCS: fix expr in TooManyCloud*Down [alerts] - 10https://gerrit.wikimedia.org/r/1101584 (https://phabricator.wikimedia.org/T381807) (owner: 10FNegri) [08:16:17] (03CR) 10Marostegui: [C:03+2] instances: Add db1159 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1101794 (https://phabricator.wikimedia.org/T381550) (owner: 10Marostegui) [08:16:48] (03CR) 10Brouberol: [C:03+1] "Perfect, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [08:18:39] (03CR) 10Brouberol: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1101791 (owner: 10Muehlenhoff) [08:20:17] jouncebot I can do the deploys today! 
[08:20:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db1159 to dbctl depooled T381550', diff saved to https://phabricator.wikimedia.org/P71648 and previous config saved to /var/cache/conftool/dbconfig/20241210-082020-marostegui.json [08:20:25] T381550: Move db1159 to s5 - https://phabricator.wikimedia.org/T381550 [08:21:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gmodena@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100417 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [08:21:08] (03CR) 10Muehlenhoff: [C:03+2] profile::analytics::postgresql: Use debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1101791 (owner: 10Muehlenhoff) [08:21:49] (03Merged) 10jenkins-bot: EventStreamConfig: add content_history streams. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100417 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [08:22:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71649 and previous config saved to /var/cache/conftool/dbconfig/20241210-082216-root.json [08:22:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 10%: 5', diff saved to https://phabricator.wikimedia.org/P71650 and previous config saved to /var/cache/conftool/dbconfig/20241210-082221-root.json [08:22:32] !log gmodena@deploy2002 Started scap sync-world: Backport for [[gerrit:1100417|EventStreamConfig: add content_history streams. (T381322)]] [08:22:35] T381322: Rename Flink application and streams to match prod conventions - https://phabricator.wikimedia.org/T381322 [08:26:47] !log gmodena@deploy2002 gmodena: Backport for [[gerrit:1100417|EventStreamConfig: add content_history streams. 
(T381322)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:26:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101789 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [08:29:59] (03PS1) 10Muehlenhoff: Extend access for aarora [puppet] - 10https://gerrit.wikimedia.org/r/1101797 [08:31:57] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1051-1054].eqiad.wmnet [08:34:13] jouncebot Amir1 urbanecm tested on mwdebug host. Config changes (two new streams have been added) have been applied as expected. No regression found. I'll continue with sync. [08:34:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1051-1054].eqiad.wmnet [08:34:27] !log gmodena@deploy2002 gmodena: Continuing with sync [08:34:45] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1051-1054] to wikikube-worker[1076-1079] [puppet] - 10https://gerrit.wikimedia.org/r/1101789 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [08:35:55] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101797 (owner: 10Muehlenhoff) [08:36:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10392652 (10elukey) @Jclark-ctr @bking given that the 10g card will never be used (RJ45, copper, etc.) we can go ahead wi... 
[08:36:58] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1051 to wikikube-worker1076 [08:37:17] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:37:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71652 and previous config saved to /var/cache/conftool/dbconfig/20241210-083721-root.json [08:37:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 25%: 5', diff saved to https://phabricator.wikimedia.org/P71653 and previous config saved to /var/cache/conftool/dbconfig/20241210-083726-root.json [08:39:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10392654 (10Marostegui) 05Open→03Resolved @Jhancock.wm The transfer didn't make the host crash. So I am going to start giving it some production traffic. Will reopen the task if... [08:39:25] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:39:47] (03CR) 10Muehlenhoff: [C:03+2] Extend access for aarora [puppet] - 10https://gerrit.wikimedia.org/r/1101797 (owner: 10Muehlenhoff) [08:39:48] !log gmodena@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100417|EventStreamConfig: add content_history streams. 
(T381322)]] (duration: 17m 16s) [08:39:52] T381322: Rename Flink application and streams to match prod conventions - https://phabricator.wikimedia.org/T381322 [08:40:16] (03PS1) 10Marostegui: es2045: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1101799 (https://phabricator.wikimedia.org/T381259) [08:41:07] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1051 to wikikube-worker1076 - jelto@cumin1002" [08:41:19] !log UTC morning backport deploys done [08:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:32] (03CR) 10Marostegui: [C:03+2] es2045: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1101799 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [08:41:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1051 to wikikube-worker1076 - jelto@cumin1002" [08:41:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:43] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1076 [08:42:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1076 [08:42:06] (03CR) 10Elukey: [C:03+1] maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1101461 (owner: 10Muehlenhoff) [08:42:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1051 to wikikube-worker1076 [08:42:54] (03PS3) 10Slyngshede: Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) [08:43:01] (03CR) 10Elukey: [C:03+2] profile::k8s::deployment_server: add config for Kartotherian [puppet] - 
10https://gerrit.wikimedia.org/r/1101483 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [08:43:41] (03Abandoned) 10Elukey: TEST: dump bios changes to be applied [cookbooks] - 10https://gerrit.wikimedia.org/r/1100996 (owner: 10Elukey) [08:43:49] (03Abandoned) 10Elukey: WIP: sre.hosts.provision: skip IPv6 autoconfig disable for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1091601 (owner: 10Elukey) [08:44:07] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1052 to wikikube-worker1077 [08:44:27] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:45:05] (03PS1) 10Marostegui: instances: Add es2045 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1101802 (https://phabricator.wikimedia.org/T381259) [08:46:21] (03CR) 10Marostegui: [C:03+2] instances: Add es2045 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1101802 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [08:47:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes1053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 1%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71654 and previous config saved to /var/cache/conftool/dbconfig/20241210-084743-root.json [08:48:05] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1052 to wikikube-worker1077 - jelto@cumin1002" [08:48:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2045 to dbctl T381259', diff saved to https://phabricator.wikimedia.org/P71655 and previous config saved to /var/cache/conftool/dbconfig/20241210-084844-marostegui.json [08:48:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1052 to wikikube-worker1077 - jelto@cumin1002" [08:48:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:48:49] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [08:48:49] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1077 [08:49:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1077 [08:49:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Change es2024 weight', diff saved to https://phabricator.wikimedia.org/P71656 and previous config saved to /var/cache/conftool/dbconfig/20241210-084932-marostegui.json [08:49:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1052 to wikikube-worker1077 [08:49:58] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1053 to wikikube-worker1078 [08:49:59] !log manual run of docker-system-prune-all on build2001 to free some space [08:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 1%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71657 and previous config saved to /var/cache/conftool/dbconfig/20241210-085006-root.json [08:50:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:52:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71658 and previous config saved to /var/cache/conftool/dbconfig/20241210-085227-root.json [08:52:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 50%: 5', diff saved to https://phabricator.wikimedia.org/P71659 and previous config saved to /var/cache/conftool/dbconfig/20241210-085232-root.json [08:53:47] !log 
jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1053 to wikikube-worker1078 - jelto@cumin1002" [08:54:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1053 to wikikube-worker1078 - jelto@cumin1002" [08:54:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:54:04] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1078 [08:54:10] (03CR) 10DCausse: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1101607 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [08:54:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1078 [08:55:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1053 to wikikube-worker1078 [08:55:07] (03CR) 10Stevemunene: [C:03+2] Enable airflow task pods access to mx server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [08:55:37] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1054 to wikikube-worker1079 [08:55:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:56:26] (03Merged) 10jenkins-bot: Enable airflow task pods access to mx server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101527 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [08:56:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet with reason: Alter table [08:56:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet with reason: Alter table [08:59:36] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1054 to wikikube-worker1079 - jelto@cumin1002" [08:59:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1054 to wikikube-worker1079 - jelto@cumin1002" [08:59:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:59:53] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1079 [09:00:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1079 [09:00:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1054 to wikikube-worker1079 [09:01:18] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1076.eqiad.wmnet wikikube-worker1077.eqiad.wmnet wikikube-worker1078.eqiad.wmnet wikikube-worker1079.eqiad.wmnet on all recursors [09:01:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1076.eqiad.wmnet wikikube-worker1077.eqiad.wmnet wikikube-worker1078.eqiad.wmnet wikikube-worker1079.eqiad.wmnet on all recursors [09:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71660 and previous config saved to /var/cache/conftool/dbconfig/20241210-090248-root.json [09:04:12] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1076.eqiad.wmnet with OS bookworm [09:04:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1077.eqiad.wmnet with OS bookworm [09:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 
'es2045 (re)pooling @ 5%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71661 and previous config saved to /var/cache/conftool/dbconfig/20241210-090511-root.json [09:05:17] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1078.eqiad.wmnet with OS bookworm [09:05:39] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1079.eqiad.wmnet with OS bookworm [09:06:51] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:07:13] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:07:31] (03PS4) 10Slyngshede: Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) [09:07:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71662 and previous config saved to /var/cache/conftool/dbconfig/20241210-090732-root.json [09:07:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 75%: 5', diff saved to https://phabricator.wikimedia.org/P71663 and previous config saved to /var/cache/conftool/dbconfig/20241210-090738-root.json [09:14:35] (03PS5) 10Slyngshede: Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) [09:15:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:16:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71664 and previous config saved to /var/cache/conftool/dbconfig/20241210-091754-root.json [09:19:50] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country deployment in the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101743 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:20:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71665 and previous config saved to /var/cache/conftool/dbconfig/20241210-092016-root.json [09:20:56] (03Merged) 10jenkins-bot: ml-services: update article-country deployment in the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101743 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:22:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1159 (re)pooling @ 100%: 5', diff saved to https://phabricator.wikimedia.org/P71666 and previous config saved to /var/cache/conftool/dbconfig/20241210-092243-root.json [09:23:30] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1076.eqiad.wmnet with reason: host reimage [09:24:00] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1077.eqiad.wmnet with reason: host reimage [09:24:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1078.eqiad.wmnet with reason: host reimage [09:24:44] (03PS2) 10Kevin Bazira: ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) [09:24:58] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1079.eqiad.wmnet with reason: host reimage 
[09:26:43] (03CR) 10Atieno: [C:03+1] "Confirming it's the one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [09:27:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1076.eqiad.wmnet with reason: host reimage [09:27:14] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:29:08] (03Merged) 10jenkins-bot: ml-services: update article-country deployment in the experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101741 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [09:29:30] 06SRE, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10392790 (10Marostegui) @xcollazo I like @BTullis's idea. @BTullis do you think you could find some time to explore this idea? I am interested in knowing how the... [09:30:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1077.eqiad.wmnet with reason: host reimage [09:32:20] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:33:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71667 and previous config saved to /var/cache/conftool/dbconfig/20241210-093259-root.json [09:34:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1079.eqiad.wmnet with reason: host reimage [09:34:25] !log rebalance Ganeti cluster in codfw/c following server refresh T376594 [09:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:29] T376594: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594 [09:35:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71668 and previous config saved to /var/cache/conftool/dbconfig/20241210-093522-root.json [09:36:03] (03CR) 10Brouberol: dse-k8s-services: introduce Blunderbuss config (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [09:36:03] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:37:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1078.eqiad.wmnet with reason: host reimage [09:41:28] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 [09:41:43] !log joal@deploy2002 Started deploy [analytics/refinery@0ffc330]: Analytics backfill train [analytics/refinery@0ffc3306] [09:43:43] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[09:43:47] !log joal@deploy2002 Finished deploy [analytics/refinery@0ffc330]: Analytics backfill train [analytics/refinery@0ffc3306] (duration: 02m 04s) [09:44:05] !log joal@deploy2002 Started deploy [analytics/refinery@0ffc330] (thin): Analytics backfill train - THIN [analytics/refinery@0ffc3306] [09:44:06] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:44:22] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:44:36] !log joal@deploy2002 Finished deploy [analytics/refinery@0ffc330] (thin): Analytics backfill train - THIN [analytics/refinery@0ffc3306] (duration: 00m 31s) [09:44:50] !log joal@deploy2002 Started deploy [analytics/refinery@0ffc330] (hadoop-test): Analytics backfill train - TEST [analytics/refinery@0ffc3306] [09:45:17] !log joal@deploy2002 Finished deploy [analytics/refinery@0ffc330] (hadoop-test): Analytics backfill train - TEST [analytics/refinery@0ffc3306] (duration: 00m 26s) [09:46:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1076.eqiad.wmnet with OS bookworm [09:47:03] (03Abandoned) 10Muehlenhoff: Extend access request email template [software/bitu] - 10https://gerrit.wikimedia.org/r/1100133 (owner: 10Muehlenhoff) [09:48:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71669 and previous config saved to /var/cache/conftool/dbconfig/20241210-094805-root.json [09:48:51] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 (duration: 07m 22s) [09:48:53] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 [09:49:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1077.eqiad.wmnet with OS bookworm [09:50:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71670 and previous config saved to /var/cache/conftool/dbconfig/20241210-095027-root.json [09:53:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1079.eqiad.wmnet with OS bookworm [09:55:06] (03CR) 10Hamish: [C:03+1] "I've tested a similar patch on my private server, LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [09:56:30] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:31] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 (duration: 07m 37s) [09:56:34] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 [09:56:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1078.eqiad.wmnet with OS bookworm [09:57:12] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851 (10Ammarpad) 03NEW [09:57:41] !log homer 'lsw1-f3-eqiad*' commit 'T377876' , homer 'lsw1-e3-eqiad*' commit 'T377876' [09:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:45] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [09:59:00] (03CR) 10Slyngshede: "I forgot about this one." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) (owner: 10Slyngshede) [10:00:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1076-1079].eqiad.wmnet [10:00:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1076-1079].eqiad.wmnet [10:01:04] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10392875 (10Jelto) [10:03:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71672 and previous config saved to /var/cache/conftool/dbconfig/20241210-100310-root.json [10:04:19] (03PS1) 10Jelto: Rename kubernetes[1055-1058] to wikikube-worker[1080-1083] [puppet] - 10https://gerrit.wikimedia.org/r/1101818 (https://phabricator.wikimedia.org/T377876) [10:05:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71673 and previous config saved to /var/cache/conftool/dbconfig/20241210-100532-root.json [10:05:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:16] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10392897 (10Ammarpad) [10:12:09] (03CR) 10Slyngshede: Updated notification handling (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:12:13] (03CR) 10Slyngshede: [C:03+2] Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:12:32] (03CR) 10Muehlenhoff: [C:03+1] "The patch looks good, but I'm wondering about the context, all logins go via CAS,so this is only relevant for non-WMF deployments, right?" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) (owner: 10Slyngshede) [10:17:26] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@7428c06]: Backfill webrequest actor metrics 2024 12 (duration: 20m 51s) [10:17:43] (03Merged) 10jenkins-bot: Updated notification handling [software/bitu] - 10https://gerrit.wikimedia.org/r/1100388 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:18:14] (03PS1) 10Slyngshede: P:idm add notification settings to test system [puppet] - 10https://gerrit.wikimedia.org/r/1101820 (https://phabricator.wikimedia.org/T381075) [10:18:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71674 and previous config saved to /var/cache/conftool/dbconfig/20241210-101815-root.json [10:19:13] (03CR) 10Slyngshede: "This is for people who are not signed in, but goes to view the account block/unblock logs, which are public and have a menu." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) (owner: 10Slyngshede) [10:19:16] (03CR) 10Slyngshede: [C:03+2] Only show sign in link for anonymous users [software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) (owner: 10Slyngshede) [10:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71675 and previous config saved to /var/cache/conftool/dbconfig/20241210-102038-root.json [10:21:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101820 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:22:23] (03CR) 10Slyngshede: [C:03+2] P:idm add notification settings to test system [puppet] - 10https://gerrit.wikimedia.org/r/1101820 (https://phabricator.wikimedia.org/T381075) (owner: 10Slyngshede) [10:23:39] (03Merged) 10jenkins-bot: Only show sign in link for anonymous users [software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) (owner: 10Slyngshede) [10:29:16] (03PS1) 10Slyngshede: P:idm fix setting configuration error. [puppet] - 10https://gerrit.wikimedia.org/r/1101823 [10:30:14] (03CR) 10Slyngshede: [C:03+2] P:idm fix setting configuration error. [puppet] - 10https://gerrit.wikimedia.org/r/1101823 (owner: 10Slyngshede) [10:38:08] (03CR) 10Mvolz: "Yeah, sorry for the review spam! PipelineBot automatically adds the reviewers from the merged change." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1099658 (owner: 10PipelineBot) [10:50:25] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[1055-1058] to wikikube-worker[1080-1083] [puppet] - 10https://gerrit.wikimedia.org/r/1101818 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:53:59] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1055-1058].eqiad.wmnet [10:56:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1055-1058].eqiad.wmnet [11:00:00] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1055-1058] to wikikube-worker[1080-1083] [puppet] - 10https://gerrit.wikimedia.org/r/1101818 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1100) [11:00:54] (03PS1) 10Wangombe: Event Logging: Update streamName and schemaId [extensions/Translate] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101830 (https://phabricator.wikimedia.org/T364460) [11:01:57] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1055 to wikikube-worker1080 [11:02:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:02:56] (03PS1) 10Muehlenhoff: Remove obsolete reference to wikitech password changes [software/bitu] - 10https://gerrit.wikimedia.org/r/1101831 [11:03:21] (03PS4) 10Máté Szabó: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) [11:03:28] (03CR) 10Máté Szabó: Prep pilot wiki config for IRS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099213 (https://phabricator.wikimedia.org/T374105) (owner: 10Máté Szabó) [11:03:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" 
[extensions/Translate] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101830 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [11:04:32] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:33] (03PS1) 10Muehlenhoff: Polish password reset statement a little [software/bitu] - 10https://gerrit.wikimedia.org/r/1101832 [11:05:49] !log Deploying no-op cfssl-issuer admin_ng change - 1101455 [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:14] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:07:24] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1055 to wikikube-worker1080 - jelto@cumin1002" [11:08:15] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:08:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1055 to wikikube-worker1080 - jelto@cumin1002" [11:08:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:08:21] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1080 [11:08:25] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:08:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1080 [11:09:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1055 to wikikube-worker1080 [11:09:13] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:09:45] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:09:59] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:10:04] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1056 to wikikube-worker1081 [11:10:25] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:10:33] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:10:52] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:11:02] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:11:14] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:11:51] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:12:10] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:12:27] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:12:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubernetes1057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:12:44] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:12:52] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:13:02] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:14:18] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1056 to wikikube-worker1081 - jelto@cumin1002" [11:14:19] !log Done deploying no-op cfssl-issuer admin_ng change - 1101455 [11:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1056 to wikikube-worker1081 - jelto@cumin1002" [11:14:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:14:34] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1081 [11:14:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1081 [11:15:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1056 to wikikube-worker1081 [11:15:50] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1057 to wikikube-worker1082 [11:16:10] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:18:26] (03PS1) 10Physikerwelt: Add new properties for math popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) [11:18:40] (03PS2) 10Physikerwelt: Add new properties for math popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) [11:19:48] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1057 to wikikube-worker1082 - jelto@cumin1002" [11:19:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) (owner: 10Physikerwelt) [11:20:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1057 to wikikube-worker1082 - jelto@cumin1002" [11:20:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:20:12] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1082 [11:21:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1082 [11:22:01] (03PS1) 10Muehlenhoff: puppetdb: Use debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1101835 [11:22:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1057 to wikikube-worker1082 [11:23:40] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1058 to wikikube-worker1083 [11:24:01] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:25:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101835 (owner: 10Muehlenhoff) [11:26:57] (03CR) 10Physikerwelt: "I scheduled this for tonight's deployment window. 
However, I understand that this can be merged at any time, as this only affects wikipedi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) (owner: 10Physikerwelt) [11:27:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1058 to wikikube-worker1083 - jelto@cumin1002" [11:27:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1058 to wikikube-worker1083 - jelto@cumin1002" [11:27:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:27:49] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1083 [11:28:50] (03CR) 10Bartosz Dziewoński: "Do we need the `ENV:RW_PROTO` thing? Can it just be `https`?" [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [11:28:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1083 [11:29:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1058 to wikikube-worker1083 [11:29:48] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1080.eqiad.wmnet wikikube-worker1081.eqiad.wmnet wikikube-worker1082.eqiad.wmnet wikikube-worker1083.eqiad.wmnet on all recursors [11:29:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1080.eqiad.wmnet wikikube-worker1081.eqiad.wmnet wikikube-worker1082.eqiad.wmnet wikikube-worker1083.eqiad.wmnet on all recursors [11:32:41] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1080.eqiad.wmnet with OS bookworm [11:33:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1081.eqiad.wmnet with OS bookworm [11:33:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1082.eqiad.wmnet with OS bookworm [11:33:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1083.eqiad.wmnet with OS bookworm [11:35:10] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101831 (owner: 10Muehlenhoff) [11:35:12] (03CR) 10Slyngshede: [C:03+2] Remove obsolete reference to wikitech password changes [software/bitu] - 10https://gerrit.wikimedia.org/r/1101831 (owner: 10Muehlenhoff) [11:35:46] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101832 (owner: 10Muehlenhoff) [11:35:50] (03CR) 10Slyngshede: [C:03+2] Polish password reset statement a little [software/bitu] - 10https://gerrit.wikimedia.org/r/1101832 (owner: 10Muehlenhoff) [11:38:12] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10393186 (10VRiley-WMF) [11:41:08] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10393191 (10VRiley-WMF) Hey @andrew I was able to do the first unit, however, when I was running the script on the other 2 devices, it s... 
[11:42:38] (03Merged) 10jenkins-bot: Remove obsolete reference to wikitech password changes [software/bitu] - 10https://gerrit.wikimedia.org/r/1101831 (owner: 10Muehlenhoff) [11:42:38] (03Merged) 10jenkins-bot: Polish password reset statement a little [software/bitu] - 10https://gerrit.wikimedia.org/r/1101832 (owner: 10Muehlenhoff) [11:45:12] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101839 [11:46:43] (03PS2) 10Samtar: IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101545 (https://phabricator.wikimedia.org/T377121) [11:47:06] jouncebot: nowandnext [11:47:06] For the next 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1100) [11:47:06] In 1 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1300) [11:49:25] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1082.eqiad.wmnet with reason: host reimage [11:49:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1083.eqiad.wmnet with reason: host reimage [11:50:23] I intend to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1101545 in a moment - any issues? 
:) [11:51:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage [11:52:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101545 (https://phabricator.wikimedia.org/T377121) (owner: 10Samtar) [11:53:08] (03Merged) 10jenkins-bot: IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101545 (https://phabricator.wikimedia.org/T377121) (owner: 10Samtar) [11:53:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1082.eqiad.wmnet with reason: host reimage [11:53:29] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1101545|IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki (T377121)]] [11:53:32] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [11:53:34] (03CR) 10Abijeet Patro: [C:03+1] Event Logging: Update streamName and schemaId [extensions/Translate] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101830 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [11:56:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage [11:56:34] (03PS8) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [11:58:32] !log samtar@deploy2002 samtar: Backport for [[gerrit:1101545|IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:58:35] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [11:58:37] * TheresNoTime testing.. 
^ [12:00:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1083.eqiad.wmnet with reason: host reimage [12:00:26] RECOVERY - MariaDB Replica SQL: s2 on dbstore1007 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:02:31] !log samtar@deploy2002 samtar: Continuing with sync [12:07:35] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101545|IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki (T377121)]] (duration: 14m 06s) [12:07:39] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [12:07:53] (03PS1) 10Phuedx: Beta Cluster: Enable MetricsPlatform extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101840 (https://phabricator.wikimedia.org/T381849) [12:10:59] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10393376 (10VRiley-WMF) Making a note that we have very little room for 10 gig servers in rows A-D. However, we have more room in E and F. As long as they are not in the same... [12:11:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1082.eqiad.wmnet with OS bookworm [12:13:56] (03CR) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [12:15:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1080.eqiad.wmnet with OS bookworm [12:19:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1083.eqiad.wmnet with OS bookworm [12:36:08] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [12:39:18] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:54] 06SRE, 06Infrastructure-Foundations, 10netops: Manage VRRP priority from Netbox - https://phabricator.wikimedia.org/T381873 (10cmooney) 03NEW p:05Triage→03Low [12:51:52] (03PS1) 10Gmodena: data-engineering: add alerts for dumps2 flink app. [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) [12:52:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [12:53:05] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1081.eqiad.wmnet with OS bookworm [12:53:07] (03CR) 10CI reject: [V:04-1] data-engineering: add alerts for dumps2 flink app. 
[alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [12:53:43] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [12:54:12] (03CR) 10Gmodena: data-engineering: add alerts for dumps2 flink app. (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: 10Gmodena) [12:55:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1081.eqiad.wmnet with OS bookworm [12:57:09] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:58:56] (03PS1) 10Slyngshede: Show expiry date for password reset link [software/bitu] - 10https://gerrit.wikimedia.org/r/1101851 [12:58:59] (03CR) 10Muehlenhoff: [C:03+2] maps::postgresql_common: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1101461 (owner: 10Muehlenhoff) [12:59:14] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1300) [13:00:12] !log klausman@cumin1002 START - Cookbook sre.hosts.downtime for 7 
days, 0:00:00 on ml-lab1002.eqiad.wmnet with reason: Moving to analytics network [13:00:12] (03PS2) 10Slyngshede: Show expiry date for password reset link [software/bitu] - 10https://gerrit.wikimedia.org/r/1101851 [13:00:26] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ml-lab1002.eqiad.wmnet with reason: Moving to analytics network [13:01:52] !log klausman@cumin1002 START - Cookbook sre.hosts.decommission for hosts ml-lab1002.eqiad.wmnet [13:03:43] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:07:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:10:21] (03CR) 10Michael Große: [C:03+1] "Currently, `WMF_MAINTENANCE_OFFLINE` is set in `mwscript.py`. 
I don't understand this part well enough to say whether that is also execute" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: 10Cwhite) [13:13:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101851 (owner: 10Slyngshede) [13:18:31] (03CR) 10Slyngshede: [C:03+2] Show expiry date for password reset link [software/bitu] - 10https://gerrit.wikimedia.org/r/1101851 (owner: 10Slyngshede) [13:19:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [13:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [13:22:45] (03PS1) 10Muehlenhoff: Clarify access request text [software/bitu] - 10https://gerrit.wikimedia.org/r/1101857 [13:24:53] (03Merged) 10jenkins-bot: Show expiry date for password reset link [software/bitu] - 10https://gerrit.wikimedia.org/r/1101851 (owner: 10Slyngshede) [13:26:00] PROBLEM - Host poolcounter2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:08] PROBLEM - Host irc2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:22] PROBLEM - Host config-master2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:26] PROBLEM - Host crm2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:52] PROBLEM - ganeti-noded running on ganeti2027 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:26:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes 
(http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:09] FIRING: [2x] ProbeDown: Service irc2003:6667 has failed probes (tcp_ircstream_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2003:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:24] checking [13:27:36] I'm powercycling ganeti2027 [13:28:35] thx moritzm, thumbor paged and I'm not sure that's related? [13:29:05] Bet thumbor has failed probes because poolcounter2005 is down [13:29:07] ganeti2027 is one of the ganeti nodes still running Bullseye and these sometimes hit a DRBD bug T348730 [13:29:08] T348730: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye - https://phabricator.wikimedia.org/T348730 [13:29:14] !incidents [13:29:15] 5533 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:29:15] 5530 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [13:29:19] yeah, it hardcodes poolcounter2005 [13:29:53] hah [13:29:55] ganeti2027 is rebooting and in POST stage, should be back in a minute [13:30:04] PROBLEM - Host ganeti2027 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:42] FIRING: [2x] JobUnavailable: Reduced availability for job ircstream in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:52] RECOVERY - Host ganeti2027 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [13:31:54] RECOVERY - ganeti-noded running on ganeti2027 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded 
https://wikitech.wikimedia.org/wiki/Ganeti [13:32:09] FIRING: [3x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:44] (03PS1) 10Cathal Mooney: Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) [13:33:58] (03PS2) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101092 [13:34:14] FIRING: [3x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:35:26] RECOVERY - Host config-master2001 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [13:35:40] RECOVERY - Host irc2003 is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms [13:35:40] RECOVERY - Host poolcounter2005 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [13:35:56] RECOVERY - Host crm2001 is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms [13:36:26] 06SRE, 06Infrastructure-Foundations: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye - https://phabricator.wikimedia.org/T348730#10393651 (10MoritzMuehlenhoff) Happened again on ganeti2027 today. 
[13:36:47] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101092 (owner: 10PipelineBot) [13:36:57] we're down to like 50% bullseye nodes, the remaining ones will be reimaged in January [13:36:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:09] RESOLVED: [4x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:18] ack thx [13:38:09] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101092 (owner: 10PipelineBot) [13:40:23] (03PS1) 10Muehlenhoff: Configure new maps nodes with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1101864 (https://phabricator.wikimedia.org/T381565) [13:40:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job ircstream in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:50:05] (03PS1) 10أنون: [enwikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) [13:54:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [13:56:14] (03CR) 10أنون: [C:03+1] [enwikinews] & [plwikinews]: Upgrade license to CC BY 4.0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1400). [14:00:05] Daimona, wangombe_g, and lolekek: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:33] Hello there [14:00:39] o/ [14:00:45] o/ [14:02:11] I’m slightly busy right now but can probably deploy soon [14:02:59] Mine is a beta change BTW, so fire and forget [14:03:16] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Comm Error: backplane 0 when reimaging wikikube-worker1081 - https://phabricator.wikimedia.org/T381878 (10Jelto) 03NEW [14:03:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101596 (https://phabricator.wikimedia.org/T380077) (owner: 10Daimona Eaytoy) [14:03:59] yeah, let’s start with that one [14:04:29] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Comm Error: backplane 0 when reimaging wikikube-worker1081 - https://phabricator.wikimedia.org/T381878#10393729 (10Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it): ` c... 
[14:04:38] (03Merged) 10jenkins-bot: beta: Enable $wgCampaignEventsEnableEventWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101596 (https://phabricator.wikimedia.org/T380077) (owner: 10Daimona Eaytoy) [14:05:44] (03PS1) 10Samtar: Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101871 [14:06:09] (03CR) 10Slyngshede: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101857 (owner: 10Muehlenhoff) [14:06:11] (03CR) 10Slyngshede: [C:03+2] Clarify access request text [software/bitu] - 10https://gerrit.wikimedia.org/r/1101857 (owner: 10Muehlenhoff) [14:09:08] (03CR) 10Elukey: [C:03+1] "LGTM! I have limited understanding of this plugin but the logic looks good! And it is impressive how fast it runs, we'll surely benefit of" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [14:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:14] (03Merged) 10jenkins-bot: Clarify access request text [software/bitu] - 10https://gerrit.wikimedia.org/r/1101857 (owner: 10Muehlenhoff) [14:14:12] Thanks, Lucas! 
[14:14:51] (03PS1) 10DDesouza: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) [14:15:27] (03CR) 10CI reject: [V:04-1] Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [14:15:54] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1081.eqiad.wmnet with OS bookworm [14:16:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [14:17:53] !log homer 'lsw1-f3-eqiad*' commit 'T377876' , homer 'cr*eqiad*' commit 'T377876' [14:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:58] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [14:18:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:18:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:18:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T381532)', diff saved to https://phabricator.wikimedia.org/P71678 and previous config saved to /var/cache/conftool/dbconfig/20241210-141820-marostegui.json [14:18:24] T381532: Fix AntiSpoof database schema drifts in production - https://phabricator.wikimedia.org/T381532 [14:18:45] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1101588 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [14:19:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Translate] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101830 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [14:20:58] (03CR) 10Ottomata: dse-k8s-services: rename mw-dumps helmfiles. (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:21:05] (03PS2) 10Samtar: Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101871 [14:23:06] (03PS1) 10Elukey: Add postgresql maps replica config to k8s' external services [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) [14:24:27] TheresNoTime: zuul says 14mins left for the current backport, if you want to roll out your config change before then… [14:24:36] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381881 (10phaultfinder) 03NEW [14:25:15] Lucas_WMDE: likely will, just double-checking I should do it :D [14:26:07] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4655/co" [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:26:38] ok ^^ [14:27:43] Lucas_WMDE: going to do it now [14:27:52] alright, I’ll Ctrl+C my scap backport [14:27:57] (but let gate-and-submit run through) [14:28:03] done [14:28:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101871 (owner: 10Samtar) [14:28:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 
db1169.eqiad.wmnet with reason: Maintenance [14:28:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:28:49] (03Merged) 10jenkins-bot: Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101871 (owner: 10Samtar) [14:29:00] I would like to cancel my patch for now [14:29:06] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1101871|Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki"]] [14:29:07] !log revert 1101545 for T377121 [14:29:07] Apologies for the inconvenience [14:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:12] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [14:29:39] lolekek: no problem – you can just remove it from the deployment calendar [14:29:53] (or I can do it if you prefer) [14:31:06] !log klausman@cumin1002 START - Cookbook sre.dns.netbox [14:31:51] (03CR) 10Bking: [C:03+2] wdqs1025: enable as wdqs-internal-main host [puppet] - 10https://gerrit.wikimedia.org/r/1101588 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [14:32:20] !log samtar@deploy2002 samtar: Backport for [[gerrit:1101871|Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:34] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1080,1082-1083].eqiad.wmnet [14:32:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1080,1082-1083].eqiad.wmnet [14:32:36] * TheresNoTime testing ^ [14:32:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:32:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:32:58] !log samtar@deploy2002 samtar: Continuing with sync [14:33:25] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10393828 (10Jelto) [14:33:55] (03CR) 10Gergő Tisza: "Everything else is using the env variable. Not sure if that's intentional or just leftover from the time when we were mixed-protocol. Mayb" [puppet] - 10https://gerrit.wikimedia.org/r/1101462 (https://phabricator.wikimedia.org/T381625) (owner: 10Gergő Tisza) [14:35:26] (03CR) 10Brouberol: Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:35] (03Merged) 10jenkins-bot: Event Logging: Update streamName and schemaId [extensions/Translate] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101830 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [14:39:22] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101871|Revert "IS/IS-l: wgUseCodexSpecialBlock for beta, prod test.wiki"]] (duration: 10m 15s) [14:39:37] Lucas_WMDE: all yours [14:39:47] thanks! [14:40:00] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1101830|Event Logging: Update streamName and schemaId (T364460)]] [14:40:03] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [14:40:22] (03CR) 10Ottomata: dse-k8s-services: rename mw-dumps helmfiles. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:42:24] (03CR) 10Elukey: [V:03+1] Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:42:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:42:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:44:30] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, wangombe: Backport for [[gerrit:1101830|Event Logging: Update streamName and schemaId (T364460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:49] wangombe_g: please test on mwdebug :) [14:46:21] (03CR) 10Brouberol: Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:46:26] !log klausman@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ml-lab1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1002" [14:47:37] !log klausman@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ml-lab1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1002" [14:47:38] !log klausman@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:39] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ml-lab1002.eqiad.wmnet [14:49:34] (03CR) 10Brouberol: dse-k8s-services: rename mw-dumps helmfiles. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [14:49:58] wangombe_g: are you still there? [14:49:59] (03CR) 10JHathaway: [C:03+1] puppetdb: Use debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1101835 (owner: 10Muehlenhoff) [14:51:30] (03PS2) 10Elukey: Add postgresql maps replica config to k8s' external services [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) [14:51:59] lolekek: I’ve removed it now https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2253259 [14:53:52] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4656/co" [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:53:56] testing... [14:54:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 820.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381881#10393910 (10phaultfinder) [14:55:55] (03CR) 10Elukey: [V:03+1] Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [14:59:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 812.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:00:02] My patch looks good. [15:00:06] (03CR) 10JMeybohm: [C:03+1] "Don't forget to add the service/users to `deployment_server.yaml`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:00:15] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, wangombe: Continuing with sync [15:00:21] alright, thanks! [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:58] (03CR) 10Kamila Součková: [C:03+1] mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:03:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71681 and previous config saved to /var/cache/conftool/dbconfig/20241210-150300-root.json [15:05:09] 10ops-codfw, 06SRE, 06DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10393948 (10Jhancock.wm) p:05Medium→03Low [15:05:40] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101830|Event Logging: Update streamName and schemaId (T364460)]] (duration: 25m 40s) [15:05:44] T364460: Implement the instrumentation to track usage of MinT in the Translate extension - https://phabricator.wikimedia.org/T364460 [15:06:22] !log UTC afternoon backport+config window done [15:06:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:25] FIRING: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:00] RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:11:21] (03CR) 10Elukey: "already done thanks! <3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:11:36] (03PS1) 10Hnowlan: kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) [15:11:55] (03PS1) 10Bking: wdqs1025: remove unneeded host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1101888 (https://phabricator.wikimedia.org/T376150) [15:12:11] (03PS1) 10Hnowlan: mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) [15:12:38] (03PS2) 10Hnowlan: kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) [15:13:19] (03CR) 10CI reject: [V:04-1] kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [15:13:53] (03PS4) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098936 [15:13:56] (03PS3) 10Hnowlan: kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) [15:14:26] (03PS3) 
10Elukey: Add postgresql maps replica config to k8s' external services [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) [15:15:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:15:07] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [15:15:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1025.eqiad.wmnet with reason: T376150 [15:16:50] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4657/co" [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:17:47] (03CR) 10Elukey: [V:03+1] Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:18:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71682 and previous config saved to /var/cache/conftool/dbconfig/20241210-151805-root.json [15:23:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:25:26] (03CR) 10JMeybohm: services: add helmfile config for Kartotherian (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:28:09] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10394045 (10Jhancock.wm) @Jelto heads up, these are showing up in a netbox report. 
>Device is Active in Netbox but is missing from PuppetDB (should be ('decommis... [15:30:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10394062 (10Jhancock.wm) @MoritzMuehlenhoff heads up, ganeti1009 is triggering an alert in netbox. > Device is in PuppetDB but is D... [15:30:54] (03PS1) 10Slyngshede: Inform users that their permission request have been approved/rejected [software/bitu] - 10https://gerrit.wikimedia.org/r/1101894 [15:32:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [15:33:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71683 and previous config saved to /var/cache/conftool/dbconfig/20241210-153311-root.json [15:33:13] (03CR) 10Kamila Součková: [C:03+1] kubernetes: add mw-videoscaler to scap deployments [puppet] - 10https://gerrit.wikimedia.org/r/1101887 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [15:33:55] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10394076 (10MoritzMuehlenhoff) [15:34:04] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10394077 (10Andrew) Timing is flexible although I'd like to do a graceful shutdown and check after the fact. Can this wait until next wee... [15:34:13] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10394079 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All resolved. 
[15:35:52] !log installing imagemagick security updates [15:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:59] (03CR) 10Elukey: [V:03+1 C:03+2] Add postgresql maps replica config to k8s' external services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101876 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [15:37:36] (03CR) 10Hnowlan: [C:03+2] mediawiki: add multi-job support to mercurius (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:39:40] (03Merged) 10jenkins-bot: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:42:33] (03PS1) 10Herron: pyrra: onboard liftwing api ng latency/availability [puppet] - 10https://gerrit.wikimedia.org/r/1101896 (https://phabricator.wikimedia.org/T302995) [15:44:28] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891 (10MatthewVernon) 03NEW [15:44:52] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T381843#10394134 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated mgmt cable. pings [15:45:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:45:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:46:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10394138 (10Jhancock.wm) [15:47:40] (03PS6) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [15:47:40] (03PS2) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) [15:47:40] (03PS2) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [15:47:41] (03PS1) 10Elukey: services: use external_services for maps read replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101897 [15:48:13] !log installing usb.ids updates from Bullseye point release [15:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71684 and previous config saved to /var/cache/conftool/dbconfig/20241210-154816-root.json [15:48:50] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T380479#10394157 (10Jhancock.wm) It can wait until you're back from offsite. Enjoy =) [15:51:13] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10394163 (10MoritzMuehlenhoff) [15:52:08] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10394165 (10Jelto) >>! In T381504#10394045, @Jhancock.wm wrote: > @Jelto heads up, these are showing up in a netbox report. 
>>Device is Active in Netbox but is m... [15:52:33] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:52:39] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:53:02] (03PS1) 10Clare Ming: Remove extraneous config for Metrics Platform instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101898 (https://phabricator.wikimedia.org/T356939) [15:53:23] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [15:53:28] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [15:58:39] jouncebot: nowandnext [15:58:39] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [15:58:39] In 0 hour(s) and 1 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1600) [15:58:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101840 (https://phabricator.wikimedia.org/T381849) (owner: 10Phuedx) [15:59:43] (03CR) 10Andrea Denisse: "Adding Scott as reviewer since he's On Clinic Duty this week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:00:04] eoghan, jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1600). Please do the needful. 
[16:00:33] (03PS2) 10Elukey: services: use external_services for maps read replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101897 [16:00:33] (03PS7) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [16:00:33] (03PS3) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) [16:00:34] (03PS3) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [16:00:35] (03PS2) 10Herron: pyrra: onboard liftwing api ng latency/availability [puppet] - 10https://gerrit.wikimedia.org/r/1101896 (https://phabricator.wikimedia.org/T302995) [16:00:36] (03CR) 10Herron: [C:03+2] "self merge for onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101896 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [16:03:04] (03CR) 10SBassett: [C:03+1] Add Atieno's public key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:03:16] is anyone planning to deploy anything in the collab window? 
[16:03:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71685 and previous config saved to /var/cache/conftool/dbconfig/20241210-160322-root.json [16:04:10] (03PS3) 10Elukey: services: use external_services for maps read replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101897 [16:04:10] (03PS8) 10Elukey: charts: Add kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101452 (https://phabricator.wikimedia.org/T216826) [16:04:10] (03PS4) 10Elukey: admin_ng: add the kartotherian namespace on Wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101487 (https://phabricator.wikimedia.org/T216826) [16:04:11] (03PS4) 10Elukey: services: add helmfile config for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101488 (https://phabricator.wikimedia.org/T216826) [16:06:23] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti1009.eqiad.wmnet [16:06:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti1009.eqiad.wmnet [16:06:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10394193 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: ganeti1009.eqiad.wmnet [16:07:02] !log manually clean out ganeti1009 from puppetdb, decom cookbook got interrupted T381652 [16:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:05] T381652: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652 [16:07:56] (03CR) 10Kamila Součková: [C:03+1] mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 
(https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [16:08:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10394216 (10MoritzMuehlenhoff) >>! In T381652#10394062, @Jhancock.wm wrote: > @MoritzMuehlenhoff heads up, ganeti1009 is triggering... [16:08:31] !log klausman@cumin1002 START - Cookbook sre.dns.netbox [16:08:55] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/admin 'sync'. [16:09:21] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [16:09:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on phab1004.eqiad.wmnet with reason: nftables [16:09:29] !log phabricator production host needs a maintenance reboot - expect short downtime [16:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:40] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/admin 'sync'. [16:09:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on phab1004.eqiad.wmnet with reason: nftables [16:09:48] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [16:09:57] (03CR) 10PleaseStand: "I notice that the new key was added at the end, below a few keys that seem to be marked as historic or no longer in use, in that a range o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:10:30] (03CR) 10Muehlenhoff: [C:03+2] puppetdb: Use debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1101835 (owner: 10Muehlenhoff) [16:12:11] (03CR) 10Elukey: "Hey folks I am trying to simplify the config, and I noticed that we use the maps masters for read traffic. 
I created read-replicas-only la" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101897 (owner: 10Elukey) [16:12:46] !log klausman@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for newly-provisioned ml-lab1002 - klausman@cumin1002" [16:12:50] !log klausman@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for newly-provisioned ml-lab1002 - klausman@cumin1002" [16:12:50] !log klausman@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:07] !log installing postgresql-15 security updates [16:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:24] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [16:13:36] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [16:15:02] PROBLEM - Host gitlab-runner2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:41] !log klausman@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1002 [16:15:59] !log klausman@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1002 [16:20:07] !log klausman@cumin1002 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:21:47] (03PS1) 10Hnowlan: videoscaling: disable changeprop webVideoTranscode, enable mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101899 (https://phabricator.wikimedia.org/T371701) [16:23:00] (03CR) 10Scott French: [C:03+1] "Confirmed out of band with Atieno that this is in fact their signing public key." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:24:26] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20089866 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381893 (10MatthewVernon) 03NEW [16:25:20] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:26:01] !log klausman@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [16:27:09] (03CR) 10Arlolra: "Maybe? Do you think the keys will be interacted with other than with `gpg --fetch-keys "https://www.mediawiki.org/keys/keys.txt"`?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:30:08] (03CR) 10Scott French: [C:03+1] videoscaling: disable changeprop webVideoTranscode, enable mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101899 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:30:23] (03CR) 10Cathal Mooney: [C:03+2] Adjust how we build list of server BGP peerings for CRs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1099225 (https://phabricator.wikimedia.org/T381175) (owner: 10Cathal Mooney) [16:31:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101577 (owner: 10Arlolra) [16:35:55] (03CR) 10Hnowlan: [C:03+2] videoscaling: disable changeprop webVideoTranscode, enable mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101899 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:35:59] (03PS1) 10Cathal Mooney: Expose VRRP group assignment priority to Homer templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1101903 (https://phabricator.wikimedia.org/T381873) [16:37:32] (03Merged) 
10jenkins-bot: videoscaling: disable changeprop webVideoTranscode, enable mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101899 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:37:57] (03PS3) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) [16:38:37] (03PS4) 10Fabfur: Enable new countries for magru (Cohort 3) [dns] - 10https://gerrit.wikimedia.org/r/1100084 (https://phabricator.wikimedia.org/T371141) [16:38:41] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.10.0 - T381785 [16:38:43] 06SRE, 10SRE-swift-storage, 06Commons: Interieur - 's-Gravenhage - 20085391 - RCE.jpg inconsistent, needs new upload - https://phabricator.wikimedia.org/T381891#10394295 (10MatthewVernon) FTR, `rclone` does at least notice something went wrong: ` Dec 9 03:26:31 ms-be2069 swift-rclone-sync[2652562]: ERROR :... [16:38:55] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 24.10.0 - T381785 (duration: 00m 13s) [16:39:28] (03PS1) 10Scott French: shellbox: release latest image 2024-12-07-073046 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101902 (https://phabricator.wikimedia.org/T381830) [16:40:07] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [16:41:00] (03CR) 10Hnowlan: [C:03+1] shellbox: release latest image 2024-12-07-073046 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101902 (https://phabricator.wikimedia.org/T381830) (owner: 10Scott French) [16:43:27] (03CR) 10Scott French: [C:03+2] shellbox: release latest image 2024-12-07-073046 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101902 (https://phabricator.wikimedia.org/T381830) (owner: 10Scott French) [16:44:25] FIRING: 
SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:47] (03Merged) 10jenkins-bot: shellbox: release latest image 2024-12-07-073046 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101902 (https://phabricator.wikimedia.org/T381830) (owner: 10Scott French) [16:46:14] (03PS2) 10Hnowlan: mediawiki: get mercurius label from mediawiki image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) [16:47:53] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [16:48:30] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:48:42] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:49:03] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:49:14] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:49:25] FIRING: [5x] SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:29] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:49:41] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:50:16] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:50:27] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:50:49] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply 
[16:51:00] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:51:19] (03PS2) 10Cathal Mooney: Update JunOS templates to use VRRP priority exposed from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/1101861 (https://phabricator.wikimedia.org/T381873) [16:51:25] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:54:25] RESOLVED: [5x] SystemdUnitFailed: librenms-alerts.service on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:35] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/archiva 8805 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [16:56:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:56:53] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:57:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:57:36] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:57:57] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:58:16] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:58:37] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:58:58] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:59:03] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:59:18] !log elukey@cumin1002 END 
(PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:59:25] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:59:55] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:16] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:01:00] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:02:38] (03PS1) 10CDobbins: Remove eqiad from public and private IP spaces [dns] - 10https://gerrit.wikimedia.org/r/1101908 (https://phabricator.wikimedia.org/T380858) [17:02:59] (03PS1) 10Fabfur: varnish: pass WME HEAD reqs to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) [17:03:41] (03CR) 10CI reject: [V:04-1] varnish: pass WME HEAD reqs to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: 10Fabfur) [17:03:56] !log restarting eventgate-analytics to pick up stream config changes for T381322 [17:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:02] T381322: Rename Flink application and streams to match prod conventions - https://phabricator.wikimedia.org/T381322 [17:04:32] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [17:05:11] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [17:05:45] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [17:06:39] !log otto@deploy2002 helmfile [eqiad] 
DONE helmfile.d/services/eventgate-analytics: sync [17:07:41] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [17:08:28] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [17:08:35] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [17:09:09] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [17:09:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [17:09:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [17:10:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [17:10:20] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [17:10:41] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [17:10:57] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [17:11:18] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [17:11:41] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [17:12:02] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:12:28] !log klausman@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-lab1002.eqiad.wmnet with OS bookworm [17:12:58] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:13:15] !log deployed shellbox 2024-12-07-073046 for T381830 [17:13:17] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1002.eqiad.wmnet with OS bookworm [17:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:18] T381830: Deploy Shellbox 4.1.1 server - https://phabricator.wikimedia.org/T381830 [17:13:58] 
PROBLEM - MariaDB Replica SQL: s6 #page on db2158 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: ruwiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:14:34] !incidents [17:14:35] 5534 (UNACKED) db2158 (paged)/MariaDB Replica SQL: s6 (paged) [17:14:35] 5533 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [17:15:15] !ack 5534 [17:15:16] 5534 (ACKED) db2158 (paged)/MariaDB Replica SQL: s6 (paged) [17:15:45] (03PS2) 10Fabfur: varnish: pass WME HEAD reqs to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) [17:17:44] I'll depool db2158 in a min unless I hear an objection [17:18:43] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [17:18:48] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [17:19:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [17:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [17:21:58] PROBLEM - MariaDB Replica Lag: s6 #page on db2158 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 652.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:23:27] !incidents [17:23:28] 5534 (ACKED) db2158 (paged)/MariaDB Replica SQL: s6 (paged) [17:23:28] 5535 
(UNACKED) db2158 (paged)/MariaDB Replica Lag: s6 (paged) [17:23:28] 5533 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [17:23:33] !ack 5534 [17:23:33] 5534 (ACKED) db2158 (paged)/MariaDB Replica SQL: s6 (paged) [17:24:25] !log herron@cumin1002 dbctl commit (dc=all): 'depooling db2158 T381901', diff saved to https://phabricator.wikimedia.org/P71687 and previous config saved to /var/cache/conftool/dbconfig/20241210-172424-herron.json [17:24:29] T381901: MariaDB Replica SQL: s6 on db2158 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: ruwiki. - https://phabricator.wikimedia.org/T381901 [17:25:15] I'll fix that [17:25:38] thanks marostegui [17:25:48] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [17:27:58] RECOVERY - MariaDB Replica SQL: s6 #page on db2158 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:28:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2158.codfw.wmnet with reason: maintenance [17:29:00] RECOVERY - MariaDB Replica Lag: s6 #page on db2158 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:29:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: maintenance [17:30:15] (03PS1) 10Herron: pyrra: onboard liftwing slos [puppet] - 10https://gerrit.wikimedia.org/r/1101911 (https://phabricator.wikimedia.org/T302995) [17:30:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-lab1002.eqiad.wmnet with reason: host reimage [17:33:09] (03PS1) 10Hnowlan: mediawiki: fix mercurius multi-job 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1101912 (https://phabricator.wikimedia.org/T371701) [17:33:54] (03CR) 10Kamila Součková: [C:03+1] mediawiki: fix mercurius multi-job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101912 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:35:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71688 and previous config saved to /var/cache/conftool/dbconfig/20241210-173524-root.json [17:36:26] (03CR) 10Hnowlan: [C:03+2] mediawiki: fix mercurius multi-job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101912 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:38:21] RECOVERY - Host gitlab-runner2004 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [17:38:36] (03Merged) 10jenkins-bot: mediawiki: fix mercurius multi-job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101912 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [17:39:59] (03PS2) 10Herron: pyrra: onboard liftwing slos [puppet] - 10https://gerrit.wikimedia.org/r/1101911 (https://phabricator.wikimedia.org/T302995) [17:39:59] (03CR) 10Herron: [C:03+2] "self merge for onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101911 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:41:56] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [17:42:01] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [17:42:27] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10394660 (10odimitrijevic) Yes, I approve streamlining the access to WMDE staff in the same way that we do for WMF staff as proposed in https://phabricator.wikimedia.or... 
[17:42:38] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [17:42:50] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [17:43:02] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902 (10klausman) 03NEW [17:43:27] 10ops-eqiad, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: Stuck/bugged BMC on ml-lab1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381902#10394680 (10klausman) [17:47:03] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [17:47:16] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [17:48:01] (03PS1) 10Cwhite: Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101913 [17:48:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: 10Cwhite) [17:49:29] 06SRE: Console domain and property access request - https://phabricator.wikimedia.org/T381904 (10NBaca-WMF) 03NEW [17:49:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101913 (owner: 10Cwhite) [17:50:18] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101915 [17:50:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71690 and previous config saved to /var/cache/conftool/dbconfig/20241210-175029-root.json 
[17:50:33] 06SRE, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10netops: Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10394722 (10cmooney) >>! In T381389#10389706, @BTullis wrote: > This change looks fine to me, but would it be OK to wait until the New Ye... [17:54:41] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:55:29] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:56:17] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101915 (owner: 10PipelineBot) [17:57:23] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101915 (owner: 10PipelineBot) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T1800) [18:00:24] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:00:53] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:01:22] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [18:01:35] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1002" [18:02:07] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [18:02:20] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [18:02:52] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [18:05:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71691 and previous config saved to /var/cache/conftool/dbconfig/20241210-180534-root.json [18:09:29] 
FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71692 and previous config saved to /var/cache/conftool/dbconfig/20241210-182040-root.json [18:22:44] (03PS1) 10Herron: pyrra: disable liftwing-readability-latency slo [puppet] - 10https://gerrit.wikimedia.org/r/1101916 [18:25:37] (03PS6) 10Gmodena: dse-k8s-services: rename mw-dumps helmfiles. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) [18:26:51] (03CR) 10Herron: [C:03+2] pyrra: disable liftwing-readability-latency slo [puppet] - 10https://gerrit.wikimedia.org/r/1101916 (owner: 10Herron) [18:31:07] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10394858 (10Scott_French) [18:31:37] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10394860 (10Scott_French) 05Open→03Stalled Thanks, @Ammarpad - It would be great if you could please confirm your SSH public key via a second authenticated channel. A common solution for... [18:33:24] (03CR) 10Brouberol: dse-k8s-services: rename mw-dumps helmfiles. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100420 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [18:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71693 and previous config saved to /var/cache/conftool/dbconfig/20241210-183545-root.json [18:39:16] (03PS1) 10Hnowlan: mesh.configuration: dummy commit for 1.11.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 [18:39:16] (03PS1) 10Hnowlan: mesh.configuration: add tcp_keepalive support in 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [18:39:18] (03PS1) 10Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) [18:40:06] (03CR) 10CI reject: [V:04-1] mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [18:41:54] (03PS2) 10Hnowlan: mesh.configuration: add tcp_keepalive support in 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [18:42:24] (03PS1) 10Dzahn: Revert "miscweb: Update Envoy firewall config" [puppet] - 10https://gerrit.wikimedia.org/r/1101922 [18:42:44] (03CR) 10CI reject: [V:04-1] mesh.configuration: add tcp_keepalive support in 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [18:43:20] (03CR) 10Dzahn: [C:03+2] Revert "miscweb: Update Envoy firewall config" [puppet] - 10https://gerrit.wikimedia.org/r/1101922 (owner: 10Dzahn) [18:46:07] (03PS3) 10Gmodena: data-engineering: add alerts for dumps2 flink app. 
[alerts] - 10https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) [18:50:07] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 9609MiB (2% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [18:54:27] (03CR) 10Dzahn: [C:03+2] "this caused problems for https://commons-query.wikimedia.org/ and reverting fixed that for now. this will either move to k8s or go away or" [puppet] - 10https://gerrit.wikimedia.org/r/1092827 (owner: 10Muehlenhoff) [18:56:32] (03CR) 10Dzahn: [C:03+2] "https://phabricator.wikimedia.org/T381909" [puppet] - 10https://gerrit.wikimedia.org/r/1101922 (owner: 10Dzahn) [18:56:42] (03CR) 10Dzahn: [C:03+2] "https://phabricator.wikimedia.org/T381909" [puppet] - 10https://gerrit.wikimedia.org/r/1092827 (owner: 10Muehlenhoff) [18:57:52] (03PS2) 10Hnowlan: mesh.configuration: dummy commit for 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 [18:58:39] (03PS8) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventBus->send [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [19:04:49] (03CR) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventBus->send (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:05:02] (03CR) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventBus->send (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:09:59] (03PS9) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventBus->send [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [19:12:54] (03PS1) 10Aleksandar Mastilovic: Add the GitLab runner firewall 
rule for Blunderbuss [puppet] - 10https://gerrit.wikimedia.org/r/1101925 [19:16:18] (03CR) 10Ebernhardson: Add the GitLab runner firewall rule for Blunderbuss (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (owner: 10Aleksandar Mastilovic) [19:21:57] (03PS2) 10DDesouza: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) [19:22:37] (03PS2) 10Hnowlan: mediawiki: use mesh.configuration 1.11 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) [19:23:38] (03PS5) 10Hnowlan: mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) [19:25:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381881#10395002 (10phaultfinder) [19:28:09] (03PS2) 10Aleksandar Mastilovic: Added some explanatory comments [puppet] - 10https://gerrit.wikimedia.org/r/1101925 [19:35:35] (03CR) 10Scott French: [C:03+1] mesh.configuration: dummy commit for 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101917 (owner: 10Hnowlan) [19:39:09] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [19:39:44] (03CR) 10Pppery: "Why was this scheduled for deployment on December 10? The task says it shouldn't be deployed until December 16." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101867 (https://phabricator.wikimedia.org/T381421) (owner: 10أنون) [19:40:53] (03PS3) 10Aleksandar Mastilovic: Add Blunderbuss firewall rule to GitLab runner set [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) [19:41:31] (03CR) 10CI reject: [V:04-1] Add Blunderbuss firewall rule to GitLab runner set [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994) (owner: 10Aleksandar Mastilovic) [19:43:01] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.019e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [19:46:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10395080 (10Jdlrobson) Ammarpad has had +2 for some time and has demonstrated a good knowledge of our code and how it interconnects, particularly in the rendering layer. He has been super help... 
[19:48:27] (03CR) 10Scott French: [C:03+1] mesh.configuration: add tcp_keepalive/idle_timeout to 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101918 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan)
[19:51:56] (03PS4) 10Aleksandar Mastilovic: Add Blunderbuss firewall rule to GitLab runner set [puppet] - 10https://gerrit.wikimedia.org/r/1101925 (https://phabricator.wikimedia.org/T371994)
[19:58:26] (03CR) 10Scott French: [C:03+1] mediawiki: use mesh.configuration 1.11 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101919 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan)
[20:04:39] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[20:04:43] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[20:04:52] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[20:05:39] (03CR) 10JHathaway: [C:03+1] wdqs1025: remove unneeded host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1101888 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking)
[20:28:08] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[20:28:10] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[20:28:12] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[20:32:35] !log mforns@deploy2002 Started deploy [analytics/refinery@25c1946]: Regular analytics weekly train [analytics/refinery@25c1946c]
[20:35:37] (03CR) 10Ryan Kemper: [C:03+2] wdqs1025: remove unneeded host hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1101888 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking)
[20:36:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381881#10395227 (10VRiley-WMF) rebalanced power
[20:36:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381881#10395228 (10VRiley-WMF) 05Open→03Resolved
[20:37:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[20:37:19] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[20:38:12] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2026.codfw.wmnet, repooling source-only afterwards
[20:38:22] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - ps1-b4-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381540#10395232 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF rebalanced power
[20:38:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2026.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[20:45:47] !log mforns@deploy2002 Finished deploy [analytics/refinery@25c1946]: Regular analytics weekly train [analytics/refinery@25c1946c] (duration: 13m 12s)
[20:45:55] !log mforns@deploy2002 Started deploy [analytics/refinery@25c1946] (thin): Regular analytics weekly train THIN [analytics/refinery@25c1946c]
[20:46:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC - https://phabricator.wikimedia.org/T379717#10395266 (10VRiley-WMF) Can we proceed with swapping these?
[20:46:26] !log mforns@deploy2002 Finished deploy [analytics/refinery@25c1946] (thin): Regular analytics weekly train THIN [analytics/refinery@25c1946c] (duration: 00m 31s)
[20:46:45] !log mforns@deploy2002 Started deploy [analytics/refinery@25c1946] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@25c1946c]
[20:47:12] !log mforns@deploy2002 Finished deploy [analytics/refinery@25c1946] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@25c1946c] (duration: 00m 27s)
[20:54:28] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@2af4e1a]: Fix for the Commons Impact Metrics job
[20:55:56] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@2af4e1a]: Fix for the Commons Impact Metrics job (duration: 01m 38s)
[20:58:55] (03PS1) 10Kgraessle: Enable AutoModerator on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101937 (https://phabricator.wikimedia.org/T381000)
[21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241210T2100).
[21:00:05] bvibber, physikerwelt, danisztls, cjming, arlolra, and cwhite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:14] o/
[21:00:15] o/
[21:00:17] i can deploy
[21:00:20] here
[21:00:39] o/
[21:00:55] physikerwelt: i'll start with yours
[21:01:06] thank you
[21:01:08] (03PS3) 10Physikerwelt: Add new properties for math popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046)
[21:01:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) (owner: 10Physikerwelt)
[21:02:03] o/
[21:02:19] (03Merged) 10jenkins-bot: Add new properties for math popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101834 (https://phabricator.wikimedia.org/T381046) (owner: 10Physikerwelt)
[21:02:26] hi vibber: i'll do yours next
[21:02:28] tx
[21:02:34] *bvibber
[21:03:03] (03PS2) 10Bvibber: LanguageConverter: Ignore content inside and elements [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101600 (https://phabricator.wikimedia.org/T381617)
[21:03:09] 06SRE: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10395283 (10Scott_French) a:03Scott_French Thanks for the summary @NBaca-WMF. > 1. For the specific request, is there a way to get a list of all domains and properties that we own as an organization, so I can be sure...
[21:03:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101600 (https://phabricator.wikimedia.org/T381617) (owner: 10Bvibber)
[21:11:29] cjming: Thank you again. Tested everything on https://en.wikipedia.beta.wmflabs.org/wiki/T381046. Works fine as expected. Have a great day/night/...
[21:11:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10395298 (10Jhancock.wm) An exception occurred: KeyError: 'device_name' Traceback (most recent call last): File "/srv/netbox/customscrip...
[21:12:27] (03PS3) 10DDesouza: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660)
[21:12:38] physikerwelt: ur welcome :)
[21:19:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[21:19:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[21:19:50] (03CR) 10RLazarus: [C:03+2] deployment_server: Add release to mwscript-k8s -ojson output [puppet] - 10https://gerrit.wikimedia.org/r/1101607 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus)
[21:22:52] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, xfer wdqs scholarly 2023(public)->2026(internal)) xfer scholarly_articles from wdqs2023.codfw.wmnet -> wdqs2026.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[21:22:56] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[21:23:03] (03Merged) 10jenkins-bot: LanguageConverter: Ignore content inside and elements [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101600 (https://phabricator.wikimedia.org/T381617) (owner: 10Bvibber)
[21:23:21] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101600|LanguageConverter: Ignore content inside and elements (T381617)]]
[21:24:07] whee
[21:24:56] 😁
[21:27:05] bvibber: on test servers
[21:27:40] !log cjming@deploy2002 bvibber, cjming: Backport for [[gerrit:1101600|LanguageConverter: Ignore content inside and elements (T381617)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:27:48] testing
[21:29:55] cjming: looks good!
[21:29:56] proceed :D
[21:30:01] yay!
[21:30:03] !log cjming@deploy2002 bvibber, cjming: Continuing with sync
[21:30:39] (03PS4) 10DDesouza: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660)
[21:34:11] 06SRE, 06Infrastructure-Foundations, 10Mail: Log tls cipher information - https://phabricator.wikimedia.org/T381927 (10jhathaway) 03NEW
[21:35:17] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101600|LanguageConverter: Ignore content inside and elements (T381617)]] (duration: 11m 55s)
[21:36:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:36:50] (03Merged) 10jenkins-bot: Reader Survey: Increase coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101875 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:37:05] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101875|Reader Survey: Increase coverage (T378660)]]
[21:37:09] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:41:00] danisztls: your patch is up on test servers if you'd like to verify
[21:41:23] cjming: all looks good thanks
[21:41:43] !log cjming@deploy2002 cjming, dani: Backport for [[gerrit:1101875|Reader Survey: Increase coverage (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:41:46] !log cjming@deploy2002 cjming, dani: Continuing with sync
[21:42:36] (03PS2) 10Phuedx: Beta Cluster: Enable MetricsPlatform extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101840 (https://phabricator.wikimedia.org/T381849)
[21:47:07] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101875|Reader Survey: Increase coverage (T378660)]] (duration: 10m 02s)
[21:47:11] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:47:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101840 (https://phabricator.wikimedia.org/T381849) (owner: 10Phuedx)
[21:48:14] (03Merged) 10jenkins-bot: Beta Cluster: Enable MetricsPlatform extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101840 (https://phabricator.wikimedia.org/T381849) (owner: 10Phuedx)
[21:48:31] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1101840|Beta Cluster: Enable MetricsPlatform extension on all wikis (T381849 T381853)]]
[21:48:37] T381849: Community Updates module impressions lack experiment and variant information - https://phabricator.wikimedia.org/T381849
[21:48:37] T381853: MetricsPlatform: Update MetricsPlatformEnable config variable - https://phabricator.wikimedia.org/T381853
[21:52:45] !log cjming@deploy2002 cjming, phuedx: Backport for [[gerrit:1101840|Beta Cluster: Enable MetricsPlatform extension on all wikis (T381849 T381853)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:53:58] !log cjming@deploy2002 cjming, phuedx: Continuing with sync
[21:55:48] 06SRE: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10395444 (10Scott_French) @NBaca-WMF - When you get a chance could you please confirm the following: * That my (edited) interpretation for question #1 in T381904#10395283 is correct - i.e., you're interested in enumerat...
[21:56:30] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be1088.eqiad.wmnet with OS bookworm
[21:59:22] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101840|Beta Cluster: Enable MetricsPlatform extension on all wikis (T381849 T381853)]] (duration: 10m 50s)
[21:59:27] T381849: Community Updates module impressions lack experiment and variant information - https://phabricator.wikimedia.org/T381849
[21:59:27] T381853: MetricsPlatform: Update MetricsPlatformEnable config variable - https://phabricator.wikimedia.org/T381853
[22:00:30] arlolra: are you around?
[22:00:57] cwhite: are you around?
[22:01:26] o/
[22:02:38] (03PS3) 10Cwhite: Disable stats collection when WMF_MAINTENANCE_OFFLINE is set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609)
[22:02:58] cwhite: i'll pick up with your config patch
[22:03:06] Thank you!
[22:03:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: 10Cwhite)
[22:04:06] (03Merged) 10jenkins-bot: Disable stats collection when WMF_MAINTENANCE_OFFLINE is set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: 10Cwhite)
[22:04:22] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]]
[22:04:26] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609
[22:04:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:07:32] cwhite: on mwdebug
[22:08:04] !log cjming@deploy2002 cwhite, cjming: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:09:29] cwhite: lmk if/when to sync
[22:09:29] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:10:01] cjming: looks good on mwdebug
[22:10:09] nice
[22:10:11] !log cjming@deploy2002 cwhite, cjming: Continuing with sync
[22:11:07] (03PS2) 10Cwhite: Revert^2 "Stats: Move StatsFactory flush into emitBufferedStats" [core] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101913
[22:15:47] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]] (duration: 11m 24s)
[22:15:51] T380609: Maintenance scripts do not emit StatsLib metrics - https://phabricator.wikimedia.org/T380609
[22:19:19] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage
[22:22:01] cwhite: i'll do your backport now
[22:22:42] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1088.eqiad.wmnet with reason: host reimage
[22:23:25] cwhite: unless it can wait?
[22:24:19] cjming: It can wait if you need. There's a chance it will fail in the scap mwscript steps.
[22:27:11] that would be great - i gotta run
[22:28:47] ok :)
[22:31:53] (03PS1) 10Scott French: shellbox-video: allow egress to swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101944 (https://phabricator.wikimedia.org/T292322)
[22:36:08] !log end of UTC late backport window
[22:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:58] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1088.eqiad.wmnet with OS bookworm
[22:54:18] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[22:54:20] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1088.eqiad.wmnet with reason: T381919
[22:54:21] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[22:57:53] 06SRE: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10395576 (10NBaca-WMF) Hi @Scott_French - Thank you for taking a look at this! 1. Yes - this is a good summary. We can currently only see a subset of domains and properties, but I know there are many more out there, bo...
[23:23:53] (03CR) 10Alexandros Kosiaris: [C:03+1] shellbox-video: allow egress to swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101944 (https://phabricator.wikimedia.org/T292322) (owner: 10Scott French)
[23:40:45] (03CR) 10Scott French: [C:03+1] mediawiki: get mercurius label from mediawiki image version (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101889 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)