[00:00:28] (CR) Zabe: [C:+2] Initial configuration for tcywikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1084269 (https://phabricator.wikimedia.org/T377919) (owner: Zabe)
[00:01:19] (Merged) jenkins-bot: Initial configuration for tcywikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1084269 (https://phabricator.wikimedia.org/T377919) (owner: Zabe)
[00:03:08] !log zabe@deploy2002 Started scap sync-world: Creating tcywikisource (T377919)
[00:03:14] T377919: Create Wikisource Tulu - https://phabricator.wikimedia.org/T377919
[00:08:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P70670 and previous config saved to /var/cache/conftool/dbconfig/20241030-000833-ladsgroup.json
[00:11:21] !log zabe@deploy2002 Finished scap sync-world: Creating tcywikisource (T377919) (duration: 08m 13s)
[00:11:29] T377919: Create Wikisource Tulu - https://phabricator.wikimedia.org/T377919
[00:14:20] !log zabe@mwmaint2002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=tcywikisource --cluster=all 2>&1 | tee /tmp/tcywikisource.UpdateSearchIndexConfig.log # T377919
[00:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:25] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1084273
[00:18:25] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1084273 (owner: Zabe)
[00:19:08] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1084273 (owner: Zabe)
[00:19:31] !log zabe@deploy2002 Started scap sync-world: update interwiki cache
[00:23:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P70671 and previous config saved to /var/cache/conftool/dbconfig/20241030-002340-ladsgroup.json
[00:26:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:33] !log zabe@deploy2002 Finished scap sync-world: update interwiki cache (duration: 09m 01s)
[00:38:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1084278
[00:38:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1084278 (owner: TrainBranchBot)
[00:38:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T376905)', diff saved to https://phabricator.wikimedia.org/P70672 and previous config saved to /var/cache/conftool/dbconfig/20241030-003847-ladsgroup.json
[00:42:17] FIRING: JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:44:27] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 806 MB (0% inode=98%): /tmp 806 MB (0% inode=98%): /var/tmp 806 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[00:46:26] (PS1) RLazarus: scap: Exclude importImages from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1084279 (https://phabricator.wikimedia.org/T377497)
[00:56:44] (CR) Scott French: [C:+1] scap: Exclude importImages from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1084279 (https://phabricator.wikimedia.org/T377497) (owner: RLazarus)
[00:58:08] (CR) RLazarus: [C:+2] scap: Exclude importImages from mwscript deprecation warning [puppet] - https://gerrit.wikimedia.org/r/1084279 (https://phabricator.wikimedia.org/T377497) (owner: RLazarus)
[01:08:32] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1084281
[01:08:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1084281 (owner: TrainBranchBot)
[01:10:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1084278 (owner: TrainBranchBot)
[01:17:15] (Abandoned) Reedy: Make wikimania.wikimedia.org redirect to mobile site [puppet] - https://gerrit.wikimedia.org/r/601923 (owner: Reedy)
[01:17:18] (Abandoned) Reedy: Make api.wikimedia redirect to mobile site [puppet] - https://gerrit.wikimedia.org/r/601924 (https://phabricator.wikimedia.org/T254185) (owner: Reedy)
[01:21:25] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:21:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:38:59] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1084281 (owner: TrainBranchBot)
[01:55:01] (PS3) Reedy: index.html: Minor clarification of compromise text [software/klaxon] - https://gerrit.wikimedia.org/r/1078077
[02:04:07] (Abandoned) Reedy: Remove old wgAbuseFilterActorTableSchemaMigrationStage config [mediawiki-config] - https://gerrit.wikimedia.org/r/1041255 (https://phabricator.wikimedia.org/T188180) (owner: Reedy)
[02:19:02] (PS5) Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280
[02:19:14] (PS2) Reedy: InitialiseSettings.php: Fix comment about $wgCrossSiteAJAXdomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1077467
[02:19:56] (PS5) Reedy: Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107
[02:37:17] FIRING: [2x] JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[02:56:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:02:17] FIRING: [2x] JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:11:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:12:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:26:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T0600)
[06:46:25] !log arnaudb@cumin1002 START - Cookbook sre.mysql.sanitize-pii Checking PII for wikis tcywikisource in section s5
[06:47:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-pii (exit_code=0) Checking PII for wikis tcywikisource in section s5
[06:47:53] !log arnaudb@cumin1002 START - Cookbook sre.mysql.sanitize-pii Managing PII for wikis tcywikisource, tcywiktionary in section s5
[06:53:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-pii (exit_code=0) Managing PII for wikis tcywikisource, tcywiktionary in section s5
[06:59:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 1%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70673 and previous config saved to /var/cache/conftool/dbconfig/20241030-065920-arnaudb.json
[07:00:49] (CR) Arnaudb: "thanks for the confirmation!" [puppet] - https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[07:02:17] FIRING: JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:11:05] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:14:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 2%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70674 and previous config saved to /var/cache/conftool/dbconfig/20241030-071425-arnaudb.json
[07:19:29] (PS33) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146)
[07:20:26] (PS34) Arnaudb: mariadb: pii cleaner cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146)
[07:20:56] (CR) Arnaudb: "thanks for the feedback!" [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: Arnaudb)
[07:24:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[07:24:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance
[07:25:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:25:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:25:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70675 and previous config saved to /var/cache/conftool/dbconfig/20241030-072520-arnaudb.json
[07:25:25] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[07:29:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 4%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70676 and previous config saved to /var/cache/conftool/dbconfig/20241030-072930-arnaudb.json
[07:37:36] SRE, Wikimedia-Mailing-lists: Create Mailing list for Karavali Wikimedians User Group - https://phabricator.wikimedia.org/T378560#10275657 (Aklapper)
[07:40:59] (CR) Arnaudb: [C:+1] "https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/auto_schema/auto_schema/host.py#58" [cookbooks] - https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: Volans)
[07:44:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 5%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70677 and previous config saved to /var/cache/conftool/dbconfig/20241030-074436-arnaudb.json
[07:52:29] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[07:55:56] (CR) Arnaudb: [C:+1] orchestrator: do not retry on 500s [software/spicerack] - https://gerrit.wikimedia.org/r/1084170 (owner: Volans)
[07:55:59] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10275715 (elukey) Hey folks! I've uploaded all the Redfish licenses to these ganeti nodes, and ran provision again up to ganeti1043. I tried 1044 but it s...
[07:57:33] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[07:57:43] (CR) Arnaudb: [C:+1] mysql_legacy: accept any exit code for status [software/spicerack] - https://gerrit.wikimedia.org/r/1084171 (owner: Volans)
[07:59:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 10%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70678 and previous config saved to /var/cache/conftool/dbconfig/20241030-075941-arnaudb.json
[08:00:04] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:19] (PS1) KCVelaga: Update stream registration and config for MinT for Readers [mediawiki-config] - https://gerrit.wikimedia.org/r/1084704 (https://phabricator.wikimedia.org/T378565)
[08:03:07] (PS1) Muehlenhoff: Apply the ganeti role to ganeti204[34] [puppet] - https://gerrit.wikimedia.org/r/1084705 (https://phabricator.wikimedia.org/T376594)
[08:07:14] (PS1) Elukey: sre.hosts.provision: improve supermicro class [cookbooks] - https://gerrit.wikimedia.org/r/1084706 (https://phabricator.wikimedia.org/T365372)
[08:07:19] (CR) Slyngshede: [C:+2] Blocklog: Show the username of the admin on the public log. [software/bitu] - https://gerrit.wikimedia.org/r/1084063 (https://phabricator.wikimedia.org/T376991) (owner: Slyngshede)
[08:07:39] (CR) Volans: [C:+2] orchestrator: do not retry on 500s [software/spicerack] - https://gerrit.wikimedia.org/r/1084170 (owner: Volans)
[08:07:52] (CR) Volans: [C:+2] mysql_legacy: accept any exit code for status [software/spicerack] - https://gerrit.wikimedia.org/r/1084171 (owner: Volans)
[08:09:35] (Merged) jenkins-bot: Blocklog: Show the username of the admin on the public log. [software/bitu] - https://gerrit.wikimedia.org/r/1084063 (https://phabricator.wikimedia.org/T376991) (owner: Slyngshede)
[08:11:38] (CR) Slyngshede: [C:+1] "LGTM, It shouldn't affect that CAS integration." [puppet] - https://gerrit.wikimedia.org/r/1075608 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[08:14:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 25%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70680 and previous config saved to /var/cache/conftool/dbconfig/20241030-081446-arnaudb.json
[08:15:05] (CR) Slyngshede: [V:+1 C:+2] R:idp-test: Enable Redis on all test hosts. [puppet] - https://gerrit.wikimedia.org/r/1084045 (https://phabricator.wikimedia.org/T377728) (owner: Slyngshede)
[08:17:21] (Merged) jenkins-bot: orchestrator: do not retry on 500s [software/spicerack] - https://gerrit.wikimedia.org/r/1084170 (owner: Volans)
[08:17:21] (Merged) jenkins-bot: mysql_legacy: accept any exit code for status [software/spicerack] - https://gerrit.wikimedia.org/r/1084171 (owner: Volans)
[08:20:19] (CR) Arnaudb: [C:+2] mariadb: productionize db2239 [puppet] - https://gerrit.wikimedia.org/r/1084145 (https://phabricator.wikimedia.org/T373579) (owner: Arnaudb)
[08:25:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70682 and previous config saved to /var/cache/conftool/dbconfig/20241030-082547-arnaudb.json
[08:25:52] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[08:26:31] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:26:33] (CR) Muehlenhoff: [C:+2] Apply the ganeti role to ganeti204[34] [puppet] - https://gerrit.wikimedia.org/r/1084705 (https://phabricator.wikimedia.org/T376594) (owner: Muehlenhoff)
[08:28:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: host in preparation
[08:28:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: host in preparation
[08:29:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 50%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70683 and previous config saved to /var/cache/conftool/dbconfig/20241030-082952-arnaudb.json
[08:33:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:34:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:07] (PS1) MVernon: prometheus: set cephadm scrape interval to 60s [puppet] - https://gerrit.wikimedia.org/r/1084710 (https://phabricator.wikimedia.org/T279621)
[08:37:17] RESOLVED: JobUnavailable: Reduced availability for job cephadm in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:16] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:40:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70684 and previous config saved to /var/cache/conftool/dbconfig/20241030-084054-arnaudb.json
[08:41:54] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:44:25] RESOLVED: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti2044:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 75%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70685 and previous config saved to /var/cache/conftool/dbconfig/20241030-084457-arnaudb.json
[08:49:45] (CR) Arnaudb: [C:+1] prometheus: set cephadm scrape interval to 60s [puppet] - https://gerrit.wikimedia.org/r/1084710 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[08:50:09] (CR) MVernon: [C:+2] prometheus: set cephadm scrape interval to 60s [puppet] - https://gerrit.wikimedia.org/r/1084710 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[08:56:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70687 and previous config saved to /var/cache/conftool/dbconfig/20241030-085601-arnaudb.json
[08:56:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet
[09:00:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 100%: post clone repool', diff saved to https://phabricator.wikimedia.org/P70688 and previous config saved to /var/cache/conftool/dbconfig/20241030-090002-arnaudb.json
[09:01:12] (CR) Volans: [C:+2] "Let's move the discussion to a task, seems more appropriate: https://phabricator.wikimedia.org/T378572" [cookbooks] - https://gerrit.wikimedia.org/r/1077101 (https://phabricator.wikimedia.org/T377738) (owner: Volans)
[09:03:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet
[09:07:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2043.codfw.wmnet to cluster codfw and group D
[09:08:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2043.codfw.wmnet to cluster codfw and group D
[09:11:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70689 and previous config saved to /var/cache/conftool/dbconfig/20241030-091108-arnaudb.json
[09:11:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[09:11:14] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:11:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[09:11:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70690 and previous config saved to /var/cache/conftool/dbconfig/20241030-091131-arnaudb.json
[09:11:48] (CR) Peter Fischer: [C:+1] Migrate package to opensearch [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: Ebernhardson)
[09:13:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70691 and previous config saved to /var/cache/conftool/dbconfig/20241030-091343-arnaudb.json
[09:22:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet
[09:22:55] (CR) Volans: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1084706 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:25:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet
[09:26:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet
[09:28:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70692 and previous config saved to /var/cache/conftool/dbconfig/20241030-092850-arnaudb.json
[09:30:52] (PS3) Slyngshede: P:idp rewrite tgt lookup logic for idp-logout script [puppet] - https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728)
[09:33:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet
[09:33:22] SRE-swift-storage: Set up new S3-level replicated storage cluster "apus" - https://phabricator.wikimedia.org/T279621#10275951 (MatthewVernon) >>! In T279621#10274769, @colewhite wrote: > @MatthewVernon cephadm clusters are now being scraped, however the ones in codfw (moss-be200[123]) don't appear to have an...
[09:33:24] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling reboot on A:docker-registry
[09:34:34] (PS15) Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716)
[09:38:13] !log importing haproxykafka package into apt repository (T377613)
[09:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:19] T377613: Provide Debian packetization - https://phabricator.wikimedia.org/T377613
[09:40:45] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40676
[09:41:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40676
[09:43:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P70693 and previous config saved to /var/cache/conftool/dbconfig/20241030-094357-arnaudb.json
[09:50:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling reboot on A:docker-registry
[09:51:29] (PS16) Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716)
[09:55:14] (PS1) Volans: mysql_legacy: add getter for the Instance socket [software/spicerack] - https://gerrit.wikimedia.org/r/1084720
[09:58:35] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10276034 (jcrespo) @wiki_willy I am going to remove backup1010 and backup2010 from bacula and use it for mediabackups instead. This will solve my imm...
[09:59:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P70694 and previous config saved to /var/cache/conftool/dbconfig/20241030-095904-arnaudb.json
[09:59:10] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:59:14] (CR) Elukey: [C:+2] sre.hosts.provision: improve supermicro class [cookbooks] - https://gerrit.wikimedia.org/r/1084706 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1000)
[10:04:24] !log installing python-idna security updates
[10:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:04] (CR) Arnaudb: [C:+1] mysql_legacy: add getter for the Instance socket [software/spicerack] - https://gerrit.wikimedia.org/r/1084720 (owner: Volans)
[10:12:40] (PS27) Fabfur: haproxykafka: haproxykafka module [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128)
[10:12:40] (PS37) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[10:15:07] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[10:15:23] (CR) Volans: [C:+2] mysql_legacy: add getter for the Instance socket [software/spicerack] - https://gerrit.wikimedia.org/r/1084720 (owner: Volans)
[10:18:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 6461
[10:21:25] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6461
[10:22:03] (PS8) Fabfur: haproxy: add ring support to haproxy configuration [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332)
[10:24:49] (Merged) jenkins-bot: mysql_legacy: add getter for the Instance socket [software/spicerack] - https://gerrit.wikimedia.org/r/1084720 (owner: Volans)
[10:28:14] (CR) Brouberol: "Good questions! Every instance will ship with a kerberos keytab provisioned and provided by the DPE SRE team. So the Kerberos token will b" [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: Brouberol)
[10:29:51] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 14593
[10:31:18] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10276126 (MoritzMuehlenhoff)
[10:31:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 14593
[10:32:00] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 852
[10:32:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 852
[10:38:56] ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10276151 (SLyngshede-WMF)
[10:39:28] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 16347
[10:39:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16347
[10:40:03] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 16347
[10:40:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16347
[10:40:42] (PS1) Peter Fischer: CirrusSearch: Enable offloading weighted tags via EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150)
[10:42:43] (PS1) DCausse: rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149-1 [deployment-charts] - https://gerrit.wikimedia.org/r/1084731 (https://phabricator.wikimedia.org/T377938)
[10:42:44] (PS1) Muehlenhoff: Grant access to logstash to cn=logstash-access [puppet] - https://gerrit.wikimedia.org/r/1084732 (https://phabricator.wikimedia.org/T376790)
[10:43:07] (CR) Vgutierrez: haproxykafka: haproxykafka module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[10:43:21] (PS1) Volans: * remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733
[10:44:29] (PS1) Arnaudb: mariadb: add mycli [puppet] - https://gerrit.wikimedia.org/r/1084730
[10:44:29] (CR) Arnaudb: "This CR adds mycli which ease up querying with autocomplete and syntax coloration on mysql prompt." [puppet] - https://gerrit.wikimedia.org/r/1084730 (owner: Arnaudb)
[10:44:49] (CR) Vgutierrez: haproxykafka: profile and hiera files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[10:44:58] (PS1) Volans: Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734
[10:46:10] (CR) DCausse: [C:+1] CirrusSearch: Enable offloading weighted tags via EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150) (owner: Peter Fischer)
[10:46:43] (CR) Arnaudb: "I already know where to remove rogue accesses to private _remote_hosts! thanks!" [software/spicerack] - https://gerrit.wikimedia.org/r/1084733 (owner: Volans)
[10:46:48] (CR) Arnaudb: [C:+1] * remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733 (owner: Volans)
[10:48:46] (CR) Arnaudb: [C:+1] Use new remote_hosts getter in RemoteHostsAdapter (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[10:51:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150) (owner: Peter Fischer)
[10:51:42] (CR) Slyngshede: [C:+1] "Makes sense." [puppet] - https://gerrit.wikimedia.org/r/1084732 (https://phabricator.wikimedia.org/T376790) (owner: Muehlenhoff)
[10:52:56] (CR) Volans: mariadb: pii cleaner cookbook (19 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: Arnaudb)
[10:54:25] (PS2) Volans: remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733
[10:58:04] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10276208 (MoritzMuehlenhoff)
[10:58:42] SRE, SRE-Unowned, Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10276205 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff irc.wikimedia.org is powered...
[11:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1100). Please do the needful.
[11:01:19] (CR) Ladsgroup: [C:+1] Grant access to logstash to cn=logstash-access [puppet] - https://gerrit.wikimedia.org/r/1084732 (https://phabricator.wikimedia.org/T376790) (owner: Muehlenhoff)
[11:01:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2044.codfw.wmnet to cluster codfw and group D
[11:02:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2044.codfw.wmnet to cluster codfw and group D
[11:03:33] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276216 (MoritzMuehlenhoff)
[11:04:35] ops-codfw, SRE, DC-Ops, serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10276212 (tappof) I'm adding the serviceops tag to the task. I believe the issue is real for /dev/sda. ` root@wikikube-worker2068:~# udevadm info --query=all --name=/dev/sda | gr...
[11:06:05] (CR) FNegri: [C:+2] alertmanager: fix WMCS template [puppet] - https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: FNegri)
[11:06:38] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye
[11:08:43] (PS38) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[11:09:20] (CR) Fabfur: haproxykafka: profile and hiera files (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[11:09:23] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[11:09:28] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10276240 (elukey) >>! In T371416#10273774, @wiki_willy wrote: > > Also, @elukey - the RAID controller kit that Supermicro is currently suggesting fo...
[11:09:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[11:10:20] (CR) Tiziano Fogli: [C:+2] admin: add cdobbins to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1084218 (https://phabricator.wikimedia.org/T378517) (owner: CDobbins)
[11:11:49] SRE-swift-storage, Data-Persistence, DC-Ops, Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10276258 (MatthewVernon)
[11:12:52] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10276267 (elukey)
[11:14:16] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:14:28] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to the analytics cluster for CDobbins - https://phabricator.wikimedia.org/T378517#10276269 (tappof) Open→Resolved a:tappof merged.
[11:15:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:15:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:23] PROBLEM - Host ml-serve1009 is DOWN: PING CRITICAL - Packet loss = 100%
[11:17:22] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye
[11:17:31] (CR) Vgutierrez: haproxy: add ring support to haproxy configuration (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[11:18:51] RECOVERY - Host ml-serve1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[11:19:10] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye
[11:19:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[11:19:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:19:36] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276291 (ops-monitoring-bot) Draining ganeti2016.codfw.wmnet of running VMs
[11:22:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet
[11:23:17] ops-eqiad, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10276297 (Ladsgroup) Thanks for digging out the bug!
[11:23:29] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:25:40] (CR) Elukey: [C:+1] remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733 (owner: Volans)
[11:25:43] PROBLEM - Host ml-serve1010 is DOWN: PING CRITICAL - Packet loss = 100%
[11:26:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2003.codfw.wmnet to drbd
[11:26:33] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:26:49] (CR) Elukey: [C:+1] Use new remote_hosts getter in RemoteHostsAdapter [cookbooks] - https://gerrit.wikimedia.org/r/1084734 (owner: Volans)
[11:26:57] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276304 (ops-monitoring-bot) VM ml-etcd2003.codfw.wmnet switching disk type to drbd
[11:28:13] RECOVERY - Host ml-serve1010 is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms
[11:28:35] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:28:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1010.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:31:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084200 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[11:33:14] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:34:47] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10276326 (elukey)
[11:34:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078700 (https://phabricator.wikimedia.org/T369048) (owner: Hnowlan)
[11:35:07] PROBLEM - Host ml-serve1011 is DOWN: PING CRITICAL - Packet loss = 100%
[11:36:01] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:36:22] (Abandoned) Hnowlan: sessionstore: temporarily disable mesh on staging [deployment-charts] - https://gerrit.wikimedia.org/r/1048013 (https://phabricator.wikimedia.org/T363996) (owner: Hnowlan)
[11:36:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2003.codfw.wmnet to drbd
[11:36:33] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:17] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms
[11:37:35] RECOVERY - Host ml-serve1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[11:37:52] (Abandoned) Hnowlan: check_mw_versions: increase grace period from 1 hour to 2 [puppet] - https://gerrit.wikimedia.org/r/619482 (owner: Hnowlan)
[11:37:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[11:38:01] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:38:01] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276330 (ops-monitoring-bot) Draining ganeti2016.codfw.wmnet of running VMs
[11:38:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet
[11:38:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1011.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[11:38:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd2003.codfw.wmnet to plain
[11:39:09] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276334 (ops-monitoring-bot) VM ml-etcd2003.codfw.wmnet switching disk type to plain
[11:39:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd2003.codfw.wmnet to plain
[11:39:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[11:40:12] SRE, Ganeti, Infrastructure-Foundations: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10276335 (ops-monitoring-bot) Draining ganeti2016.codfw.wmnet of running VMs
[11:41:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[11:41:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[11:43:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[11:44:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[11:47:15] !log joal@deploy2002 Started deploy [analytics/refinery@0855ce2]: Regular analytics weekly train [analytics/refinery@0855ce28]
[11:47:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[11:48:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[11:48:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T376905)', diff saved to https://phabricator.wikimedia.org/P70696 and previous config saved to /var/cache/conftool/dbconfig/20241030-114808-ladsgroup.json
[11:53:10] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10276344 (elukey)
[11:55:30] !log joal@deploy2002 Finished deploy [analytics/refinery@0855ce2]: Regular analytics weekly train [analytics/refinery@0855ce28] (duration: 08m 14s)
[11:57:34] !log joal@deploy2002 Started deploy [analytics/refinery@0855ce2] (thin): Regular analytics weekly train THIN [analytics/refinery@0855ce28]
[11:57:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T376905)', diff saved to https://phabricator.wikimedia.org/P70697 and previous config saved to /var/cache/conftool/dbconfig/20241030-115735-ladsgroup.json
[12:00:32] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10276380 (elukey) Open→Resolved a:elukey Finally all the hosts without the license, that were manually configured, should be ok. The on...
[12:00:52] jouncebot: nowandnext
[12:00:52] No deployments scheduled for the next 0 hour(s) and 59 minute(s)
[12:00:52] In 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1300)
[12:02:12] (PS1) Dreamy Jazz: Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084742 (https://phabricator.wikimedia.org/T378447)
[12:02:34] (CR) Dreamy Jazz: [C:+2] Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084742 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:02:55] (CR) Ladsgroup: [C:+1] mariadb: add 12 new es hosts [puppet] - https://gerrit.wikimedia.org/r/1083758 (https://phabricator.wikimedia.org/T378143) (owner: Arnaudb)
[12:03:51] (PS1) Dreamy Jazz: Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084743 (https://phabricator.wikimedia.org/T378447)
[12:04:06] (CR) Dreamy Jazz: [C:+2] Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084743 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:04:28] !log joal@deploy2002 Finished deploy [analytics/refinery@0855ce2] (thin): Regular analytics weekly train THIN [analytics/refinery@0855ce28] (duration: 06m 54s)
[12:04:34] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084743 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:04:35] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084742 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:05:24] (CR) Volans: [C:+2] remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733 (owner: Volans)
[12:07:48] !log joal@deploy2002 Started deploy [analytics/refinery@0855ce2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0855ce28]
[12:08:53] (PS1) Dreamy Jazz: globalblocks API: Hide autoblocks when target param has username and IP [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084748 (https://phabricator.wikimedia.org/T377855)
[12:09:11] (CR) Dreamy Jazz: [C:+2] globalblocks API: Hide autoblocks when target param has username and IP [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084748 (https://phabricator.wikimedia.org/T377855) (owner: Dreamy Jazz)
[12:09:25] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084743 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:09:26] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084742 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:09:26] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084748 (https://phabricator.wikimedia.org/T377855) (owner: Dreamy Jazz)
[12:09:33] jouncebot: nowandnext
[12:09:33] No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[12:09:34] In 0 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1300)
[12:09:56] I missed my window due to daylight confusion time. Anyone mind if I use this one?
[12:10:16] I'm currently deploying
[12:10:45] But I don't think my changes should clash with yours
[12:11:10] yeah probably not, I'm just deploying on k8s
[12:11:29] !log joal@deploy2002 Finished deploy [analytics/refinery@0855ce2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0855ce28] (duration: 03m 41s)
[12:11:43] (Merged) jenkins-bot: Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084742 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:11:53] (PS1) Muehlenhoff: Remove ganeti role from ganeti2016 [puppet] - https://gerrit.wikimedia.org/r/1084749 (https://phabricator.wikimedia.org/T376594)
[12:12:19] !log installing podman security updates
[12:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:31] (CR) Mvolz: [C:+2] Use 0 workers [deployment-charts] - https://gerrit.wikimedia.org/r/1084076 (owner: Mvolz)
[12:12:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P70698 and previous config saved to /var/cache/conftool/dbconfig/20241030-121242-ladsgroup.json
[12:14:04] (Merged) jenkins-bot: Use 0 workers [deployment-charts] - https://gerrit.wikimedia.org/r/1084076 (owner: Mvolz)
[12:14:39] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10276434 (MoritzMuehlenhoff)
[12:15:14] (CR) Muehlenhoff: [C:+2] Grant access to logstash to cn=logstash-access [puppet] - https://gerrit.wikimedia.org/r/1084732 (https://phabricator.wikimedia.org/T376790) (owner: Muehlenhoff)
[12:15:40] (Merged) jenkins-bot: remote: add remote_hosts getter [software/spicerack] - https://gerrit.wikimedia.org/r/1084733 (owner: Volans)
[12:15:58] (Merged) jenkins-bot: Handle a missing parent block in GlobalBlockLookup::getUserBlock [extensions/GlobalBlocking] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084743 (https://phabricator.wikimedia.org/T378447) (owner: Dreamy Jazz)
[12:16:27] (CR) Ammarpad: [C:-1] "This is not what caused the issue" [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084255 (https://phabricator.wikimedia.org/T378531) (owner: Zabe)
[12:16:57] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:17:10] (CR) Ayounsi: [C:+2] Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[12:17:50] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:17:55] SRE, SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10276436 (JoelyRooke-WMDE) Hi @thcipriani, I haven't yet received the training on this so I'm not sure exactly what I will need, but I trust your judgement! I will also confirm w...
[12:17:57] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[12:18:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[12:18:43] (Merged) jenkins-bot: globalblocks API: Hide autoblocks when target param has username and IP [extensions/GlobalBlocking] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084748 (https://phabricator.wikimedia.org/T377855) (owner: Dreamy Jazz)
[12:19:09] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:19:29] (Merged) jenkins-bot: Add basic "revert" Netbox script [software/netbox-extras] - https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: Ayounsi)
[12:19:47] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:20:08] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1084743|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084742|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084748|globalblocks API: Hide autoblocks when target param has username and IP (T377855)]]
[12:20:29] T378447: TypeError: Argument 1 passed to MediaWiki\Extension\GlobalBlocking\Services\GlobalBlockLookup::getAutoblockReason() must be an instance of stdClass, bool given - https://phabricator.wikimedia.org/T378447
[12:21:25] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[12:21:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[12:22:16] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:22:28] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1084752
[12:22:45] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1084743|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084742|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084748|globalblocks API: Hide autoblocks when target param has username and IP (T377855)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:22:48] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:22:57] (Abandoned) Zabe: Revert "Skin: [BREAKING CHANGE] Remove support for rendering outside body element" [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084255 (https://phabricator.wikimedia.org/T378531) (owner: Zabe)
[12:24:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) (owner: Sergio Gimeno)
[12:25:03] (CR) Sergio Gimeno: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: Sergio Gimeno)
[12:25:56] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[12:27:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:27:40] (PS28) Fabfur: haproxykafka: haproxykafka module [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128)
[12:27:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P70699 and previous config saved to /var/cache/conftool/dbconfig/20241030-122749-ladsgroup.json
[12:27:54] (CR) Fabfur: haproxykafka: haproxykafka module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[12:28:13] (PS39) Fabfur: haproxykafka: profile and hiera files [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128)
[12:28:28] (CR) CI reject: [V:-1] haproxykafka: haproxykafka module [puppet] - https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[12:30:36] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084743|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084742|Handle a missing parent block in GlobalBlockLookup::getUserBlock (T378447)]], [[gerrit:1084748|globalblocks API: Hide autoblocks when target param has username and IP (T377855)]] (duration: 10m 28s)
[12:30:42] T378447: TypeError: Argument 1 passed to MediaWiki\Extension\GlobalBlocking\Services\GlobalBlockLookup::getAutoblockReason() must be an instance of stdClass, bool given - https://phabricator.wikimedia.org/T378447
[12:31:14] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: Fabfur)
[12:31:20] (CR) Brouberol: [C:+2] Add cu_log table to sqoop job [puppet] - https://gerrit.wikimedia.org/r/1082800 (https://phabricator.wikimedia.org/T364398) (owner: Snwachukwu)
[12:31:57] (PS1) Slyngshede: Start migrating Netbox alerts from Icinga. [alerts] - https://gerrit.wikimedia.org/r/1084758 (https://phabricator.wikimedia.org/T350694)
[12:33:31] (PS1) Ammarpad: Revert "Use array instead of string for class list" [skins/MinervaNeue] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084759 (https://phabricator.wikimedia.org/T378531)
[12:34:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T377942#10276538 (phaultfinder)
[12:36:49] (PS1) Ayounsi: Add temporary LVS community for liberica test [homer/public] - https://gerrit.wikimedia.org/r/1084760 (https://phabricator.wikimedia.org/T378453)
[12:37:13] (PS1) Dreamy Jazz: [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563)
[12:37:23] (PS1) Dreamy Jazz: [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084762 (https://phabricator.wikimedia.org/T378563)
[12:38:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084762 (https://phabricator.wikimedia.org/T378563) (owner: Dreamy Jazz)
[12:38:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: Dreamy Jazz)
[12:41:24] (PS3) Herron: profile::syslog::centralserver: use prometheus cert for blackbox check [puppet] - https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293)
[12:41:53] (PS4) Herron: profile::syslog::centralserver: use prometheus cert for blackbox check [puppet] - https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293)
[12:42:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T376905)', diff saved to https://phabricator.wikimedia.org/P70700 and previous config saved to /var/cache/conftool/dbconfig/20241030-124256-ladsgroup.json
[12:43:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[12:43:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[12:43:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[12:43:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[12:43:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T376905)', diff saved to https://phabricator.wikimedia.org/P70701 and previous config saved to /var/cache/conftool/dbconfig/20241030-124316-ladsgroup.json
[12:43:25] (CR) Herron: "nice! yeah I think that could work. updated to try that route" [puppet] - https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: Herron)
[12:48:19] SRE, Wikimedia-Mailing-lists: Create Mailing list for Karavali Wikimedians User Group - https://phabricator.wikimedia.org/T378560#10276585 (Ladsgroup) Open→Resolved a:Ladsgroup https://lists.wikimedia.org/postorius/lists/wikimedia-kvl.lists.wikimedia.org
[12:50:01] (PS17) Brouberol: airflow: enable Kerberos auth on the API backend [deployment-charts] - https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716)
[12:51:27] (CR) Kosta Harlan: [C:+1] [GlobalBlocking] Enable global autoblocks on all WMF wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1084152 (https://phabricator.wikimedia.org/T377760) (owner: Dreamy Jazz)
[12:51:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T376905)', diff saved to https://phabricator.wikimedia.org/P70702 and previous config saved to /var/cache/conftool/dbconfig/20241030-125150-ladsgroup.json
[12:53:59] (PS1) Mhorsey: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252)
[12:54:46] (CR) CI reject: [V:-1] Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: Mhorsey)
[12:54:56] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@ec4746b]: (no justification provided)
[12:55:06] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@ec4746b]: (no justification provided) (duration: 00m 11s)
[13:00:05] Urbanecm and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1300). Please do the needful.
[13:00:05] sergi0, pfischer, and Dreamy Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] i can deploy today!
[13:00:17] o/
[13:00:19] hey sergi0
[13:00:23] o/
[13:00:25] and pfischer
[13:00:29] and Dreamy_Jazz
[13:00:36] \o
[13:00:50] It seems that my IRC nick was wrong, so didn't get pinged.
[13:00:51] (CR) Urbanecm: [C:+2] [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.43.0-wmf.28) - https://gerrit.wikimedia.org/r/1084762 (https://phabricator.wikimedia.org/T378563) (owner: Dreamy Jazz)
[13:00:52] (CR) Urbanecm: [C:+2] [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: Dreamy Jazz)
[13:01:01] o yeah, a space...
[13:01:08] isn't that illegal in IRC nicks?
[13:01:26] Probably. I typed it wrong into the schedule deployment tool.
[13:01:57] (CR) Urbanecm: [C:+2] [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: Sergio Gimeno)
[13:02:10] sergi0: do you want to do the backport too?
[13:02:53] you mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1084181?
[13:03:00] sergi0: correct [13:03:01] (03PS2) 10Sergio Gimeno: Growth [test2wiki]: enable community updates module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) [13:03:09] (03CR) 10Urbanecm: [C:03+2] Growth [test2wiki]: enable community updates module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) (owner: 10Sergio Gimeno) [13:03:18] (03PS2) 10Mhorsey: Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) [13:03:26] 06SRE, 06Infrastructure-Foundations: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10276681 (10MoritzMuehlenhoff) For transparency: The ssotest03 user is used by myself for tests and has been temporarily added to cn=logstash-access. [13:03:45] yeah, but doesn't the change need to be +2'ed by jenkins first? [13:04:07] (03Merged) 10jenkins-bot: Growth [test2wiki]: enable community updates module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084036 (https://phabricator.wikimedia.org/T376952) (owner: 10Sergio Gimeno) [13:04:57] sergi0: yes, but we now know what the test fixes were for that, so we just need to backport them too? [13:05:35] that's what I wasn't sure about. I believe yes. 
[13:06:49] sergi0: so if you can prepare backports for those in the meantime, and add depends-on to the backport, we can do that [13:06:56] but i'm not sure if you know what the missing patches are [13:06:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P70703 and previous config saved to /var/cache/conftool/dbconfig/20241030-130657-ladsgroup.json [13:07:10] let me try to find them [13:07:49] (03PS5) 10Urbanecm: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:07:52] (03CR) 10Urbanecm: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:07:56] (03CR) 10Urbanecm: [C:03+2] [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:08:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:09:05] 06SRE, 06Infrastructure-Foundations, 06SRE Observability: Split the permission to access Logstash from the cn=wmf and cn=nda groups - https://phabricator.wikimedia.org/T376790#10276699 (10herron) [13:10:09] come on, CI... 
[13:10:11] it's just a config patch [13:10:30] (03PS29) 10Fabfur: haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) [13:10:47] (03PS40) 10Fabfur: haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) [13:11:12] (03PS1) 10Sergio Gimeno: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084774 (https://phabricator.wikimedia.org/T378573) [13:12:57] (03PS1) 10Sergio Gimeno: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) [13:13:35] (03PS1) 10Sergio Gimeno: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 [13:13:39] (03PS3) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [13:13:40] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [13:13:49] PROBLEM - Disk space on snapshot1012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%): /tmp 0 MB (0% inode=98%): /var/tmp 0 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=snapshot1012&var-datasource=eqiad+prometheus/ops [13:14:55] 06SRE, 10decommission-hardware: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596 (10MoritzMuehlenhoff) 03NEW [13:16:56] @urbanecm I believe the proper dependencies are now added in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1084181, let's see 
[13:17:05] sergi0: thanks, let's see what ci says [13:17:31] why is CI refusing to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1081099 ... [13:18:44] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@ec4746b]: (no justification provided) [13:18:50] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@ec4746b]: (no justification provided) (duration: 00m 07s) [13:19:03] (03Merged) 10jenkins-bot: [Growth] beta: configure the A/B test experiment variants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081099 (https://phabricator.wikimedia.org/T377233) (owner: 10Sergio Gimeno) [13:19:17] (03PS1) 10Slyngshede: Disable LDAPPasswordValidator. [software/bitu] - 10https://gerrit.wikimedia.org/r/1084777 [13:19:29] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1084036|Growth [test2wiki]: enable community updates module (T376952)]], [[gerrit:1081099|[Growth] beta: configure the A/B test experiment variants (T377233)]] [13:19:35] T376952: Community updates module: Update Superset dashboard to support pilot wiki experiment - https://phabricator.wikimedia.org/T376952 [13:19:36] T377233: Show Community updates module based on experiment variant - https://phabricator.wikimedia.org/T377233 [13:20:01] here we og [13:20:02] !log upgrade PHP 7.4 on mwdebug* to 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u2+icu67u3 T378173 [13:20:04] *go [13:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:59] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10276765 (10tappof) [13:22:49] (03CR) 10Daimona Eaytoy: [C:03+1] Exclude affiliates from P&E dashboard integration for CampaignEvents Extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084765 (https://phabricator.wikimedia.org/T377252) (owner: 10Mhorsey) [13:31:17] finally [13:31:18] sergi0: can you test your changes at 
mwdebug? [13:31:18] where are logmsgbot announcements? [13:31:18] and wikibugs, while we're at it [13:31:28] http://wm-bot.wmcloud.org/dump/%23wikimedia-operations.htm [13:31:28] @info [13:31:30] ok, some messages are coming through [13:32:13] sergi0: how are the tests looking? [13:32:25] (03PS1) 10Urbanecm: cswiki: Add celebration logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084779 (https://phabricator.wikimedia.org/T378597) [13:32:34] (03PS2) 10Peter Fischer: CirrusSearch: Enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150) [13:32:35] (03CR) 10Urbanecm: [C:03+2] CirrusSearch: Enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150) (owner: 10Peter Fischer) [13:32:46] (03PS3) 10Arnaudb: mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) [13:32:49] (03Merged) 10jenkins-bot: [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084762 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:32:51] (03Merged) 10jenkins-bot: CirrusSearch: Enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084729 (https://phabricator.wikimedia.org/T377150) (owner: 10Peter Fischer) [13:32:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1084777 (owner: 10Slyngshede) [13:33:09] (03CR) 10Slyngshede: [C:03+1] Disable LDAPPasswordValidator. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1084777 (owner: 10Slyngshede) [13:33:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10276797 (10MoritzMuehlenhoff) [13:33:27] (03PS1) 10Tiziano Fogli: add Joely Rooke WMDE to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/1084780 (https://phabricator.wikimedia.org/T378082) [13:33:35] Annoyingly one of my wmf backports has had a flaky selenium test cause a failure [13:34:41] (03CR) 10CI reject: [V:04-1] Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 (owner: 10Sergio Gimeno) [13:34:57] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [13:35:05] (03CR) 10CI reject: [V:04-1] [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:35:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P70704 and previous config saved to /var/cache/conftool/dbconfig/20241030-132204-ladsgroup.json [13:35:36] testing now [13:35:38] (03CR) 10Dreamy Jazz: [C:03+2] "Flaky selenium test - try gate-and-submit-wmf again" [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:39:56] sergi0: ping? 
[13:40:00] (03CR) 10CI reject: [V:04-1] [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:40:05] (03CR) 10CI reject: [V:04-1] mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [13:40:06] yeah, nothing breaks [13:46:27] (03CR) 10CI reject: [V:04-1] Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [13:46:27] Dreamy_Jazz: that's quite annoying... [13:46:27] (03PS1) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) [13:46:27] (03PS2) 10FNegri: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) [13:46:27] !log urbanecm@deploy2002 sgimeno, urbanecm: Continuing with sync [13:46:27] (03CR) 10Dreamy Jazz: [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:46:27] (03CR) 10Dreamy Jazz: [C:03+2] [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [13:46:27] And again... 
[13:46:27] Apologies, got confused about which change you were referring to, test2wiki is fine [13:46:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961#10276849 (10Jclark-ctr) Dell has agreed to replace mainboard and cpu. should be this week [13:46:27] sergi0: thanks! [13:46:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1084780 (https://phabricator.wikimedia.org/T378082) (owner: 10Tiziano Fogli) [13:46:27] (03CR) 10CI reject: [V:04-1] WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:46:27] at least one Dreamy_Jazz backport got through [13:46:42] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove stub certs for ms-fe [labs/private] - 10https://gerrit.wikimedia.org/r/1084150 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:47:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70707 and previous config saved to /var/cache/conftool/dbconfig/20241030-134715-ladsgroup.json [13:47:25] (03Abandoned) 10Muehlenhoff: Remove obsolete role [puppet] - 10https://gerrit.wikimedia.org/r/1083159 (https://phabricator.wikimedia.org/T359387) (owner: 10Muehlenhoff) [13:47:37] (03PS4) 10Arnaudb: mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) [13:48:17] Dreamy_Jazz: do the backports need to go before the config change? 
[13:48:24] Yes [13:48:27] okay, i thought so [13:48:30] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084036|Growth [test2wiki]: enable community updates module (T376952)]], [[gerrit:1081099|[Growth] beta: configure the A/B test experiment variants (T377233)]] (duration: 29m 00s) [13:48:35] T376952: Community updates module: Update Superset dashboard to support pilot wiki experiment - https://phabricator.wikimedia.org/T376952 [13:48:36] T377233: Show Community updates module based on experiment variant - https://phabricator.wikimedia.org/T377233 [13:48:49] The config change can wait and I can self-deploy it when there is a free point in the calendar today [13:49:03] (03PS2) 10Urbanecm: cswiki: Add celebration logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084779 (https://phabricator.wikimedia.org/T378597) [13:49:08] (03CR) 10Urbanecm: [C:03+2] cswiki: Add celebration logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084779 (https://phabricator.wikimedia.org/T378597) (owner: 10Urbanecm) [13:49:17] Dreamy_Jazz: sounds good [13:49:29] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10276878 (10Jclark-ctr) @fnegri @wiki_willy just to be advised they are having issues on getting parts and will not get them for a little over a week. ` Dell Tech... 
[13:49:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084779 (https://phabricator.wikimedia.org/T378597) (owner: 10Urbanecm) [13:50:23] (03Merged) 10jenkins-bot: cswiki: Add celebration logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084779 (https://phabricator.wikimedia.org/T378597) (owner: 10Urbanecm) [13:50:52] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1084762|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084729|CirrusSearch: Enable offloading weighted tags via EventBus (T377150)]], [[gerrit:1084779|cswiki: Add celebration logo (T378597)]] [13:50:57] (03PS5) 10Arnaudb: mysql_legacy: fix _list_host_instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) [13:51:08] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [13:51:08] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [13:51:08] T378597: Add celebration logo for cswiki - https://phabricator.wikimedia.org/T378597 [13:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:53:20] !log urbanecm@deploy2002 dreamyjazz, pfischer, urbanecm: Backport for [[gerrit:1084762|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084729|CirrusSearch: Enable offloading weighted tags via EventBus (T377150)]], [[gerrit:1084779|cswiki: Add celebration logo (T378597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:53:33] Dreamy_Jazz: can you 
test the backport at wmf.28? [13:53:37] My change isn't really testable [13:53:40] okay [13:53:44] It requires global autoblocks to be enabled [13:53:49] makes sense [13:53:57] pfischer: is it possible to test yours? or do we need to do the scripts to know for sure? [13:54:22] my change works [13:54:43] urbanecm: I would at least expect the config flag to show up on meta (via mwdebug) [13:54:52] pfischer: okay, waiting for confirmation then [13:57:31] !log joal@deploy2002 Started deploy [airflow-dags/analytics@ec4746b]: Regular analytics weekly train [airflow-dags/analytics@ec4746b5] [13:58:13] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@ec4746b]: Regular analytics weekly train [airflow-dags/analytics@ec4746b5] (duration: 00m 41s) [13:59:31] urbanecm: I cannot see the flag on https://meta.wikimedia.org/wiki/Special:ApiSandbox#action=cirrus-config-dump&format=json&formatversion=2 (via mwdebug2002.codfw.wmnet) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1400) [14:00:33] spilling over B&C a little [14:00:43] pfischer: i don't see it in the list of variables to dump (https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/62c90b5579a60a82016ae4245dcd9ac9707e0750/includes/Api/ConfigDump.php#35) though [14:01:02] so unless you think this is reason to pause and investigate, i think we can go ahead and see what the script does? 
[14:01:31] ouch, my bad, so no obvious way to test it right now, we can go on then [14:01:41] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10276940 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:01:46] !log urbanecm@deploy2002 dreamyjazz, pfischer, urbanecm: Continuing with sync [14:01:49] proceeding then [14:02:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P70709 and previous config saved to /var/cache/conftool/dbconfig/20241030-140222-ladsgroup.json [14:02:39] (03CR) 10Fabfur: [C:03+2] haproxykafka: haproxykafka module [puppet] - 10https://gerrit.wikimedia.org/r/1083203 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:02:49] 06SRE, 10SRE-Access-Requests: Access to ops mailing list - https://phabricator.wikimedia.org/T378484#10276947 (10zoe) It says that my request is already pending, though I see there's an email address I can try. Thank you! [14:03:20] (03CR) 10Fabfur: [C:03+2] haproxykafka: profile and hiera files [puppet] - 10https://gerrit.wikimedia.org/r/1083204 (https://phabricator.wikimedia.org/T374128) (owner: 10Fabfur) [14:06:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:06:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10276958 (10Jhancock.wm) I got the memory in. Is it safe to proceed with the upgrade at this time? I didn't see if it got depooled already. 
[14:06:22] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084762|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084729|CirrusSearch: Enable offloading weighted tags via EventBus (T377150)]], [[gerrit:1084779|cswiki: Add celebration logo (T378597)]] (duration: 15m 30s) [14:06:28] and deployed [14:06:39] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [14:06:39] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [14:06:40] T378597: Add celebration logo for cswiki - https://phabricator.wikimedia.org/T378597 [14:06:41] urbanecm: thanks! [14:06:51] 10ops-codfw, 06SRE, 06DC-Ops: codfw puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10276962 (10Jhancock.wm) @MoritzMuehlenhoff I have all the ram in for this. we can schedule these when you're ready. [14:07:15] Dreamy_Jazz: can you do the remaining 2 changes of yours when the backport merges, or do you want me to finish it up too? [14:07:33] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [14:07:49] pfischer: should i restart the script now as well? [14:08:03] urbanecm: yes please! [14:08:21] urbanecm: I already see stuff coming in [14:08:31] (on kafka) [14:08:38] pfischer: additions? or removals? 
[14:08:44] I can self-merge urbanecm [14:08:56] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:57] Dreamy_Jazz: ack, leaving deployment host over to you then :) [14:09:29] urbanecm: removals [14:09:49] pfischer: okay, i'll restart the script anyway in that case, as removals are issued in the post-edit hook [14:09:58] (03PS1) 10Muehlenhoff: Remove puppetserver2002 out of active service for RAM upgrade [dns] - 10https://gerrit.wikimedia.org/r/1084792 (https://phabricator.wikimedia.org/T376057) [14:10:36] (03CR) 10Ssingh: [C:03+1] Remove puppetserver2002 out of active service for RAM upgrade [dns] - 10https://gerrit.wikimedia.org/r/1084792 (https://phabricator.wikimedia.org/T376057) (owner: 10Muehlenhoff) [14:10:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker2068 - https://phabricator.wikimedia.org/T378255#10276995 (10Jhancock.wm) thanks for the confirmation! the server was out of warranty so I replaced the bad drive with one from stock. please let us know if that helped clear it up an... [14:10:59] (03PS1) 10Fabfur: haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) [14:11:25] (03CR) 10Andrea Denisse: [C:03+1] "LGTM,thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [14:11:36] !log mwmaint2002: kill all running instances of `refreshLinkRecommendations.php` (T377150) [14:11:40] (03CR) 10CI reject: [V:04-1] haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] T377150: Config: enable CirrusSearchEnableEventBusWeightedTags - https://phabricator.wikimedia.org/T377150 [14:11:44] (03CR) 10Ssingh: [C:03+1] haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:12:06] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2002 out of active service for RAM upgrade [dns] - 10https://gerrit.wikimedia.org/r/1084792 (https://phabricator.wikimedia.org/T376057) (owner: 10Muehlenhoff) [14:12:15] (03PS2) 10Fabfur: haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) [14:12:22] pfischer: you should be seeing additions shortly too [14:13:26] (03CR) 10Ssingh: [C:03+1] "If spec passes which it should now 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:13:34] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: remove user creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:13:48] (03PS3) 10Fabfur: haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) [14:14:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 
3f5945bb0b6f61c43857b638da7c5e0696e3addd, dns.git is 4786f0078ac84ef3366937ae6cd4bed271fb2e51) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:14:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 3f5945bb0b6f61c43857b638da7c5e0696e3addd, dns.git is 4786f0078ac84ef3366937ae6cd4bed271fb2e51) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:14:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 3f5945bb0b6f61c43857b638da7c5e0696e3addd, dns.git is 4786f0078ac84ef3366937ae6cd4bed271fb2e51) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:14:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 3f5945bb0b6f61c43857b638da7c5e0696e3addd, dns.git is 4786f0078ac84ef3366937ae6cd4bed271fb2e51) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:14:42] yeah, we need to bump up the timing a bit for this [14:14:53] updating [14:15:25] FIRING: [6x] SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:11] (03PS4) 10Fabfur: haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) [14:16:30] (03Merged) 10jenkins-bot: [BlockManager] Don't assume autoblocks have ::getParentBlockId [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084761 (https://phabricator.wikimedia.org/T378563) 
(owner: 10Dreamy Jazz) [14:16:40] Dreamy_Jazz: ^^ fyi [14:16:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:16:51] Thanks. Are you done with the other deploys? [14:16:55] (03CR) 10Fabfur: haproxykafka: remove user creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:17:17] Dreamy_Jazz: with deploys, yes. running some maintenance scripts, but that shouldn't impact you [14:17:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P70710 and previous config saved to /var/cache/conftool/dbconfig/20241030-141729-ladsgroup.json [14:17:35] (03PS1) 10Ssingh: P:dns::auth::update: bump retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/1084794 [14:17:42] feel free to take over [14:17:49] (03PS2) 10Ssingh: P:dns::auth::update: bump retry_interval for authdns-update check [puppet] - 10https://gerrit.wikimedia.org/r/1084794 [14:18:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084152 (https://phabricator.wikimedia.org/T377760) (owner: 10Dreamy Jazz) [14:18:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083848 (https://phabricator.wikimedia.org/T377990) (owner: 10Esanders) [14:18:52] (03Merged) 10jenkins-bot: [GlobalBlocking] Enable global 
autoblocks on all WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084152 (https://phabricator.wikimedia.org/T377760) (owner: 10Dreamy Jazz) [14:19:23] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1084761|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084152|[GlobalBlocking] Enable global autoblocks on all WMF wikis (T377760)]] [14:19:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:19:29] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [14:19:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:19:29] T377760: Enable global autoblocks on WMF wikis - https://phabricator.wikimedia.org/T377760 [14:19:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:19:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [14:20:20] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetserver2002.codfw.wmnet with reason: RAM expansion [14:20:31] (03CR) 10Ssingh: [C:03+2] P:dns::auth::update: bump retry_interval for authdns-update check [puppet] - 10https://gerrit.wikimedia.org/r/1084794 (owner: 10Ssingh) [14:20:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on puppetserver2002.codfw.wmnet with reason: RAM expansion [14:21:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:21:47] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1084761|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084152|[GlobalBlocking] Enable global autoblocks on all WMF wikis (T377760)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:50] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host aux-k8s-ctrl1002.eqiad.wmnet [14:22:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host aux-k8s-ctrl1002.eqiad.wmnet [14:23:35] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl1002.eqiad.wmnet with OS bookworm [14:23:39] pfischer: we should hopefully have some additions too [14:23:40] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084731 (https://phabricator.wikimedia.org/T377938) (owner: 10DCausse) [14:23:55] the script's not reporting any errors on Growth's side [14:23:59] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:25:37] (03CR) 10Vgutierrez: [C:03+1] "please submit different CRs in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:25:45] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump image to flink-1.17.1-rdf-0.3.149-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084731 (https://phabricator.wikimedia.org/T377938) (owner: 10DCausse) [14:25:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv6: Connect - aux-k8s-eqiad, AS64610/IPv4: Active - aux-k8s-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:25:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64610/IPv4: Active - aux-k8s-eqiad, AS64610/IPv6: Connect - aux-k8s-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:26:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:28:33] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084761|[BlockManager] Don't assume autoblocks have ::getParentBlockId (T378563)]], [[gerrit:1084152|[GlobalBlocking] Enable global autoblocks on all WMF wikis (T377760)]] (duration: 09m 10s) [14:28:38] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [14:28:39] T377760: Enable global autoblocks on WMF wikis - https://phabricator.wikimedia.org/T377760 [14:28:44] I'm done. 
[14:28:56] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:02] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:30:22] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:30:25] RESOLVED: [6x] SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T376905)', diff saved to https://phabricator.wikimedia.org/P70711 and previous config saved to /var/cache/conftool/dbconfig/20241030-143236-ladsgroup.json [14:32:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [14:32:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [14:33:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T376905)', diff saved to https://phabricator.wikimedia.org/P70712 and previous config saved to /var/cache/conftool/dbconfig/20241030-143303-ladsgroup.json [14:34:33] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:34:35] (03PS1) 10Muehlenhoff: Revert "Remove puppetserver2002 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084810 (https://phabricator.wikimedia.org/T376058) [14:34:43] !log dcausse@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/rdf-streaming-updater: apply [14:34:46] (03PS1) 10Fabfur: hiera: fix haproxykafka socket file mode [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [14:35:25] (03PS1) 10Gergő Tisza: Increase log level for autocreation callback [extensions/CentralAuth] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084812 (https://phabricator.wikimedia.org/T378289) [14:35:40] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:35:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084812 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [14:35:54] (03CR) 10Ssingh: [C:03+1] Revert "Remove puppetserver2002 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084810 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [14:36:53] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove puppetserver2002 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084810 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [14:37:35] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:37:36] FIRING: [3x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:43] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [14:37:48] !log dcausse@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/rdf-streaming-updater: apply [14:37:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet [14:38:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10277126 (10elukey) 05Open→03Resolved Declaring this as closed since we have tested everything that we needed :) [14:41:05] (03PS2) 10Arnaudb: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) [14:41:05] (03CR) 10Arnaudb: "this is a code proposition, not the definitive CR" [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [14:41:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [14:42:52] !log joal@deploy2002 Started deploy [airflow-dags/analytics@ec02629]: Regular analytics weekly train SECOND [airflow-dags/analytics@ec02629d] [14:43:47] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@ec02629]: Regular analytics weekly train SECOND [airflow-dags/analytics@ec02629d] (duration: 00m 55s) [14:44:11] (03PS3) 10Arnaudb: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) [14:44:52] (03CR) 10Volans: mysql_legacy: fix _list_host_instances (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1084132 (https://phabricator.wikimedia.org/T374191) (owner: 10Arnaudb) [14:45:49] (03PS1) 10Muehlenhoff: Remove puppetserver2003 out of active service for RAM upgrade [dns] - 10https://gerrit.wikimedia.org/r/1084815 (https://phabricator.wikimedia.org/T376058) [14:46:40] (03CR) 10Ssingh: [C:03+1] Remove puppetserver2003 out of active service for RAM upgrade [dns] - 
10https://gerrit.wikimedia.org/r/1084815 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [14:47:21] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2003 out of active service for RAM upgrade [dns] - 10https://gerrit.wikimedia.org/r/1084815 (https://phabricator.wikimedia.org/T376058) (owner: 10Muehlenhoff) [14:47:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:48:35] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti role from ganeti2016 [puppet] - 10https://gerrit.wikimedia.org/r/1084749 (https://phabricator.wikimedia.org/T376594) (owner: 10Muehlenhoff) [14:49:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti2035 to ganeti2044 and decom ganeti2009 to ganeti2018 - https://phabricator.wikimedia.org/T376594#10277244 (10MoritzMuehlenhoff) [14:50:19] PROBLEM - ganeti-noded running on ganeti2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:50:19] PROBLEM - ganeti-confd running on ganeti2016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:51:12] (03CR) 10CI reject: [V:04-1] sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [14:51:43] (03PS2) 10Fabfur: hiera: fix haproxykafka socket file mode [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [14:51:48] ^ ganeti alert is just noise, host got it's ganeti role removed [14:52:05] (03PS5) 10Fabfur: haproxykafka: remove user creation [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) [14:52:36] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:38] (03PS3) 10Scott French: Add JobQueueLowTrafficProcessingRateTooHigh alert [alerts] - 10https://gerrit.wikimedia.org/r/1083904 (https://phabricator.wikimedia.org/T378609) [14:54:33] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:37] (03CR) 10Slyngshede: [C:03+2] P:idp rewrite tgt lookup logic for idp-logout script [puppet] - 10https://gerrit.wikimedia.org/r/1084037 (https://phabricator.wikimedia.org/T377728) (owner: 10Slyngshede) [14:56:22] !log importing haproxykafka 0.2 package into apt repository (T377613) [14:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:28] T377613: Provide Debian packetization - https://phabricator.wikimedia.org/T377613 [14:57:03] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:57:08] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [14:58:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl1002.eqiad.wmnet with OS bookworm [14:58:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetserver2003.codfw.wmnet with reason: RAM expansion [14:58:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetserver2003.codfw.wmnet with reason: RAM expansion [15:00:53] (03PS1) 10Muehlenhoff: Cleanup more obsolete memcached-related IDP tooling [puppet] - 10https://gerrit.wikimedia.org/r/1084817 [15:00:56] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host aux-k8s-ctrl1002.eqiad.wmnet [15:00:57] !log elukey@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host aux-k8s-ctrl1002.eqiad.wmnet [15:02:17] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bullseye [15:02:36] FIRING: [2x] JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:25] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on an-presto[1017-1019].eqiad.wmnet with reason: reimaging the hosts to bullseye [15:05:41] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on an-presto[1017-1019].eqiad.wmnet with reason: reimaging the hosts to bullseye [15:05:59] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on an-presto1020.eqiad.wmnet with reason: reimaging the hosts to bullseye [15:06:04] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [15:06:09] (03PS1) 10Brouberol: Publish JDK8 images based on Debian Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1084818 (https://phabricator.wikimedia.org/T377928) [15:06:13] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on an-presto1020.eqiad.wmnet with reason: reimaging the hosts to bullseye [15:06:54] (03CR) 10Slyngshede: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1084817 (owner: 10Muehlenhoff) [15:07:46] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with OS bullseye [15:07:48] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10277369 (10phaultfinder) [15:11:30] (03PS1) 10Muehlenhoff: Revert "Remove puppetserver2003 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084821 [15:11:47] (03CR) 10Muehlenhoff: [C:03+2] Cleanup more obsolete memcached-related IDP tooling [puppet] - 10https://gerrit.wikimedia.org/r/1084817 (owner: 10Muehlenhoff) [15:15:25] (03CR) 10Ssingh: [C:03+1] Revert "Remove puppetserver2003 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084821 (owner: 10Muehlenhoff) [15:15:40] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove puppetserver2003 out of active service for RAM upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1084821 (owner: 10Muehlenhoff) [15:17:36] FIRING: [5x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:06] (03PS1) 10Dreamy Jazz: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) [15:18:59] (03PS1) 10Dreamy 
Jazz: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) [15:20:08] jouncebot: nowandnext [15:20:08] No deployments scheduled for the next 1 hour(s) and 39 minute(s) [15:20:08] In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1700) [15:20:29] (03CR) 10Dreamy Jazz: [C:03+2] Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:20:37] (03CR) 10Dreamy Jazz: [C:03+2] Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:22:25] FIRING: SystemdUnitFailed: sync-puppet-ca.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:32] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1017.eqiad.wmnet with OS bullseye [15:24:47] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:24:51] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:25:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:25:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:25:26] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bullseye [15:26:06] !log disable Puppet fleet-wide for puppetserver2001 maintenance [15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:02] PROBLEM - Check unit status of sync-puppet-ca on puppetserver2002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:27:10] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:27:20] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:28:21] (03PS1) 10Peter Fischer: Search update pipeline: bump flink version, fix NPE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084839 [15:29:23] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetserver2001.codfw.wmnet with reason: puppetserver2001 maintenance [15:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetserver2001.codfw.wmnet with reason: puppetserver2001 maintenance [15:29:43] (03CR) 10DCausse: [C:03+1] Search update pipeline: bump flink version, fix NPE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084839 (owner: 10Peter Fischer) [15:31:57] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:32:08] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:32:53] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1045.mgmt.eqiad.wmnet with chassis set 
policy GRACEFUL_RESTART [15:32:54] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump flink version, fix NPE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084839 (owner: 10Peter Fischer) [15:33:33] (03PS3) 10Fabfur: hiera: fix haproxykafka socket file mode using Integer type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [15:34:09] (03Merged) 10jenkins-bot: Search update pipeline: bump flink version, fix NPE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1084839 (owner: 10Peter Fischer) [15:35:14] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:35:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [15:35:29] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:35:33] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:35:49] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:38:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1045.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:39:29] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:39:31] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [15:39:41] !log re-enable Puppet fleet-wide for puppetserver2001 maintenance [15:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:47] !log re-enable Puppet fleet-wide after 
puppetserver2001 maintenance [15:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] (03PS2) 10Urbanecm: build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [15:40:31] (03PS4) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [15:41:38] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10277551 (10elukey) 05Resolved→03Open [15:42:12] 10ops-codfw, 06SRE, 06DC-Ops: codfw puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10277545 (10MoritzMuehlenhoff) 05Open→03Resolved a:05RobH→03Jhancock.wm [15:43:09] (03PS1) 10Majavah: openstack: designate: Fix proxy zone filtering logic in revdns upgrader [puppet] - 10https://gerrit.wikimedia.org/r/1084842 [15:43:24] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with OS bullseye [15:44:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1046.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:45:08] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:45:31] (03CR) 10Aleksandar Mastilovic: "I think as long as you can guarantee that the URL will be in that format, and as long as we maintain the mapping between instance names an" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:47:35] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with 
chassis set policy FORCE_RESTART [15:47:48] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:48:58] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10277606 (10elukey) [15:49:53] (03CR) 10CI reject: [V:04-1] Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:49:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10277617 (10elukey) ETOOSOON :) It seems that ganeti1044+ hosts were already provisioned, and I didn't notice an error when uploading the license to 104... [15:50:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1047.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:51:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:51:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:52:02] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:52:08] (03CR) 10Dreamy Jazz: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:52:11] (03CR) 10Dreamy Jazz: [C:03+2] Fix bug in 
BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:52:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:52:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:52:43] (03CR) 10Brouberol: "Yep, we cam guarantee that indeed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075875 (https://phabricator.wikimedia.org/T375716) (owner: 10Brouberol) [15:54:53] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:55:49] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:55:56] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:56:06] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:57:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1048.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:57:46] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10277661 (10fnegri) p:05Medium→03High [15:57:49] (03CR) 10Dreamy Jazz: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [15:57:59] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
an-presto1017.eqiad.wmnet with reason: host reimage [15:59:09] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [15:59:40] (03CR) 10Ssingh: [C:03+1] "assuming you have looked at cp3066'c PCC output, looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:01:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [16:01:28] (03Merged) 10jenkins-bot: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084831 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [16:01:29] PROBLEM - Check unit status of sync-puppet-ca on puppetserver2003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:01:32] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1017.eqiad.wmnet with reason: host reimage [16:01:33] (03CR) 10CI reject: [V:04-1] Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [16:01:56] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1084831|Fix bug in BlockManager::getUniqueBlocks (T378563)]] [16:02:06] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10277683 (10fnegri) [16:02:08] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 
[16:02:09] !log pfischer@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:02:25] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:28] !log pfischer@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:03:12] (03CR) 10Fabfur: [C:03+2] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1084793 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:04:17] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1084831|Fix bug in BlockManager::getUniqueBlocks (T378563)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:04:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1049.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:04:25] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [16:06:26] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:06:37] !log pfischer@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:06:38] !log pfischer@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:06:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:06:57] !log pfischer@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:07:06] !log pfischer@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:07:22] !log 
pfischer@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:07:29] !log pfischer@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:08:31] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:08:37] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:09:02] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084831|Fix bug in BlockManager::getUniqueBlocks (T378563)]] (duration: 07m 06s) [16:09:17] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [16:11:01] 06SRE, 10SRE-Access-Requests: Access to ops mailing list - https://phabricator.wikimedia.org/T378484#10277808 (10zoe) 05Open→03Resolved [16:11:45] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:11:55] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1019.eqiad.wmnet with OS bullseye [16:14:48] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [16:16:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1050.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:18:05] (03PS4) 10Fabfur: hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [16:19:32] (03CR) 10Volans: [C:04-1] "The code has some logic supporting 
multiple hosts but other bits supporting only a single host. Please decide in which direction we should" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [16:20:30] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): cloudcontrol2006-dev struggling with memory - https://phabricator.wikimedia.org/T370401#10277879 (10aborrero) >>! In T370401#10276958, @Jhancock.wm wrote: > I got the memory in. Is it safe to proceed with the upgrade at this time? I didn't see... [16:21:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:22:43] (03CR) 10Sergio Gimeno: "recheck" [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [16:24:33] (03CR) 10Ssingh: [C:03+1] hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:24:57] (03PS5) 10Fabfur: hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [16:25:20] (03CR) 10Dreamy Jazz: "I'll merge this later, as I won't have time to wait for this." 
[core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [16:25:42] (03CR) 10Ssingh: [C:03+1] "[single-quote should be OK but no issues, +1]" [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:25:43] (03CR) 10CI reject: [V:04-1] hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:26:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1051.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:26:54] (03CR) 10Ssingh: [C:03+1] hiera: fix haproxykafka socket file mode using String type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [16:27:26] (03CR) 10Volans: "It seems that almost all the operations are in common with the reboot/upgrade procedure for normal DBs. 
I would suggest to just make one r" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: 10Arnaudb) [16:30:41] (03PS4) 10Scott French: shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) [16:31:35] PROBLEM - Check unit status of sync-puppet-ca on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:25] FIRING: [3x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:32:36] RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:31] (03CR) 10Scott French: [C:03+2] shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:33:57] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:34:50] (03Merged) 10jenkins-bot: shellbox: upgrade to 2024-10-15-214239 (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082319 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [16:37:05] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [16:37:41] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:37:47] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: 
apply [16:38:13] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:38:19] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:38:41] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:38:47] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10278006 (10elukey) [16:38:47] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:38:50] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:38:56] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:39:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1052.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:39:25] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:39:31] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:39:57] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:42:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10278019 (10VRiley-WMF) 05Open→03In progress @bking and @MoritzMuehlenhoff We have recieved the memory and I will replace it very soon. I will update when t... [16:42:47] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10278021 (10elukey) Run provision on ganeti1045+, and fixed the ADMIN password as well. Last step is to figure out why ganeti1044's license doesn't work... 
[16:44:17] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1017.eqiad.wmnet with OS bullseye [16:51:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10278100 (10VRiley-WMF) Memory has been replaced in B4 [16:51:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10278101 (10VRiley-WMF) 05In progress→03Resolved [16:53:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - https://phabricator.wikimedia.org/T378267#10278107 (10VRiley-WMF) Swapping out the memory on A6 [16:53:26] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [16:54:11] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [16:56:11] RECOVERY - Host an-worker1165 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [16:57:59] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:58:23] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:58:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [16:59:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1700) [17:00:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:01:39] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:03:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [17:03:39] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1234 crashed - faulty memory stick on A6 (0x4E42) - 
https://phabricator.wikimedia.org/T378267#10278137 (10VRiley-WMF) 05In progress→03Resolved The memory has been swapped. [17:03:44] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [17:04:54] 10ops-codfw, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628 (10Ladsgroup) 03NEW [17:05:29] 10ops-codfw, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10278155 (10Ladsgroup) Forced a reboot via mgmt console. [17:10:07] (03PS2) 10Sergio Gimeno: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) [17:11:24] (03PS5) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [17:11:52] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [17:11:58] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [17:13:10] (03PS2) 10Sergio Gimeno: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 [17:13:17] (03CR) 10CI reject: [V:04-1] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [17:18:56] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [17:19:01] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [17:19:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:39] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission ganeti2013/ganeti2014 - https://phabricator.wikimedia.org/T378596#10278199 (10tappof) [17:20:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:20:36] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [17:20:37] (03CR) 10Majavah: [C:03+2] openstack: designate: Fix proxy zone filtering logic in revdns upgrader [puppet] - 10https://gerrit.wikimedia.org/r/1084842 (owner: 10Majavah) [17:20:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [17:21:48] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [17:21:54] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [17:23:58] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 [17:24:19] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [17:26:07] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [17:27:04] 10ops-codfw, 06DBA, 06DC-Ops: db2190 is not coming back online - https://phabricator.wikimedia.org/T378628#10278225 (10Ladsgroup) The reboot didn't work. It's still not up. 
[17:31:01] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [17:33:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:35:34] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [17:35:38] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [17:35:47] (03CR) 10Tiziano Fogli: [C:03+2] add Joely Rooke WMDE to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/1084780 (https://phabricator.wikimedia.org/T378082) (owner: 10Tiziano Fogli) [17:36:12] (03PS1) 10MusikAnimal: [CommunityRequests] disable wgCommunityRequestsEnable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) [17:36:53] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:38:25] (03CR) 10Majavah: "fwiw, the way i18n works on wikimedia wikis means that you can use the messages even if the extension is not enabled on that wiki. 
so I wo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [17:39:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:43:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'deployment' for 'Joely Rooke WMDE' - https://phabricator.wikimedia.org/T378082#10278255 (10tappof) 05Stalled→03Resolved patch merged. [17:44:45] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:44:52] (03CR) 10MusikAnimal: "Ah, interesting! Is that also true if the extension is not enabled *anywhere* on our cluster?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [17:44:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:45:19] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52775 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:45:57] (03CR) 10Majavah: "what matters is whether the extension is listed in the `extension-list` file." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [17:47:18] (03PS1) 10BCornwall: varnish: Increase RSA cert warnings to 2% of views [puppet] - 10https://gerrit.wikimedia.org/r/1084855 (https://phabricator.wikimedia.org/T370837) [17:50:01] (03PS2) 10MusikAnimal: [CommunityRequests] disable everywhere by default, including Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) [17:50:32] (03CR) 10MusikAnimal: "Brilliant! I've revised this patch accordingly. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [17:51:48] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [17:57:17] (03PS3) 10MusikAnimal: [CommunityRequests] disable wgCommunityRequestsEnable by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) [17:58:07] 06SRE, 10AbuseFilter, 06Data Products, 10Data-Services, and 8 others: Public wiki replicas contain abuse filter logs for filters that are private or protected - https://phabricator.wikimedia.org/T375751#10278291 (10fnegri) 05In progress→03Resolved I did manually run `maintain-views --all-databases... [17:58:47] (03CR) 10Ssingh: [C:03+1] "Looks good, 🚢 it." [puppet] - 10https://gerrit.wikimedia.org/r/1084855 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [17:59:03] (03CR) 10MusikAnimal: "Actually, I realized that we'll still want the feature flag to control rollout of specific bits of the extension. 
We're migrating from a g" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084853 (https://phabricator.wikimedia.org/T366194) (owner: 10MusikAnimal) [18:00:04] dduvall and dancy: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T1800). [18:00:21] o/ [18:00:30] i'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1084759 prior to train today [18:01:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084759 (https://phabricator.wikimedia.org/T378531) (owner: 10Ammarpad) [18:05:59] (03PS6) 10Sergio Gimeno: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) [18:06:12] (03CR) 10BCornwall: [C:03+2] varnish: Increase RSA cert warnings to 2% of views [puppet] - 10https://gerrit.wikimedia.org/r/1084855 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [18:08:57] (03PS1) 10Ladsgroup: db2190: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/1084857 (https://phabricator.wikimedia.org/T378628) [18:10:05] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nrwiki (T375101) [18:10:11] T375101: Prepare and check storage layer for nrwiki - https://phabricator.wikimedia.org/T375101 [18:10:18] (03CR) 10Ladsgroup: [C:03+2] db2190: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/1084857 (https://phabricator.wikimedia.org/T378628) (owner: 10Ladsgroup) [18:19:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:29] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:21:01] (03Merged) 10jenkins-bot: Revert "Use array instead of string for class list" [skins/MinervaNeue] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084759 (https://phabricator.wikimedia.org/T378531) (owner: 10Ammarpad) [18:21:29] !log dduvall@deploy2002 Started scap sync-world: Backport for [[gerrit:1084759|Revert "Use array instead of string for class list" (T378531)]] [18:21:34] T378531: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T378531 [18:23:54] (03PS6) 10Scott French: shellbox-syntaxhighlight: add "migration" in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081266 (https://phabricator.wikimedia.org/T375243) [18:23:54] !log dduvall@deploy2002 ammarpad, dduvall: Backport for [[gerrit:1084759|Revert "Use array instead of string for class list" (T378531)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:24:12] (03PS3) 10Scott French: shellbox: add migration release (all) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082572 (https://phabricator.wikimedia.org/T375243) [18:27:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084774 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [18:27:16] !log monitoring testwiki error rates for a few minutes to see if the error related to T378531 subsides (current rate is 23 errors in the last 15 minutes) [18:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:21] T378531: PHP Notice: 
Array to string conversion - https://phabricator.wikimedia.org/T378531 [18:27:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [18:27:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 (owner: 10Sergio Gimeno) [18:27:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [18:28:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): an-worker1165: Broken RAM - https://phabricator.wikimedia.org/T378454#10278438 (10bking) On the DPE side, I've confirmed that the host is back up and part of the cluster using [[ https://wikitech.wikimedia.org/wiki/Data_Platfo... 
[18:28:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [18:35:39] !log error is still occurring following backport deployment of https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1084759 (T378531) [18:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:46] T378531: PHP Notice: Array to string conversion - https://phabricator.wikimedia.org/T378531 [18:35:55] !log dduvall@deploy2002 ammarpad, dduvall: Continuing with sync [18:35:56] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database nrwiki (T375101) [18:36:17] T375101: Prepare and check storage layer for nrwiki - https://phabricator.wikimedia.org/T375101 [18:39:44] !log bking@stat1008,stat1009,stat1010.mgmt racadm jobqueue delete -i $job T376813 [18:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] T376813: Implement non-cgroups-related performance optimizations on stat hosts - https://phabricator.wikimedia.org/T376813 [18:40:34] !log dduvall@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084759|Revert "Use array instead of string for class list" (T378531)]] (duration: 19m 04s) [18:49:58] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) [18:50:42] (03CR) 10CI reject: [V:04-1] GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [18:57:38] 10SRE-swift-storage, 10Observability-Metrics: Capacity 
planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#10278605 (10lmata) p:05Triage→03Medium [19:02:41] (03PS6) 10Scott French: trafficserver: Lua script for routing 8.1-enrolled traffic [puppet] - 10https://gerrit.wikimedia.org/r/1072821 (https://phabricator.wikimedia.org/T377042) [19:02:56] (03PS5) 10Sergio Gimeno: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) [19:07:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:07:29] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:07] alright. continuing with train [19:09:36] (the errors related to https://phabricator.wikimedia.org/T378531 have subsided) [19:09:49] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084870 (https://phabricator.wikimedia.org/T375660) [19:09:50] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084870 (https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [19:09:55] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:23] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52777 bytes in 3.789 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:25] (03PS1) 10Ladsgroup: tables-catalog: Add CreditSource tables [puppet] - 10https://gerrit.wikimedia.org/r/1084871 (https://phabricator.wikimedia.org/T363581) [19:10:39] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084870 
(https://phabricator.wikimedia.org/T375660) (owner: 10TrainBranchBot) [19:10:45] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:10:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:48] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add CreditSource tables [puppet] - 10https://gerrit.wikimedia.org/r/1084871 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [19:17:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:39] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.1 refs T375660 [19:17:45] T375660: 1.44.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T375660 [19:18:42] (03PS6) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 [19:19:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: Maintenance [19:20:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: Maintenance [19:20:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2115 (T376905)', diff saved to https://phabricator.wikimedia.org/P70714 and previous config saved to /var/cache/conftool/dbconfig/20241030-192011-ladsgroup.json [19:21:55] (03CR) 10Dzahn: "I am not sure why: Error: Could not call 'find' on 'catalog': Evaluation Error: Operator '[]' is not applicable to an Undef Value. 
(file: " [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:27:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T376905)', diff saved to https://phabricator.wikimedia.org/P70715 and previous config saved to /var/cache/conftool/dbconfig/20241030-192744-ladsgroup.json [19:30:13] (03PS1) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [19:32:43] (03CR) 10Cwhite: [C:03+1] "Worth a shot!" [puppet] - 10https://gerrit.wikimedia.org/r/1084199 (https://phabricator.wikimedia.org/T359293) (owner: 10Herron) [19:33:22] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [19:34:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [19:34:56] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [19:35:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [19:35:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [19:36:10] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [19:36:34] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [19:37:01] RECOVERY - Check unit status of sync-puppet-ca on puppetserver2002 is OK: OK: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:37:24] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [19:37:25] RESOLVED: [3x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:48] !log gitlab - deleting user 
"jfk" on main server and both replicas T376936 [19:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [19:39:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [19:40:16] !log all shellbox instances updated to shellbox 2024-10-15-214239 - T375243 [19:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:20] T375243: Turn up PHP 8.1 Shellbox deployments - https://phabricator.wikimedia.org/T375243 [19:41:29] RECOVERY - Check unit status of sync-puppet-ca on puppetserver2003 is OK: OK: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:41:35] RECOVERY - Check unit status of sync-puppet-ca on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:42:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P70716 and previous config saved to /var/cache/conftool/dbconfig/20241030-194251-ladsgroup.json [19:44:11] (03CR) 10Scott French: "Hugh, since you previously reviewed I9b804f60b3e83a516ebbd5d552ce3cfb3e61f36e, could I ask you to review this and the next patch as well a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1081266 (https://phabricator.wikimedia.org/T375243) (owner: 10Scott French) [19:48:54] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T378646 (10phaultfinder) 03NEW [19:51:46] (03CR) 10Dzahn: [C:04-1] "is it known that "mx1001.wikimedia.org is unknown in Netbox"?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:56:08] (03CR) 10JHathaway: "yes, the exim servers have been decommed, sorry for not pulling them from site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [19:57:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [19:57:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P70717 and previous config saved to /var/cache/conftool/dbconfig/20241030-195758-ladsgroup.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T2000). [20:00:05] sergi0, edsanders, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] o/ [20:00:32] (03PS2) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [20:00:51] Present [20:01:03] o/ [20:01:08] (03Abandoned) 10Dzahn: site/mx: move interface::alias out of site.pp to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1082266 (owner: 10Dzahn) [20:01:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:01:36] Apologies for the long chain of changes upfront, I had to make CI happy for the relevant patch. 
I'm fine if edsanders tgr changes go first and we do as far as we get for mine [20:02:02] (03PS1) 10Scardenasmolinar: Enable AutoModerator on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) [20:03:30] sergi0: only one of those patches actually needs testing, right? [20:03:53] correct, 1084181 and the config [20:04:14] hopefully scap-backport is smart enough to deploy them all together [20:04:27] (03CR) 10Gergő Tisza: [C:03+2] Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084774 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [20:04:31] (03CR) 10Gergő Tisza: [C:03+2] Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [20:04:32] (03PS3) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [20:04:34] (03CR) 10Gergő Tisza: [C:03+2] Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 (owner: 10Sergio Gimeno) [20:05:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:06:20] (03CR) 10Gergő Tisza: [C:03+2] build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [20:07:51] sergi0: does the config patch go together with the rest, or can it go out separately? 
[20:08:09] needs to go after 1084181 [20:08:13] (03CR) 10Gergő Tisza: [C:03+2] HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [20:08:44] cool, I'll deploy the other config patch first then [20:09:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083848 (https://phabricator.wikimedia.org/T377990) (owner: 10Esanders) [20:10:08] (03PS2) 10Scott French: mediawiki: parameterize PHP version via chart value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) [20:11:07] (03Merged) 10jenkins-bot: Set Flow to read-only on nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083848 (https://phabricator.wikimedia.org/T377990) (owner: 10Esanders) [20:11:36] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1083848|Set Flow to read-only on nowiki (T377990)]] [20:11:41] T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990 [20:12:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10278874 (10wiki_willy) Thanks so much @jcrespo, I appreciate your flexibility and patience on this. >>! In T371416#10276034, @jcrespo wrote: > @wiki... 
[20:12:27] (03PS3) 10Scott French: mediawiki: parameterize PHP version via chart value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071957 (https://phabricator.wikimedia.org/T372604) [20:13:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T376905)', diff saved to https://phabricator.wikimedia.org/P70718 and previous config saved to /var/cache/conftool/dbconfig/20241030-201305-ladsgroup.json [20:13:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [20:13:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: Maintenance [20:13:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2131 (T376905)', diff saved to https://phabricator.wikimedia.org/P70719 and previous config saved to /var/cache/conftool/dbconfig/20241030-201331-ladsgroup.json [20:16:12] (03CR) 10Gergő Tisza: [C:03+2] Increase log level for autocreation callback [extensions/CentralAuth] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084812 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [20:16:28] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650 (10Eevans) 03NEW [20:16:31] (03PS2) 10Dzahn: aphlict: create system user with systemd:sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) [20:16:38] !log tgr@deploy2002 esanders, tgr: Backport for [[gerrit:1083848|Set Flow to read-only on nowiki (T377990)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:43] T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990 [20:17:00] edsanders: do you want to test it? 
[20:17:45] 06SRE-OnFire, 10Incident Tooling: Corto: Bot needs a registered nick - https://phabricator.wikimedia.org/T378650#10278902 (10Eevans) [20:19:12] tgr|away: looking [20:20:13] looks good to me on mwdebug1001 [20:20:18] !log tgr@deploy2002 esanders, tgr: Continuing with sync [20:20:59] (03CR) 10Dzahn: "In this case there is no rsync involved. It doesn't have to be the same UID on all machines, but we still want to avoid puppet errors like" [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [20:21:12] (03PS3) 10Cwhite: Profiler: centralize metrics send to a function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081460 [20:21:12] (03PS2) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [20:23:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T376905)', diff saved to https://phabricator.wikimedia.org/P70720 and previous config saved to /var/cache/conftool/dbconfig/20241030-202315-ladsgroup.json [20:23:16] sergi0: should the GE and config patches be synced together, or code first, config second? 
[20:23:17] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:23:30] code first, config second [20:23:41] (03PS4) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [20:24:15] PROBLEM - Check health of redis instance on 6379 on rdb1014 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 https://wikitech.wikimedia.org/wiki/Redis [20:24:58] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1083848|Set Flow to read-only on nowiki (T377990)]] (duration: 13m 21s) [20:25:11] T377990: [Config] Set Flow to "read-only" at all Phase 0 wikis - https://phabricator.wikimedia.org/T377990 [20:25:34] (03PS1) 10Jsn.sherman: Translations for configuration for same-user-same-page reverts in Automoderator [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084889 (https://phabricator.wikimedia.org/T370795) [20:26:06] I'm just shadowing here, but – looks like Flow is behaving when I turn off WikimediaDebug [20:26:40] (ie lgtm) [20:26:56] yep - thanks tgr [20:27:47] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:54] (03Merged) 10jenkins-bot: Set username in user mock and reset state after test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084774 (https://phabricator.wikimedia.org/T378573) (owner: 10Sergio Gimeno) [20:28:57] (03Merged) 10jenkins-bot: Fix and re-enable selenium test [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084775 (https://phabricator.wikimedia.org/T378581) (owner: 10Sergio Gimeno) [20:28:59] (03Merged) 10jenkins-bot: Fix selenium test loading the wrong talk page [extensions/Wikibase] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084776 (owner: 10Sergio Gimeno) [20:29:17] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [20:30:15] 
PROBLEM - Check health of redis instance on 6379 on rdb1014 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 https://wikitech.wikimedia.org/wiki/Redis [20:30:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084889 (https://phabricator.wikimedia.org/T370795) (owner: 10Jsn.sherman) [20:30:42] (03Merged) 10jenkins-bot: build: Suppress phan issue with null for Message::numParams [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084185 (owner: 10Umherirrender) [20:30:43] (03Merged) 10jenkins-bot: HomepageHooks: do not store assigned variant on account creation [extensions/GrowthExperiments] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084181 (https://phabricator.wikimedia.org/T377713) (owner: 10Sergio Gimeno) [20:30:44] (03Merged) 10jenkins-bot: Increase log level for autocreation callback [extensions/CentralAuth] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084812 (https://phabricator.wikimedia.org/T378289) (owner: 10Gergő Tisza) [20:33:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:34:31] FIRING: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [20:36:07] jouncebot: nowandnext [20:36:08] For the next 0 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T2000) [20:36:08] In 0 hour(s) and 23 minute(s): Wikifunctions Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T2100) [20:36:49] (03CR) 10Dreamy Jazz: [C:03+2] Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [20:37:18] Going to deploy now if that's okay [20:37:56] Need to wait for gate-and-submit-wmf to finish first, but appears that no-one is currently deploying [20:38:16] Though I see some merges for wmf.1 [20:38:17] tgr|away: are changes already synced? [20:38:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P70721 and previous config saved to /var/cache/conftool/dbconfig/20241030-203822-ladsgroup.json [20:40:12] (03PS5) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [20:40:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [20:41:40] sergi0: They won't have been as they just merged. I don't see the start of scap backport either. [20:41:52] sergi0: uh, sorry, I wasn't paying attention and the script got stuck [20:42:26] I merged the CentralAuth patch too quickly. I'm still used to manual scap I guess. [20:42:42] My change is ETA 28 mins away from merging, so should be ready once you've done. [20:42:52] Will have to deploy that together with the GE patch. It's trivial so should be fine. 
[20:43:32] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1084774|Set username in user mock and reset state after test (T378573)]], [[gerrit:1084775|Fix and re-enable selenium test (T378581)]], [[gerrit:1084776|Fix selenium test loading the wrong talk page]], [[gerrit:1084185|build: Suppress phan issue with null for Message::numParams]], [[gerrit:1084181|HomepageHooks: do not store assigned variant on account cre [20:43:32] ation (T377713)]] [20:43:38] T378573: Wikibase CI blocked by errors in SkinAfterPortletHandlerTest and ChangesListSpecialPageHookHandlerTest - https://phabricator.wikimedia.org/T378573 [20:43:39] T378581: Re-enable browser test in repo/tests/selenium/specs/item.js - https://phabricator.wikimedia.org/T378581 [20:43:39] T377713: Do not call ExperimentUserManager::setVariant on all newly registered accounts - https://phabricator.wikimedia.org/T377713 [20:44:18] (03PS6) 10Fabfur: hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) [20:45:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [20:45:37] (03CR) 10Fabfur: hiera: fix haproxykafka socket file mode using String type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [20:45:53] !log tgr@deploy2002 sgimeno, umherirrender, tgr: Backport for [[gerrit:1084774|Set username in user mock and reset state after test (T378573)]], [[gerrit:1084775|Fix and re-enable selenium test (T378581)]], [[gerrit:1084776|Fix selenium test loading the wrong talk page]], [[gerrit:1084185|build: Suppress phan issue with null for Message::numParams]], [[gerrit:1084181|HomepageHooks: do not store assigned variant on account [20:45:53] creation (T377713)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:46:02] (03PS1) 10Jsn.sherman: Add follow-up message [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084891 (https://phabricator.wikimedia.org/T372476) [20:46:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/AutoModerator] (wmf/1.44.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1084891 (https://phabricator.wikimedia.org/T372476) (owner: 10Jsn.sherman) [20:48:48] (03CR) 10Ssingh: [C:03+1] hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [20:50:56] (03CR) 10Fabfur: [C:03+2] hiera: fix haproxykafka socket file mode using String type [puppet] - 10https://gerrit.wikimedia.org/r/1084811 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [20:51:00] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:51:13] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:53:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131', diff saved to https://phabricator.wikimedia.org/P70722 and previous config saved to /var/cache/conftool/dbconfig/20241030-205329-ladsgroup.json [20:53:35] sergi0: does it look OK? 
[20:54:06] tgr|away: checking now [20:56:57] ok on my side [20:57:14] !log tgr@deploy2002 sgimeno, umherirrender, tgr: Continuing with sync [20:57:17] thx [20:58:22] you can set a hilight on your gerrit username, that's what logmsgbot tries to ping when it finishes deploying to the debug host [20:58:28] (03PS1) 10Fabfur: haproxykafka: ensure directories are removed when ensure=>absent [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) [20:59:28] a highlight? [20:59:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084893 (https://phabricator.wikimedia.org/T377614) (owner: 10Fabfur) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241030T2100) [21:00:54] most IRC clients have some feature for specifying strings which are treated like your nickname for pinging purposes [21:01:04] I guess hilight is irssi terminology [21:01:51] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084774|Set username in user mock and reset state after test (T378573)]], [[gerrit:1084775|Fix and re-enable selenium test (T378581)]], [[gerrit:1084776|Fix selenium test loading the wrong talk page]], [[gerrit:1084185|build: Suppress phan issue with null for Message::numParams]], [[gerrit:1084181|HomepageHooks: do not store assigned variant on account cr [21:01:51] eation (T377713)]] (duration: 18m 18s) [21:01:56] T378573: Wikibase CI blocked by errors in SkinAfterPortletHandlerTest and ChangesListSpecialPageHookHandlerTest - https://phabricator.wikimedia.org/T378573 [21:01:57] T378581: Re-enable browser test in repo/tests/selenium/specs/item.js - https://phabricator.wikimedia.org/T378581 [21:01:57] T377713: Do not call ExperimentUserManager::setVariant on all newly registered accounts - https://phabricator.wikimedia.org/T377713 [21:03:42] we have a small error spike of "Cannot execute Wikimedia\Rdbms\Database::commit 
critical section while session state is out of sync. [21:03:58] probably unrelated? [21:04:26] IPInfo? [21:04:39] seems api.php purge requests mostly [21:05:31] the stack trace is not very useful, just deferred callbacks [21:06:08] anyway it was short so probably not related to the backports [21:06:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [21:06:59] (03Merged) 10jenkins-bot: GrowthExperiments: enable community updates module in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081104 (https://phabricator.wikimedia.org/T374664) (owner: 10Sergio Gimeno) [21:07:25] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1081104|GrowthExperiments: enable community updates module in pilot wikis (T374664)]] [21:07:29] T374664: T374577: Community Updates module: Release to Growth Pilot Wikipedias - https://phabricator.wikimedia.org/T374664 [21:08:04] (03Merged) 10jenkins-bot: Fix bug in BlockManager::getUniqueBlocks [core] (wmf/1.43.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1084834 (https://phabricator.wikimedia.org/T378563) (owner: 10Dreamy Jazz) [21:08:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T376905)', diff saved to https://phabricator.wikimedia.org/P70723 and previous config saved to /var/cache/conftool/dbconfig/20241030-210836-ladsgroup.json [21:08:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: Maintenance [21:08:48] found the highlight setting, thanks tgr|away [21:08:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: Maintenance [21:09:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2191 (T376905)', diff saved to 
https://phabricator.wikimedia.org/P70724 and previous config saved to /var/cache/conftool/dbconfig/20241030-210902-ladsgroup.json [21:09:43] !log tgr@deploy2002 tgr, sgimeno: Backport for [[gerrit:1081104|GrowthExperiments: enable community updates module in pilot wikis (T374664)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:56] sergi0: ^ [21:10:06] checking [21:11:06] Dreamy_Jazz: want to add your patch to the wiki page? [21:12:15] tgr|away: ok on mwdebug1001 [21:12:48] !log tgr@deploy2002 tgr, sgimeno: Continuing with sync [21:13:06] Sure [21:14:41] should I deploy it or do you want to self-deploy? [21:15:02] I can self-deploy [21:15:55] 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Evaluate hw-raid controllers for Supermicro's Config J - https://phabricator.wikimedia.org/T378584#10279192 (10wiki_willy) Meeting set with Supermicro team on October 31 at 3pm UTC, to discuss the proposed RAID controller opti... [21:17:35] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1081104|GrowthExperiments: enable community updates module in pilot wikis (T374664)]] (duration: 10m 10s) [21:17:40] T374664: T374577: Community Updates module: Release to Growth Pilot Wikipedias - https://phabricator.wikimedia.org/T374664 [21:17:57] tgr|away: Are you done? [21:18:04] yeah, all yours [21:18:08] Thanks [21:18:19] thank you for all the support tgr|away !
[21:18:36] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1084834|Fix bug in BlockManager::getUniqueBlocks (T378563)]] [21:18:44] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [21:21:00] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1084834|Fix bug in BlockManager::getUniqueBlocks (T378563)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:04] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [21:25:58] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084834|Fix bug in BlockManager::getUniqueBlocks (T378563)]] (duration: 07m 22s) [21:26:03] T378563: Error: Call to undefined method MediaWiki\Extension\GlobalBlocking\GlobalBlock::getParentBlockId() - https://phabricator.wikimedia.org/T378563 [21:26:24] (03PS6) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [21:28:15] (03PS9) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [21:28:43] (03CR) 10Fabfur: haproxy: add ring support to haproxy configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [21:28:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:29:38] (03PS10) 10Fabfur: haproxy: add ring support to haproxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) [21:30:45] (03CR) 10Fabfur: haproxy: add ring support to haproxy configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1084113 
(https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [21:32:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084113 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [21:35:31] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:45] (03PS7) 10Bking: elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) [21:38:10] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:40:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:40:33] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [21:42:26] (03CR) 10Ryan Kemper: [C:03+1] elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:42:41] (03CR) 10Bking: [C:03+2] elasticsearch: use the correct port for snapshot monitor [puppet] - 10https://gerrit.wikimedia.org/r/1084874 (https://phabricator.wikimedia.org/T357146) (owner: 10Bking) [21:50:05] (03PS2) 10Ryan Kemper: stat hosts: guarantee minimum RAM% for system processes [puppet] - 10https://gerrit.wikimedia.org/r/1083815 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [21:50:31] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1083815 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [21:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [21:55:50] (03PS3) 10Ryan Kemper: 
stat hosts: guarantee minimum RAM% for system procs [puppet] - 10https://gerrit.wikimedia.org/r/1083815 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [21:59:02] (03CR) 10Bking: [C:03+2] stat hosts: guarantee minimum RAM% for system procs [puppet] - 10https://gerrit.wikimedia.org/r/1083815 (https://phabricator.wikimedia.org/T377734) (owner: 10Bking) [22:03:40] !log Running ./redis-check-aof --fix on rdb1014 tcp_6379 instance - T376961 [22:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:45] T376961: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T376961 [22:05:15] RECOVERY - Check health of redis instance on 6379 on rdb1014 is OK: OK: REDIS 6.0.16 on 127.0.0.1:6379 has 1 databases (db0) with 4126217 keys, up 56 seconds https://wikitech.wikimedia.org/wiki/Redis [22:09:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191 (T376905)', diff saved to https://phabricator.wikimedia.org/P70725 and previous config saved to /var/cache/conftool/dbconfig/20241030-220928-ladsgroup.json [22:09:31] RESOLVED: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [22:24:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P70726 and previous config saved to /var/cache/conftool/dbconfig/20241030-222435-ladsgroup.json [22:24:54] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667 (10AntiCompositeNumber) 03NEW [22:29:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for 
host ms-be2083.codfw.wmnet with OS bullseye [22:29:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye [22:31:25] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10279376 (10Reedy) [22:35:51] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10279409 (10Reedy) `lang=irc [09:30:47] Hi, is there a problem with irc.wikimedia.org ? My bot can't connect on this server IR... [22:39:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [22:39:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P70727 and previous config saved to /var/cache/conftool/dbconfig/20241030-223942-ladsgroup.json [22:40:21] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye [22:43:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279452 (10Papaul) It looks like my number 3 conclusion was the solution for the other missing 12 disk. In BIOS mode you can not see the utility, you nee... 
[22:48:03] 06SRE, 10Huggle, 06Infrastructure-Foundations: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade - https://phabricator.wikimedia.org/T378667#10279455 (10AntiCompositeNumber) I captured the conversation between huggle and irc003.wm.o in Wireshark. Huggle connects, sends CAP, U... [22:54:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2191 (T376905)', diff saved to https://phabricator.wikimedia.org/P70728 and previous config saved to /var/cache/conftool/dbconfig/20241030-225449-ladsgroup.json [22:54:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: Maintenance [22:55:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: Maintenance [22:55:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2215 (T376905)', diff saved to https://phabricator.wikimedia.org/P70729 and previous config saved to /var/cache/conftool/dbconfig/20241030-225520-ladsgroup.json [22:58:24] (03CR) 10BCornwall: "It does not include mail traffic - truthfully, I don't know if this is something that should apply to mail servers or not since compatibil" [puppet] - 10https://gerrit.wikimedia.org/r/1075604 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [23:17:36] FIRING: [4x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:44:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2083.codfw.wmnet with OS bullseye [23:44:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279513 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2083.codfw.wmnet with OS bullseye executed... [23:44:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2215 (T376905)', diff saved to https://phabricator.wikimedia.org/P70730 and previous config saved to /var/cache/conftool/dbconfig/20241030-234453-ladsgroup.json [23:53:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2081.codfw.wmnet with OS bullseye [23:53:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10279527 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye executed...