[00:02:25] (CR) RLazarus: [C: +2] admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[00:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P55273 and previous config saved to /var/cache/conftool/dbconfig/20240123-000303-ladsgroup.json
[00:04:00] PROBLEM - Check systemd state on graphite1005 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P55274 and previous config saved to /var/cache/conftool/dbconfig/20240123-001810-ladsgroup.json
[00:18:36] (PS1) Scott French: admin: add swfrench to sre-admins [puppet] - https://gerrit.wikimedia.org/r/992253 (https://phabricator.wikimedia.org/T355618)
[00:21:12] (CR) Scott French: "Thank you in advance!" [puppet] - https://gerrit.wikimedia.org/r/992253 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[00:21:15] (CR) RLazarus: [C: +2] admin: add swfrench to sre-admins [puppet] - https://gerrit.wikimedia.org/r/992253 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[00:29:40] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T352010)', diff saved to https://phabricator.wikimedia.org/P55275 and previous config saved to /var/cache/conftool/dbconfig/20240123-003316-ladsgroup.json
[00:33:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[00:33:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:33:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[00:33:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P55276 and previous config saved to /var/cache/conftool/dbconfig/20240123-003338-ladsgroup.json
[00:34:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:34:40] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:39:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992178
[00:39:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992178 (owner: TrainBranchBot)
[00:42:01] !log running 'zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=viwiki --fix' in screen
[00:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:22] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:44:32] (CR) Andrew Bogott: [C: +1] "looks great, thanks for the cleanup" [puppet] - https://gerrit.wikimedia.org/r/992208 (https://phabricator.wikimedia.org/T355418) (owner: Majavah)
[00:45:24] (CR) Andrew Bogott: [C: +1] galera: Fix deployment name access [alerts] - https://gerrit.wikimedia.org/r/992205 (owner: Majavah)
[00:49:49] (PS1) Eevans: cassandra-dev2001: canary dev version of Cassandra [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469)
[00:51:32] (PS2) Eevans: cassandra-dev2001: canary dev version of Cassandra [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469)
[00:52:07] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[00:52:08] SRE-swift-storage, UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610 (Bugreporter)
[00:55:47] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=cywiki --fix # T350889
[00:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:57] T350889: Run maintenance script to fix BBC:* titles in all wikis following set up of Toba Batak Wikipedia - https://phabricator.wikimedia.org/T350889
[00:56:27] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=enwiki --fix # T350889
[00:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:22] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=fiwiki --fix # T350889
[00:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:29] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=fiwikinews --fix # T350889
[00:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:35] SRE, LDAP-Access-Requests: Grant Access to wmf, sre-admins for swfrench - https://phabricator.wikimedia.org/T355618 (Scott_French) In progress→Resolved Many thanks to @RLazarus for working through this with me (e.g., making LDAP changes, submitting / applying Puppet changes).
[00:58:04] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=ruwikinews --fix # T350889
[00:58:06] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Rmaung) I approve!
[00:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:51] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992178 (owner: TrainBranchBot)
[01:03:48] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355630 (phaultfinder)
[01:04:01] (CR) Andrew Bogott: "I'm fine with this but can it wait until March? We're going to have plenty of "you broke my workflow" conversations in February already." [puppet] - https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: Majavah)
[01:05:43] (CR) Andrew Bogott: [C: +2] cloud-vps puppet encapi: use project_id instead of project_name for keystone [puppet] - https://gerrit.wikimedia.org/r/988051 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[01:14:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[01:14:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[01:14:30] (PS1) Zabe: foreachwikiindblist: Return early when no arg is passed [puppet] - https://gerrit.wikimedia.org/r/992263
[01:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P55277 and previous config saved to /var/cache/conftool/dbconfig/20240123-011434-ladsgroup.json
[01:14:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:15:17] (PS2) Zabe: foreachwikiindblist: Return early when no arg is passed [puppet] - https://gerrit.wikimedia.org/r/992263
[01:15:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:15] (CR) CI reject: [V: -1] foreachwikiindblist: Return early when no arg is passed [puppet] - https://gerrit.wikimedia.org/r/992263 (owner: Zabe)
[01:19:42] SRE, ops-esams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (ssingh)
[01:20:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:22:12] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_MachineVision_prioritize_uncategorized.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:36] PROBLEM - MariaDB Replica Lag: s7 on db1170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:30:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:30:22] (PS3) Eevans: cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469)
[01:30:24] (PS3) Eevans: cassandra-dev2001: canary dev version of Cassandra [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469)
[01:32:28] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[01:33:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:22] (PS4) Eevans: cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469)
[01:35:24] (PS4) Eevans: cassandra-dev2001: canary dev version of Cassandra [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469)
[01:37:45] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[02:04:36] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:45] SRE, ops-esams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (Papaul)
[02:26:14] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Papaul)
[02:39:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:44] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0300)
[03:07:58] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.15 [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992179 (https://phabricator.wikimedia.org/T354433)
[03:08:04] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.15 [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992179 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[03:09:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:25:18] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.15 [core] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992179 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0400)
[04:01:29] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992273 (https://phabricator.wikimedia.org/T354433)
[04:01:31] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992273 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[04:02:15] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/992273 (https://phabricator.wikimedia.org/T354433) (owner: TrainBranchBot)
[04:02:44] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.15 refs T354433
[04:02:55] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[04:32:22] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:38:26] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[04:54:07] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.15 refs T354433 (duration: 51m 22s)
[04:54:11] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433
[05:17:12] RECOVERY - MariaDB Replica Lag: s7 on db1170 is OK: OK slave_sql_lag Replication lag: 39.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:59:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[05:59:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[06:00:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[06:00:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[06:00:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[06:01:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[06:01:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[06:01:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[06:01:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T354336)', diff saved to https://phabricator.wikimedia.org/P55278 and previous config saved to /var/cache/conftool/dbconfig/20240123-060127-marostegui.json
[06:01:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:02:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T354336)', diff saved to https://phabricator.wikimedia.org/P55279 and previous config saved to /var/cache/conftool/dbconfig/20240123-060237-marostegui.json
[06:17:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P55280 and previous config saved to /var/cache/conftool/dbconfig/20240123-061744-marostegui.json
[06:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P55281 and previous config saved to /var/cache/conftool/dbconfig/20240123-063250-marostegui.json
[06:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P55282 and previous config saved to /var/cache/conftool/dbconfig/20240123-064502-ladsgroup.json
[06:45:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:47:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T354336)', diff saved to https://phabricator.wikimedia.org/P55283 and previous config saved to /var/cache/conftool/dbconfig/20240123-064757-marostegui.json
[06:47:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:48:02] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:48:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:48:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T354336)', diff saved to https://phabricator.wikimedia.org/P55284 and previous config saved to /var/cache/conftool/dbconfig/20240123-064819-marostegui.json
[06:50:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T354336)', diff saved to https://phabricator.wikimedia.org/P55285 and previous config saved to /var/cache/conftool/dbconfig/20240123-065029-marostegui.json
[06:55:20] (PS1) Marostegui: db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992353 (https://phabricator.wikimedia.org/T354506)
[06:56:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1231', diff saved to https://phabricator.wikimedia.org/P55287 and previous config saved to /var/cache/conftool/dbconfig/20240123-065606-marostegui.json
[06:56:21] (PS2) Kosta Harlan: PreAuthenticationProvider: Allow blocking account creation based on IP reputation [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928)
[06:56:56] (CR) Marostegui: [C: +2] db1231: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992353 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:57:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1231.eqiad.wmnet with OS bookworm
[06:58:39] (PS1) Kosta Harlan: PreAuthenticationProvider: Allow blocking account creation based on IP reputation [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928)
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0700)
[07:00:05] kormat, marostegui, and Amir1: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0700). Please do the needful.
[07:00:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P55288 and previous config saved to /var/cache/conftool/dbconfig/20240123-070008-ladsgroup.json
[07:01:53] (CR) CI reject: [V: -1] PreAuthenticationProvider: Allow blocking account creation based on IP reputation [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: Kosta Harlan)
[07:03:29] (CR) CI reject: [V: -1] PreAuthenticationProvider: Allow blocking account creation based on IP reputation [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: Kosta Harlan)
[07:05:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P55289 and previous config saved to /var/cache/conftool/dbconfig/20240123-070535-marostegui.json
[07:10:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[07:13:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1231.eqiad.wmnet with reason: host reimage
[07:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P55290 and previous config saved to /var/cache/conftool/dbconfig/20240123-071515-ladsgroup.json
[07:19:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:20:07] (PS5) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[07:20:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P55291 and previous config saved to /var/cache/conftool/dbconfig/20240123-072041-marostegui.json
[07:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:21:48] (CR) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:27:43] (PS1) Marostegui: Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992124
[07:29:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:29:23] (CR) Marostegui: [C: +2] Revert "db1231: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992124 (owner: Marostegui)
[07:30:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T352010)', diff saved to https://phabricator.wikimedia.org/P55292 and previous config saved to /var/cache/conftool/dbconfig/20240123-073021-ladsgroup.json
[07:30:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[07:30:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55293 and previous config saved to /var/cache/conftool/dbconfig/20240123-073033-root.json
[07:30:42] SRE-swift-storage, UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610 (Bugreporter)
[07:31:11] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Bugreporter)
[07:32:13] (CR) Ayounsi: [C: +1] cr-labs: Add temporary term for cloudrabbit hosts [homer/public] - https://gerrit.wikimedia.org/r/992245 (https://phabricator.wikimedia.org/T345610) (owner: Majavah)
[07:32:31] (CR) Ayounsi: [C: +1] cr-labs: Remove temporary openstack-apis rule [homer/public] - https://gerrit.wikimedia.org/r/992244 (owner: Majavah)
[07:34:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1231.eqiad.wmnet with OS bookworm
[07:35:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T354336)', diff saved to https://phabricator.wikimedia.org/P55294 and previous config saved to /var/cache/conftool/dbconfig/20240123-073548-marostegui.json
[07:35:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:35:53] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:36:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T354336)', diff saved to https://phabricator.wikimedia.org/P55295 and previous config saved to /var/cache/conftool/dbconfig/20240123-073610-marostegui.json
[07:38:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T354336)', diff saved to https://phabricator.wikimedia.org/P55296 and previous config saved to /var/cache/conftool/dbconfig/20240123-073821-marostegui.json
[07:40:15] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:01] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:05] (PS6) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[07:41:11] PROBLEM - Host ps1-f8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:15] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:23] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:25] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:44] (PS7) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[07:41:51] XioNoX: is that expected? ^
[07:42:07] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:07] PROBLEM - Host ps1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:07] PROBLEM - Host ps1-e6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:07] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:07] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:07] PROBLEM - Host ps1-f6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:08] PROBLEM - Host ps1-e5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:08] PROBLEM - Host ps1-f7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:09] PROBLEM - Host ps1-f3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:09] nop :)
[07:42:13] PROBLEM - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:13] PROBLEM - Host ps1-f1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:14] \o/
[07:42:15] looks like mgmt went down
[07:42:17] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:23] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:29] PROBLEM - Host ps1-e7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:35] PROBLEM - Host ps1-f5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:17] oob is down too
[07:43:21] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:43:21] (CR) Andrea Denisse: "I read on the Wiki that authdns is going to use UID/GID 928." [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:43:35] nothing we can do until dcops wakes up
[07:43:56] (CR) Andrea Denisse: [C: +2] grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:43:59] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:44:06] (CR) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:44:11] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:44:15] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:30] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:45:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55297 and previous config saved to /var/cache/conftool/dbconfig/20240123-074538-root.json
[07:45:55] (CR) Majavah: [C: +2] galera: Fix deployment name access [alerts] - https://gerrit.wikimedia.org/r/992205 (owner: Majavah)
[07:46:13] (CR) Majavah: [V: +1 C: +2] P:openstack: galera: always use cloud-private [puppet] - https://gerrit.wikimedia.org/r/992208 (https://phabricator.wikimedia.org/T355418) (owner: Majavah)
[07:46:28] (CR) Majavah: [C: +2] P:openstack: galera: migrate to firewall (1 comment) [puppet] - https://gerrit.wikimedia.org/r/992217 (owner: Majavah)
[07:47:29] (Merged) jenkins-bot: galera: Fix deployment name access [alerts] - https://gerrit.wikimedia.org/r/992205 (owner: Majavah)
[07:51:45] ops-eqiad: mr1-eqiad down - https://phabricator.wikimedia.org/T355643 (ayounsi) p:Triage→High
[07:52:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1051.eqiad.wmnet
[07:52:53] RECOVERY - Host fasw-c-eqiad is UP: PING WARNING - Packet loss = 60%, RTA = 2.33 ms
[07:52:53] RECOVERY - Host asw2-c-eqiad is UP: PING WARNING - Packet loss = 71%, RTA = 3.67 ms
[07:52:53] RECOVERY - Host ps1-f1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 6.90 ms
[07:52:53] RECOVERY - Host ps1-f7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 6.16 ms
[07:52:53] RECOVERY - Host ps1-f8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms
[07:52:53] RECOVERY - Host ps1-e2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms
[07:52:54] RECOVERY - Host ps1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms
[07:52:54] RECOVERY - Host ps1-e5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[07:52:55] RECOVERY - Host ps1-f5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[07:52:55] RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms
[07:52:56] (CR) Muehlenhoff: [C: +2] mc1051: Switch MW memcache to puppet7 [puppet] - https://gerrit.wikimedia.org/r/991297 (owner: Effie Mouzeli)
[07:52:56] RECOVERY - Host ps1-e6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.97 ms
[07:52:56] RECOVERY - Host ps1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.68 ms
[07:52:57] RECOVERY - Host ps1-f3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms
[07:52:57] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:52:58] RECOVERY - Host ps1-e7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.60 ms
[07:52:58] RECOVERY - Host ps1-e8-eqiad is UP: PING OK - 
Packet loss = 0%, RTA = 1.10 ms [07:52:59] RECOVERY - Host ps1-f6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.37 ms [07:52:59] RECOVERY - Host ps1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [07:53:00] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [07:53:00] RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [07:53:19] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [07:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P55298 and previous config saved to /var/cache/conftool/dbconfig/20240123-075327-marostegui.json [07:53:33] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:53:34] it's back? [07:53:43] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [07:54:02] yeah, I guess it crashed and rebooted on its own [07:55:31] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [07:55:35] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [07:56:11] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:57:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1051.eqiad.wmnet [07:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P55299 and previous config saved to /var/cache/conftool/dbconfig/20240123-075725-ladsgroup.json [07:57:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[07:57:53] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2051.codfw.wmnet [07:59:13] 10ops-eqiad: mr1-eqiad down - https://phabricator.wikimedia.org/T355643 (10ayounsi) 05Open→03Resolved a:03ayounsi Router rebooted on its own. Lets see if it happens again before following up with JTAC. [07:59:19] (03CR) 10Muehlenhoff: [C: 03+2] mc2051: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991298 (owner: 10Effie Mouzeli) [08:00:04] Amir1 and Urbanecm: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55300 and previous config saved to /var/cache/conftool/dbconfig/20240123-080044-root.json [08:02:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2051.codfw.wmnet [08:08:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P55301 and previous config saved to /var/cache/conftool/dbconfig/20240123-080834-marostegui.json [08:09:11] (03CR) 10Muehlenhoff: [C: 03+1] "I have an old computer with Windows 10 for my taxes. 
If you point me to the current docs I'm happy to figure out how to update them with t" [puppet] - 10https://gerrit.wikimedia.org/r/875899 (https://phabricator.wikimedia.org/T198138) (owner: 10Majavah) [08:12:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P55302 and previous config saved to /var/cache/conftool/dbconfig/20240123-081231-ladsgroup.json [08:15:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55303 and previous config saved to /var/cache/conftool/dbconfig/20240123-081549-root.json [08:18:26] (03CR) 10Majavah: [C: 03+2] cr-labs: Remove temporary openstack-apis rule [homer/public] - 10https://gerrit.wikimedia.org/r/992244 (owner: 10Majavah) [08:19:09] (03Merged) 10jenkins-bot: cr-labs: Remove temporary openstack-apis rule [homer/public] - 10https://gerrit.wikimedia.org/r/992244 (owner: 10Majavah) [08:23:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T354336)', diff saved to https://phabricator.wikimedia.org/P55304 and previous config saved to /var/cache/conftool/dbconfig/20240123-082340-marostegui.json [08:23:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance [08:23:45] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:23:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance [08:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T354336)', diff saved to https://phabricator.wikimedia.org/P55305 and previous config saved to /var/cache/conftool/dbconfig/20240123-082402-marostegui.json [08:26:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2162 (T354336)', diff saved to https://phabricator.wikimedia.org/P55306 and previous config saved to /var/cache/conftool/dbconfig/20240123-082613-marostegui.json [08:27:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P55307 and previous config saved to /var/cache/conftool/dbconfig/20240123-082738-ladsgroup.json [08:28:19] !log updating CR firewall policy with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/992244 [08:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:23] (03CR) 10Majavah: [C: 03+2] cr-labs: Add temporary term for cloudrabbit hosts [homer/public] - 10https://gerrit.wikimedia.org/r/992245 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [08:28:38] hashar: I haven't been getting gerrit emails for example for updates to https://gerrit.wikimedia.org/r/c/operations/puppet/+/990795 with new comments from folks, is that expected/known? [08:28:54] since yesterday's upgrade that is [08:28:58] (03Merged) 10jenkins-bot: cr-labs: Add temporary term for cloudrabbit hosts [homer/public] - 10https://gerrit.wikimedia.org/r/992245 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [08:29:24] I am getting gerrit emails so that part is working, though e.g. 
on from jenkins-bot when a change is submitted [08:30:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55308 and previous config saved to /var/cache/conftool/dbconfig/20240123-083054-root.json [08:32:42] (03PS1) 10Majavah: cr-labs: cloudrabbit: add missing action [homer/public] - 10https://gerrit.wikimedia.org/r/992359 [08:32:49] (03CR) 10Majavah: [C: 03+2] cr-labs: cloudrabbit: add missing action [homer/public] - 10https://gerrit.wikimedia.org/r/992359 (owner: 10Majavah) [08:33:52] (03Merged) 10jenkins-bot: cr-labs: cloudrabbit: add missing action [homer/public] - 10https://gerrit.wikimedia.org/r/992359 (owner: 10Majavah) [08:35:11] indeed my "notifications" has only "submits", maybe the default changed [08:35:48] I'll followup in a task [08:37:48] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [08:39:54] !log ayounsi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1002.eqiad.wmnet [08:40:52] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki::cgroup: Enable v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) (owner: 10Muehlenhoff) [08:41:03] opened T355646 [08:41:04] T355646: Gerrit notifications settings default/reset post upgrade - https://phabricator.wikimedia.org/T355646 [08:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P55309 and previous config saved to /var/cache/conftool/dbconfig/20240123-084119-marostegui.json [08:41:46] XioNoX: topranks: hey, I'm trying to run the netbox capirca homer script and that's timing out - any clues how to fix that? 
https://netbox.wikimedia.org/extras/scripts/results/5441253/ [08:41:49] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1002.eqiad.wmnet [08:42:33] taavi: looking [08:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T352010)', diff saved to https://phabricator.wikimedia.org/P55310 and previous config saved to /var/cache/conftool/dbconfig/20240123-084244-ladsgroup.json [08:42:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:42:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [08:42:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:42:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:42:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:43:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55311 and previous config saved to /var/cache/conftool/dbconfig/20240123-084301-ladsgroup.json [08:44:25] taavi: I ran it again and it worked fine, something funky with redis I guess but at least you're not blocked [08:44:30] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [08:44:37] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) [08:44:39] weird. 
that was my second run with the same timeout [08:44:40] thanks [08:44:42] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [08:45:35] (03CR) 10Filippo Giunchedi: "LGTM overall (see inline, non blocking)" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [08:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55312 and previous config saved to /var/cache/conftool/dbconfig/20240123-084559-root.json [08:49:57] (03CR) 10Gmodena: [C: 03+2] eventstreams: add redacted pages config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: 10Htriedman) [08:51:12] (03Merged) 10jenkins-bot: eventstreams: add redacted pages config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: 10Htriedman) [08:51:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS bullseye [08:55:11] !log updating CR firewall policy with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/992245/ https://gerrit.wikimedia.org/r/c/operations/homer/public/+/992359/ [08:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P55313 and previous config saved to /var/cache/conftool/dbconfig/20240123-085625-marostegui.json [09:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0900) [09:01:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55314 and 
previous config saved to /var/cache/conftool/dbconfig/20240123-090104-root.json [09:01:15] !log ayounsi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [09:04:54] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1003.eqiad.wmnet [09:05:13] good morning [09:06:44] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (10ayounsi) [09:11:03] (03PS1) 10Slyngshede: Capitalize first character in CNs. [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [09:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T354336)', diff saved to https://phabricator.wikimedia.org/P55315 and previous config saved to /var/cache/conftool/dbconfig/20240123-091132-marostegui.json [09:11:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance [09:11:37] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:11:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance [09:11:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55316 and previous config saved to /var/cache/conftool/dbconfig/20240123-091154-marostegui.json [09:13:36] (03CR) 10Majavah: Capitalize first character in CNs. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [09:14:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55317 and previous config saved to /var/cache/conftool/dbconfig/20240123-091404-marostegui.json [09:15:11] I am deploying wmf.15 on group0 [09:15:47] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992363 (https://phabricator.wikimedia.org/T354433) [09:15:49] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992363 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [09:16:42] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992363 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [09:21:12] 09:20:48 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'deploy1002.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'deploy2002.codfw.wmnet'] (ran as mwdeploy@mw1486.eqiad.wmnet) returned [127]: bash: line 1: /usr/bin/scap: No such file or directory [09:21:14] fun [09:21:29] mw1486 is thus faulty somehow, but maybe it is being reimaged [09:21:47] (03PS2) 10Slyngshede: Capitalize first character in CNs. [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [09:21:55] (ran as mwdeploy@snapshot1016.eqiad.wmnet) returned [255]: Host key verification failed. [09:21:59] another one :) [09:22:05] (03CR) 10Slyngshede: Capitalize first character in CNs. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [09:22:11] we also lost https://sal.toolforge.org/production [09:22:20] T355622 [09:22:20] T355622: Scap Error - https://phabricator.wikimedia.org/T355622 [09:22:46] (03PS3) 10Slyngshede: Capitalize first character in CNs. [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [09:22:53] that is a candidate for the poorest task title of the year [09:22:54] :) [09:24:04] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.15 refs T354433 [09:24:07] hopefully `mw1486` is not pooled [09:24:09] T354433: 1.42.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T354433 [09:24:09] 10SRE, 10serviceops: scap not installed on mw1486.eqiad.wment which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10hashar) [09:24:30] hashar: snapshot1016 is being reimaged currently, sorry about that [09:26:21] hmm mw1486 was made a kubernetes worker, but is still listed as a scap proxy [09:26:31] moritzm: ah great thanks :) [09:26:55] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Dumps-Generation, 10Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (10hashar) When running the MediaWiki train, scap complained due to the ssh host key of `snapshot1016.eqiad.wmnet` not b... 
[09:27:22] (03PS1) 10Majavah: hieradata: remove mw1486 from scap proxies [puppet] - 10https://gerrit.wikimedia.org/r/992364 (https://phabricator.wikimedia.org/T355622) [09:27:59] I don't know what happens when a scap proxy is not updated, hopefully it is not used for the rest of the sync :) [09:28:24] hnowlan: ^^ https://gerrit.wikimedia.org/r/c/operations/puppet/+/992364 [09:28:57] I can't tell about how the scap proxies are balanced though [09:29:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P55318 and previous config saved to /var/cache/conftool/dbconfig/20240123-092910-marostegui.json [09:29:21] the original idea was to avoid killing the low level network by spreading the load per datacenter rows [09:29:38] maybe it does not matter anymore nowadays [09:29:41] 10SRE, 10serviceops, 10Patch-For-Review: scap not installed on mw1486.eqiad.wment which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10taavi) [09:29:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [09:29:43] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10taavi) [09:30:56] (03PS1) 10Ayounsi: Firmware extract_version: handle more NIC strings [cookbooks] - 10https://gerrit.wikimedia.org/r/992365 (https://phabricator.wikimedia.org/T355649) [09:31:14] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (10ayounsi) [09:32:14] (03CR) 10Hashar: "The original intent was to spread the network load between the different datacenter raws and avoid network traffic accross the network, bu" [puppet] - 10https://gerrit.wikimedia.org/r/992364 (https://phabricator.wikimedia.org/T355622) (owner: 
10Majavah) [09:33:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [09:33:09] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/992365 (https://phabricator.wikimedia.org/T355649) (owner: 10Ayounsi) [09:34:13] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Dumps-Generation: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (10MoritzMuehlenhoff) >>! In T325228#9480025, @hashar wrote: > When running the MediaWiki train, scap complained due to the ssh host key of `s... [09:35:52] (03CR) 10Ayounsi: [C: 03+2] Firmware extract_version: handle more NIC strings [cookbooks] - 10https://gerrit.wikimedia.org/r/992365 (https://phabricator.wikimedia.org/T355649) (owner: 10Ayounsi) [09:40:19] (03Merged) 10jenkins-bot: Firmware extract_version: handle more NIC strings [cookbooks] - 10https://gerrit.wikimedia.org/r/992365 (https://phabricator.wikimedia.org/T355649) (owner: 10Ayounsi) [09:41:25] !log ayounsi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [09:41:30] (03PS2) 10Kosta Harlan: Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992367 [09:43:10] (03PS26) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [09:43:27] (03CR) 10CI reject: [V: 04-1] hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [09:43:52] (03PS1) 10Kosta Harlan: Fix CentralIdLookup tests [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992367 [09:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to 
https://phabricator.wikimedia.org/P55319 and previous config saved to /var/cache/conftool/dbconfig/20240123-094417-marostegui.json [09:44:49] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (10ayounsi) 05Open→03Resolved a:03ayounsi [09:46:11] (03PS27) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [09:47:44] (03CR) 10Cathal Mooney: [C: 03+2] Add BGP to the contributing protocols for aggregate routes on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: 10Cathal Mooney) [09:48:39] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1184/console" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [09:48:48] (03Merged) 10jenkins-bot: Add BGP to the contributing protocols for aggregate routes on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: 10Cathal Mooney) [09:51:51] (03PS28) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [09:54:56] (03PS1) 10Lucas Werkmeister (WMDE): termbox(test): update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992387 (https://phabricator.wikimedia.org/T331403) [09:55:21] (03PS4) 10Slyngshede: Capitalize first character in CNs. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [09:57:13] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:59:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55320 and previous config saved to /var/cache/conftool/dbconfig/20240123-095923-marostegui.json [09:59:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:59:29] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:59:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:59:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:59:52] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts sretest1003.eqiad.wmnet [09:59:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:00:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T354336)', diff saved to https://phabricator.wikimedia.org/P55321 and previous config saved to /var/cache/conftool/dbconfig/20240123-100002-marostegui.json [10:00:33] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [10:02:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T354336)', diff saved to https://phabricator.wikimedia.org/P55322 and previous config saved to /var/cache/conftool/dbconfig/20240123-100212-marostegui.json [10:02:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1016.eqiad.wmnet with OS bullseye 
[10:03:47] !log ayounsi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [10:03:56] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts sretest1003.eqiad.wmnet [10:04:04] (03CR) 10Muehlenhoff: "Looks good, one nit inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:04:17] !log ayounsi@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [10:04:55] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [10:04:58] (03CR) 10Muehlenhoff: Capitalize first character in CNs. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:06:58] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [10:10:46] (03PS1) 10Marostegui: db2171: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992389 (https://phabricator.wikimedia.org/T354506) [10:10:49] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-master1001.eqiad.wmnet [10:10:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2171:3315 db2171:3316', diff saved to https://phabricator.wikimedia.org/P55323 and previous config saved to /var/cache/conftool/dbconfig/20240123-101056-marostegui.json [10:11:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:12:15] (03CR) 10Marostegui: [C: 03+2] db2171: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992389 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [10:12:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2171.codfw.wmnet with OS bookworm [10:13:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [10:13:44] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts sretest1003.eqiad.wmnet [10:17:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P55324 and previous config saved to /var/cache/conftool/dbconfig/20240123-101718-marostegui.json [10:18:41] (03CR) 10Muehlenhoff: "Which wiki do you mean? The authorative source is definitely what's in data.yaml only." [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [10:23:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS bullseye [10:23:49] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10ayounsi) a:05ayounsi→03None [10:24:25] (03CR) 10Slyngshede: Capitalize first character in CNs. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:25:10] 10SRE-swift-storage, 10UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610 (10MatthewVernon) I don't think this is a result of a swift failure, so we'd need input from the upload wizard folks. Looking in the swift logs, I see: ` moss-fe2... 
[10:27:41] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [10:28:25] (03PS5) 10Slyngshede: Capitalize first character in CNs. [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [10:31:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [10:32:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P55325 and previous config saved to /var/cache/conftool/dbconfig/20240123-103225-marostegui.json [10:32:38] (03CR) 10Muehlenhoff: Capitalize first character in CNs. (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:32:45] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-master1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [10:34:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-master1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [10:34:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-master1001.eqiad.wmnet [10:35:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [10:36:24] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10MatthewVernon) When did you try with upload wizard and get the error message you describe here? 
I've checked the swift logs for 18 and 19 January, and get no hits at all for `1an8dgb0q6... [10:40:45] (03PS1) 10Marostegui: Revert "db2171: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992370 [10:41:26] 10SRE-swift-storage, 10UploadWizard: Problem uploading FLAC file in Upload Wizzard to Wikimedia Commons - https://phabricator.wikimedia.org/T355610 (10Wilfredor) Apparently it is not a storage problem but in the way the files are unified in a subsequent process, the temporary files are deleted without first co... [10:41:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:41:46] (03CR) 10Clément Goubert: [C: 03+1] hieradata: remove mw1486 from scap proxies [puppet] - 10https://gerrit.wikimedia.org/r/992364 (https://phabricator.wikimedia.org/T355622) (owner: 10Majavah) [10:41:56] (03CR) 10Majavah: [C: 03+2] hieradata: remove mw1486 from scap proxies [puppet] - 10https://gerrit.wikimedia.org/r/992364 (https://phabricator.wikimedia.org/T355622) (owner: 10Majavah) [10:43:03] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-master1002.eqiad.wmnet [10:45:21] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Wilfredor) I have the same problem uploading FLAC files, same error [10:45:25] (03PS6) 10Slyngshede: Capitalize first character in CNs. [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) [10:46:01] (03CR) 10Slyngshede: Capitalize first character in CNs. 
(033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:47:00] 10SRE, 10serviceops, 10Patch-For-Review: scap not installed on mw1486.eqiad.wment which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 (10Clement_Goubert) The host had been reclaimed for mw-on-k8s, and wasn't removed from the list of scap proxies. Than... [10:47:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T354336)', diff saved to https://phabricator.wikimedia.org/P55326 and previous config saved to /var/cache/conftool/dbconfig/20240123-104731-marostegui.json [10:47:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:47:36] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:47:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:47:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55327 and previous config saved to /var/cache/conftool/dbconfig/20240123-104753-marostegui.json [10:48:48] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [10:49:26] (03CR) 10Muehlenhoff: "Looks good. Strictly speaking this should be a WMF-specific setting (given the restriction is only imposed by MediaWiki and other uses of " [software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:49:31] (03CR) 10Muehlenhoff: [C: 03+1] Capitalize first character in CNs. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55328 and previous config saved to /var/cache/conftool/dbconfig/20240123-105003-marostegui.json [10:50:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55329 and previous config saved to /var/cache/conftool/dbconfig/20240123-105035-root.json [10:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55330 and previous config saved to /var/cache/conftool/dbconfig/20240123-105410-root.json [10:54:16] (03CR) 10Marostegui: [C: 03+2] Revert "db2171: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992370 (owner: 10Marostegui) [10:54:19] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Capitalize first character in CNs. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/992362 (https://phabricator.wikimedia.org/T355615) (owner: 10Slyngshede) [10:55:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2171.codfw.wmnet with OS bookworm [10:55:52] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/992181 (https://phabricator.wikimedia.org/T355658) [10:55:57] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/992182 (https://phabricator.wikimedia.org/T355658) [10:56:08] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-master1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [10:56:15] (03PS1) 10Clément Goubert: scap::dsh::scap_proxies: Replace mw1486 by mw1405 [puppet] - 10https://gerrit.wikimedia.org/r/992391 (https://phabricator.wikimedia.org/T355622) [10:58:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-master1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [10:58:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:58:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-master1002.eqiad.wmnet [10:59:47] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1231 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/992183 (https://phabricator.wikimedia.org/T355660) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1100) [11:00:44] (03PS1) 10Kamila Součková: sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 
(https://phabricator.wikimedia.org/T355187) [11:02:45] (03PS1) 10Muehlenhoff: Deprecate system::role for ML services [puppet] - 10https://gerrit.wikimedia.org/r/992393 [11:05:07] (03CR) 10CI reject: [V: 04-1] sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: 10Kamila Součková) [11:05:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P55331 and previous config saved to /var/cache/conftool/dbconfig/20240123-110509-marostegui.json [11:05:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55332 and previous config saved to /var/cache/conftool/dbconfig/20240123-110540-root.json [11:06:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355660 [11:07:05] T355660: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T355660 [11:07:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s6 T355660 [11:07:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1231 with weight 0 T355660', diff saved to https://phabricator.wikimedia.org/P55333 and previous config saved to /var/cache/conftool/dbconfig/20240123-110743-marostegui.json [11:09:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55335 and previous config saved to /var/cache/conftool/dbconfig/20240123-110915-root.json [11:11:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1017.eqiad.wmnet with OS bullseye [11:11:53] !log dropping pif_edits table from all wikis 
(T355594) [11:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:59] T355594: SecurePoll creates a table for each election and keeps it forever - https://phabricator.wikimedia.org/T355594 [11:13:43] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2245 MB (2% inode=83%): /tmp 2245 MB (2% inode=83%): /var/tmp 2245 MB (2% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [11:14:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1231 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/992183 (https://phabricator.wikimedia.org/T355660) (owner: 10Gerrit maintenance bot) [11:16:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1017.eqiad.wmnet with OS bullseye [11:16:08] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [11:16:12] taavi: thanks for fixing that [11:20:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P55336 and previous config saved to /var/cache/conftool/dbconfig/20240123-112016-marostegui.json [11:20:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55337 and previous config saved to /var/cache/conftool/dbconfig/20240123-112045-root.json [11:20:47] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [11:21:34] (03CR) 10Vgutierrez: [C: 04-1] hiera: add acls for heavy ratelimiting abusing ip from list (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [11:22:49] !log dropping bv2011_edits table from all wikis (T355594) [11:22:51] Amir1: Failed to log message to wiki. 
Somebody should check the error logs. [11:22:52] T355594: SecurePoll creates a table for each election and keeps it forever - https://phabricator.wikimedia.org/T355594 [11:24:17] marostegui: let me know once done, I have T343718 on the old master :D [11:24:17] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:24:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55338 and previous config saved to /var/cache/conftool/dbconfig/20240123-112420-root.json [11:24:24] Amir1: will do [11:24:31] thanks! [11:25:33] re stashbot, apparently wikitech was briefly readonly due to replication lag [11:26:16] ops-eqiad, decommission-hardware, Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (BTullis) a:BTullis→Jclark-ctr [11:26:27] Lucas_WMDE: correct, part of the s6 eqiad switch [11:26:37] is it okay to edit https://wikitech.wikimedia.org/wiki/Server_Admin_Log manually to add the missed messages? 
[11:26:57] ops-eqiad, decommission-hardware, Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (BTullis) a:BTullis→Jclark-ctr [11:27:53] (PS2) Kamila Součková: sre.downtime: add locking (workaround for T355187) [cookbooks] - https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) [11:31:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [11:32:06] (CR) CI reject: [V: -1] sre.downtime: add locking (workaround for T355187) [cookbooks] - https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: Kamila Součková) [11:33:45] (PS1) Muehlenhoff: late_command: Drop special case for snapshot1016/1017 [puppet] - https://gerrit.wikimedia.org/r/992398 (https://phabricator.wikimedia.org/T325228) [11:34:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1017.eqiad.wmnet with reason: host reimage [11:35:05] !log Starting s6 eqiad failover from db1173 to db1231 - T355660 [11:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:11] T355660: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T355660 [11:35:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55339 and previous config saved to /var/cache/conftool/dbconfig/20240123-113522-marostegui.json [11:35:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance [11:35:27] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:35:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with 
reason: Maintenance [11:35:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2167:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55340 and previous config saved to /var/cache/conftool/dbconfig/20240123-113544-marostegui.json [11:35:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55341 and previous config saved to /var/cache/conftool/dbconfig/20240123-113550-root.json [11:37:29] (03PS3) 10Kamila Součková: sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) [11:37:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55342 and previous config saved to /var/cache/conftool/dbconfig/20240123-113754-marostegui.json [11:39:45] (03CR) 10Majavah: [C: 03+1] "Seems reasonable, mw1405 is indeed in row C and does seem old enough that it won't be converted to a k8s node any time soon." [puppet] - 10https://gerrit.wikimedia.org/r/992391 (https://phabricator.wikimedia.org/T355622) (owner: 10Clément Goubert) [11:42:58] (03PS1) 10Lucas Werkmeister (WMDE): beta: Don’t prevent * from editing if IP masking enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) [11:44:39] (03CR) 10Lucas Werkmeister (WMDE): "CCing people from I76036e4e0b." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: 10Lucas Werkmeister (WMDE)) [11:48:13] (03CR) 10Lucas Werkmeister (WMDE): "Normal wikitext edits are still working on Beta (https://wikidata.beta.wmflabs.org/w/index.php?title=Talk:Q459365&oldid=1355297); I’m gues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: 10Lucas Werkmeister (WMDE)) [11:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1173', diff saved to https://phabricator.wikimedia.org/P55343 and previous config saved to /var/cache/conftool/dbconfig/20240123-114826-marostegui.json [11:48:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55344 and previous config saved to /var/cache/conftool/dbconfig/20240123-114831-root.json [11:48:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55345 and previous config saved to /var/cache/conftool/dbconfig/20240123-114840-root.json [11:50:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55346 and previous config saved to /var/cache/conftool/dbconfig/20240123-115055-root.json [11:53:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P55347 and previous config saved to /var/cache/conftool/dbconfig/20240123-115301-marostegui.json [11:53:15] (03PS4) 10Kamila Součková: sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) [11:54:26] !log initial cleanup of replicated thanos blocks - T351927 [11:54:30] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:34] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [11:57:46] (CR) CI reject: [V: -1] sre.downtime: add locking (workaround for T355187) [cookbooks] - https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: Kamila Součková) [12:00:49] (CR) Muehlenhoff: [C: +2] late_command: Drop special case for snapshot1016/1017 [puppet] - https://gerrit.wikimedia.org/r/992398 (https://phabricator.wikimedia.org/T325228) (owner: Muehlenhoff) [12:02:12] (PS29) Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [12:02:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1017.eqiad.wmnet with OS bullseye [12:02:40] (CR) Fabfur: [C: -1] hiera: add acls for heavy ratelimiting abusing ip from list (3 comments) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [12:02:59] !log dropping bv2009_edits table from all wikis (T355594) [12:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] T355594: SecurePoll creates a table for each election and keeps it forever - https://phabricator.wikimedia.org/T355594 [12:03:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55348 and previous config saved to /var/cache/conftool/dbconfig/20240123-120335-root.json [12:03:37] (PS2) Majavah: P:openstack: nova: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991772 (https://phabricator.wikimedia.org/T355417) [12:03:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Upgrade to 10.6.16 and 
bookworm', diff saved to https://phabricator.wikimedia.org/P55349 and previous config saved to /var/cache/conftool/dbconfig/20240123-120344-root.json [12:05:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1016.eqiad.wmnet [12:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55350 and previous config saved to /var/cache/conftool/dbconfig/20240123-120600-root.json [12:07:01] (03PS1) 10Muehlenhoff: Switch snapshot1016 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992402 (https://phabricator.wikimedia.org/T349619) [12:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P55351 and previous config saved to /var/cache/conftool/dbconfig/20240123-120807-marostegui.json [12:08:31] (03CR) 10Majavah: [C: 03+2] P:openstack: nova: use cloud-private for memcached access [puppet] - 10https://gerrit.wikimedia.org/r/991772 (https://phabricator.wikimedia.org/T355417) (owner: 10Majavah) [12:09:45] (03CR) 10Muehlenhoff: [C: 03+2] Switch snapshot1016 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992402 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:10:33] (03PS5) 10Kamila Součková: sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) [12:11:57] (03CR) 10Clément Goubert: [C: 03+2] scap::dsh::scap_proxies: Replace mw1486 by mw1405 [puppet] - 10https://gerrit.wikimedia.org/r/992391 (https://phabricator.wikimedia.org/T355622) (owner: 10Clément Goubert) [12:13:35] !log dropping bv2015_edits table from all wikis (T355594) [12:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:47] T355594: SecurePoll creates a table for each election and keeps it forever - 
https://phabricator.wikimedia.org/T355594 [12:14:31] !log scap::dsh::scap_proxies: Replace mw1486 by mw1405 - T355622 [12:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] T355622: scap not installed on mw1486.eqiad.wment which breaks deployment: /usr/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T355622 [12:14:50] (03CR) 10Volans: [C: 03+1] "LGTM, thanks a lot!" [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: 10Kamila Součková) [12:15:21] (03CR) 10Kamila Součková: [C: 03+2] sre.downtime: add locking (workaround for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: 10Kamila Součková) [12:16:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1016.eqiad.wmnet [12:17:11] !log Restarting ferm.service on k8s node mw1495.eqiad.wmnet - T354855 [12:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:25] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [12:18:10] RECOVERY - Check systemd state on mw1495 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55352 and previous config saved to /var/cache/conftool/dbconfig/20240123-121841-root.json [12:18:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55353 and previous config saved to /var/cache/conftool/dbconfig/20240123-121849-root.json [12:19:51] (03Merged) 10jenkins-bot: sre.downtime: add locking (workaround 
for T355187) [cookbooks] - 10https://gerrit.wikimedia.org/r/992392 (https://phabricator.wikimedia.org/T355187) (owner: 10Kamila Součková) [12:21:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55354 and previous config saved to /var/cache/conftool/dbconfig/20240123-122105-root.json [12:22:11] (03PS1) 10Klausman: /home/klausman: add convenience functions to (de)activate webproxy [puppet] - 10https://gerrit.wikimedia.org/r/992404 [12:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55355 and previous config saved to /var/cache/conftool/dbconfig/20240123-122314-marostegui.json [12:23:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:23:19] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:23:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:23:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55356 and previous config saved to /var/cache/conftool/dbconfig/20240123-122336-marostegui.json [12:23:47] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on sretest1001.eqiad.wmnet with reason: testing the cookbook [12:24:30] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (10BTullis) [12:24:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1495 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm 
[12:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55357 and previous config saved to /var/cache/conftool/dbconfig/20240123-122446-marostegui.json [12:24:58] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (10BTullis) [12:26:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1001.eqiad.wmnet with reason: testing the cookbook [12:26:46] !log kamila@cumin1002 START - Cookbook sre.hosts.remove-downtime for sretest1001.eqiad.wmnet [12:26:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for sretest1001.eqiad.wmnet [12:26:49] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=757944ae-dc8f-4433-9c0c-e68dc04b371b) set by kamila@cumin1002 for 4:00:... [12:28:34] !log Restarting killed maintenance job mediawiki_job_MachineVision_prioritize_uncategorized.service [12:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:32] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:32] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (10kamila) 05Open→03Resolved I believe the above patch fixed it, so I'm closing this. I will reopen in case I see the race again. 
[12:31:59] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1017.eqiad.wmnet [12:33:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171:3316 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55358 and previous config saved to /var/cache/conftool/dbconfig/20240123-123346-root.json [12:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55359 and previous config saved to /var/cache/conftool/dbconfig/20240123-123354-root.json [12:34:25] (03CR) 10Marostegui: [C: 03+1] mariadb: prometheus on localhost grant should be VIA unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/992220 (owner: 10Ladsgroup) [12:36:05] (03PS2) 10Ladsgroup: mariadb: prometheus on localhost grant should be VIA unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/992220 [12:36:11] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: prometheus on localhost grant should be VIA unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/992220 (owner: 10Ladsgroup) [12:38:34] (03PS1) 10Muehlenhoff: Switch snapshot1017 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992406 (https://phabricator.wikimedia.org/T349619) [12:39:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P55360 and previous config saved to /var/cache/conftool/dbconfig/20240123-123952-marostegui.json [12:40:32] (03CR) 10Muehlenhoff: [C: 03+2] Switch snapshot1017 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/992406 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:40:52] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Error 503, Backend fetch failed while uploading file from Internet Archive - https://phabricator.wikimedia.org/T352215 (10Yann) I got this again, this time uploading from the Library of 
Congress: * https://tile.loc.gov/storage-services/master/pnp/fsa/8c5200... [12:44:04] (03PS2) 10Klausman: /home/klausman: add convenience functions to (de)activate webproxy [puppet] - 10https://gerrit.wikimedia.org/r/992404 [12:45:23] (03PS3) 10Klausman: /home/klausman: Assorted small bash and tmux fixes [puppet] - 10https://gerrit.wikimedia.org/r/992404 [12:45:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1017.eqiad.wmnet [12:48:45] (03CR) 10Klausman: [C: 03+2] /home/klausman: Assorted small bash and tmux fixes [puppet] - 10https://gerrit.wikimedia.org/r/992404 (owner: 10Klausman) [12:49:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55361 and previous config saved to /var/cache/conftool/dbconfig/20240123-124859-root.json [12:49:08] (03CR) 10Klausman: [C: 03+1] Deprecate system::role for ML services [puppet] - 10https://gerrit.wikimedia.org/r/992393 (owner: 10Muehlenhoff) [12:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P55362 and previous config saved to /var/cache/conftool/dbconfig/20240123-125459-marostegui.json [12:56:23] (03PS1) 10Muehlenhoff: Remove obsolete setting [puppet] - 10https://gerrit.wikimedia.org/r/992407 [12:56:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS bullseye [12:57:10] (03CR) 10Muehlenhoff: [C: 03+2] Deprecate system::role for ML services [puppet] - 10https://gerrit.wikimedia.org/r/992393 (owner: 10Muehlenhoff) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1300) [13:00:22] (03CR) 10Muehlenhoff: "ry" [puppet] - 10https://gerrit.wikimedia.org/r/992407 (owner: 10Muehlenhoff) [13:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 
(re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55363 and previous config saved to /var/cache/conftool/dbconfig/20240123-130404-root.json [13:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T354336)', diff saved to https://phabricator.wikimedia.org/P55364 and previous config saved to /var/cache/conftool/dbconfig/20240123-131005-marostegui.json [13:10:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance [13:10:14] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:10:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance [13:10:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T354336)', diff saved to https://phabricator.wikimedia.org/P55365 and previous config saved to /var/cache/conftool/dbconfig/20240123-131027-marostegui.json [13:10:58] (03PS3) 10Slyngshede: Code cleanup before enabling CI pipeline. [software/bitu] - 10https://gerrit.wikimedia.org/r/992074 [13:11:13] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Sannita) >>! In T355433#9480324, @MatthewVernon wrote: > When did you try with upload wizard and get the error message you describe here? I've checked the swift logs for 18 and 19 Janua... 
[13:12:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T354336)', diff saved to https://phabricator.wikimedia.org/P55366 and previous config saved to /var/cache/conftool/dbconfig/20240123-131237-marostegui.json [13:12:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [13:13:37] SRE-swift-storage, Commons, Internet-Archive: Error 503, Backend fetch failed while uploading file from Internet Archive - https://phabricator.wikimedia.org/T352215 (MatthewVernon) That first one looks to have uploaded OK as https://commons.wikimedia.org/wiki/File:Washstand_in_the_dog_run_and_kitchen... [13:15:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1016.eqiad.wmnet with reason: host reimage [13:16:13] (PS9) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 [13:19:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55367 and previous config saved to /var/cache/conftool/dbconfig/20240123-131909-root.json [13:22:25] (PS1) WMDE-Fisch: Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) [13:26:55] (CR) Muehlenhoff: Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [13:27:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P55368 and previous config saved to /var/cache/conftool/dbconfig/20240123-132744-marostegui.json [13:29:15] (CR) Muehlenhoff: Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - 
https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [13:37:45] (PS1) Filippo Giunchedi: thanos: fix bucket-query tools [puppet] - https://gerrit.wikimedia.org/r/992413 (https://phabricator.wikimedia.org/T351927) [13:37:47] (PS1) Filippo Giunchedi: thanos: add replicated blocks view to bucket-query [puppet] - https://gerrit.wikimedia.org/r/992414 (https://phabricator.wikimedia.org/T351927) [13:37:49] (PS1) Filippo Giunchedi: thanos: add labels to thanos-rule blocks [puppet] - https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) [13:38:48] (PS1) Bartosz Dziewoński: Restore support for matching 'LIKE' patterns/wildcards [extensions/Nuke] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/992376 (https://phabricator.wikimedia.org/T355478) [13:38:58] (PS1) Bartosz Dziewoński: Restore support for matching 'LIKE' patterns/wildcards [extensions/Nuke] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992377 (https://phabricator.wikimedia.org/T355478) [13:39:08] (PS10) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 [13:40:36] (CR) Bartosz Dziewoński: "(removing that code is T355210)" [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [13:42:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P55369 and previous config saved to /var/cache/conftool/dbconfig/20240123-134250-marostegui.json [13:45:03] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (MatthewVernon) I'm afraid I don't have the time and resources to follow discussions elsewhere, and without that information there's nothing much more I can do with this report. 
[13:45:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1016.eqiad.wmnet with OS bullseye [13:45:50] (CR) Awight: [C: +1] Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch) [13:47:40] (PS1) EoghanGaffney: [phabricator] Remove public task dump task timer [puppet] - https://gerrit.wikimedia.org/r/992416 (https://phabricator.wikimedia.org/T355502) [13:48:33] (CR) Filippo Giunchedi: [C: +2] thanos: fix bucket-query tools [puppet] - https://gerrit.wikimedia.org/r/992413 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi) [13:48:45] (CR) Filippo Giunchedi: [C: +2] thanos: add replicated blocks view to bucket-query [puppet] - https://gerrit.wikimedia.org/r/992414 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi) [13:49:38] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/992416 (https://phabricator.wikimedia.org/T355502) (owner: EoghanGaffney) [13:49:55] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:50:31] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:50:53] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:51:19] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[13:52:15] !log Ran `foreachwikiindblist group0 extensions/MediaModeration/maintenance/resendMatchEmails.php 20200405 --verbose` [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T354336)', diff saved to https://phabricator.wikimedia.org/P55370 and previous config saved to /var/cache/conftool/dbconfig/20240123-135757-marostegui.json [13:58:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance [13:58:02] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:58:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance [13:58:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T354336)', diff saved to https://phabricator.wikimedia.org/P55371 and previous config saved to /var/cache/conftool/dbconfig/20240123-135819-marostegui.json [13:58:25] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Sannita) >>! In T355433#9480955, @MatthewVernon wrote: > I'm afraid I don't have the time and resources to follow discussions elsewhere, and without that information there's nothing muc... [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1400). [14:00:05] Lucas_WMDE, phuedx, awight, anzx, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:13] hi [14:00:24] o/ [14:00:38] o/ [14:01:00] o/ [14:01:37] (PS1) Muehlenhoff: Make ganeti1037 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/992418 (https://phabricator.wikimedia.org/T349925) [14:01:43] wow that’s a lot of changes in the window [14:02:01] (PS4) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:02:02] I can deploy I guess [14:02:05] (PS11) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 [14:03:08] (CR) Slyngshede: Package Debmonitor server as .deb (3 comments) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:04:00] (CR) Muehlenhoff: [C: +2] Make ganeti1037 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/992418 (https://phabricator.wikimedia.org/T349925) (owner: Muehlenhoff) [14:04:06] (CR) Lucas Werkmeister (WMDE): "Why do the three PNGs have to change? They look mostly identical to me, and according to Git were last touched in 2020 (I7c362a2153), long" [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:04:19] let’s start with phuedx [14:04:28] (PS2) Lucas Werkmeister (WMDE): ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream [mediawiki-config] - https://gerrit.wikimedia.org/r/991606 (https://phabricator.wikimedia.org/T353366) (owner: Phuedx) [14:04:30] Sure. Ready to verify [14:04:40] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/991606 (https://phabricator.wikimedia.org/T353366) (owner: Phuedx) [14:04:47] \o/ [14:05:12] awight: around? 
[14:05:23] I’d also be thankful if someone could +1 my own changes so I can deploy them ^^ [14:05:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:05:54] (Merged) jenkins-bot: ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream [mediawiki-config] - https://gerrit.wikimedia.org/r/991606 (https://phabricator.wikimedia.org/T353366) (owner: Phuedx) [14:06:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:06:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:06:09] MatmaRex: do you have the necessary permissions to test the Nuke backport on wmf.14 and wmf.15 wikis? or should we do the testing on one of the groups first? [14:06:11] (CR) Hashar: [C: +1] "Welcome to beta cluster!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [14:06:26] (I’m assuming Nuke requires special permissions) [14:06:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:06:35] SRE, Data-Engineering, Data-Platform-SRE, Dumps-Generation: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (Gehel) p:Triage→High [14:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T343718)', diff saved to https://phabricator.wikimedia.org/P55372 and previous config saved to /var/cache/conftool/dbconfig/20240123-140636-ladsgroup.json [14:06:43] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991606|ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream (T353366)]] [14:06:45] thanks hashar :) [14:06:47] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:06:51] T353366: Remove WikimediaEvents diff instrumentation - https://phabricator.wikimedia.org/T353366 [14:06:56] Lucas_WMDE: i can test on mw.org, but it should be the same in both versions [14:07:04] Lucas_WMDE: I'd approve the deployment-charts / termbox one, but I have absolutely no idea what Termbox is about :) [14:07:18] MatmaRex: alright, then let’s backport wmf.15 first [14:07:57] wmf.14 and wmf.15 of Nuke are actually the exact same commit, there were no changes last week [14:07:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [14:08:03] hashar: it provides server-side rendering for the labels, descriptions and aliases on mobile Wikidata, e.g. 
https://m.wikidata.org/wiki/Q42 [14:08:08] so it’s still available to users without JS [14:08:15] though idk if this gives you enough confidence to +1 it ^^ [14:08:26] MatmaRex: nice, that makes it safer ^^ [14:08:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T354336)', diff saved to https://phabricator.wikimedia.org/P55373 and previous config saved to /var/cache/conftool/dbconfig/20240123-140833-marostegui.json [14:08:38] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Backport for [[gerrit:991606|ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream (T353366)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:41] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:08:46] (CR) LSobanski: [C: +1] [phabricator] Remove public task dump task timer [puppet] - https://gerrit.wikimedia.org/r/992416 (https://phabricator.wikimedia.org/T355502) (owner: EoghanGaffney) [14:10:06] (CR) Hashar: [C: +1] termbox(test): update to 2024-01-22-163619-production [deployment-charts] - https://gerrit.wikimedia.org/r/992387 (https://phabricator.wikimedia.org/T331403) (owner: Lucas Werkmeister (WMDE)) [14:10:07] phuedx: can you test on mwdebug? [14:10:40] Lucas_WMDE: I have +1ed the termbox upgrade patch, but you would probably want to get some others from WMDE to assist you if something explodes rather than me :-] [14:10:56] fair enough, thanks ^^ [14:11:12] that commit is only for test.wikidata.org anyways [14:11:22] (PS1) Marostegui: db1173 Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992420 (https://phabricator.wikimedia.org/T354506) [14:11:22] I’ll follow up with another version bump for the main values file if everything works out [14:11:55] Lucas_WMDE: LGTM. 
Stream isn't present via the streamconfigs API nor anything being delivered on regular pageviews [14:12:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1173.eqiad.wmnet with OS bookworm [14:12:13] (CR) Muehlenhoff: Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:12:31] (CR) Marostegui: [C: +2] db1173 Disable notifications [puppet] - https://gerrit.wikimedia.org/r/992420 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui) [14:12:33] alright, thanks! [14:12:36] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and phuedx: Continuing with sync [14:13:30] (CR) Lucas Werkmeister (WMDE): "Similar question for the SVG, actually – that was also not changed when the temporary logo was introduced, as far as I can tell." [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:15:07] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts sretest1003.eqiad.wmnet [14:15:31] (CR) Lucas Werkmeister (WMDE): "👍" [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [14:15:37] (PS2) Lucas Werkmeister (WMDE): beta: Don’t prevent * from editing if IP masking enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) [14:16:46] (PS5) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:16:59] (CR) Slyngshede: Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:17:03] Lucas_WMDE: sorry for the delay, yes I'm around now. 
[14:17:11] alright, hi :) [14:17:36] (CR) CI reject: [V: -1] uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:18:10] (CR) Lucas Werkmeister (WMDE): [C: +2] beta: Don’t prevent * from editing if IP masking enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [14:18:33] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991606|ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream (T353366)]] (duration: 11m 49s) [14:18:38] T353366: Remove WikimediaEvents diff instrumentation - https://phabricator.wikimedia.org/T353366 [14:18:38] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [14:19:18] (Merged) jenkins-bot: beta: Don’t prevent * from editing if IP masking enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/992400 (https://phabricator.wikimedia.org/T354730) (owner: Lucas Werkmeister (WMDE)) [14:19:20] (CR) Lucas Werkmeister (WMDE): [C: +2] "+2ing already to start gate-and-submit" [extensions/Nuke] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992377 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:19:48] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Nuke] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992377 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:20:41] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [14:22:34] SRE-swift-storage, Commons, 
Internet-Archive: Error 503, Backend fetch failed while uploading file from Internet Archive - https://phabricator.wikimedia.org/T352215 (MatthewVernon) And the second one as https://commons.wikimedia.org/wiki/File:Washstand_in_the_dog_run_of_Floyd_Burroughs%27_cabin._Hale... [14:22:41] (Merged) jenkins-bot: Restore support for matching 'LIKE' patterns/wildcards [extensions/Nuke] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992377 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:23:05] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992377|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] [14:23:14] T355478: The pattern is not working for mass delete (Nuke) - https://phabricator.wikimedia.org/T355478 [14:23:35] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1003.eqiad.wmnet [14:23:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P55374 and previous config saved to /var/cache/conftool/dbconfig/20240123-142339-marostegui.json [14:23:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [14:24:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [14:24:40] (CR) Muehlenhoff: "Two final nits" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:24:48] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Backport for [[gerrit:992377|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:57] MatmaRex: can you test on mw.o? 
[14:25:03] yeah, looking [14:25:40] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T355630 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact on host [14:26:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [14:27:29] PROBLEM - Check systemd state on ganeti1037 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:34] Lucas_WMDE: works as expected [14:27:37] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Continuing with sync [14:27:39] yay [14:27:49] (PS12) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 [14:29:15] awight: fyi, I’ll do the second of MatmaRex’ backports and then your config change [14:29:31] kk [14:30:19] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:30:52] (CR) Lucas Werkmeister (WMDE): [C: +2] "+2ing to start gate-and-submit while previous `scap backport` finishes" [extensions/Nuke] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/992376 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:31:18] (PS7) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:32:25] hm, I think that gate-and-submit might almost finish before the php-fpm-restart does [14:32:28] no big deal if it does though ^^ [14:32:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [14:32:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for 
hosts sretest1003.eqiad.wmnet [14:33:35] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992377|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] (duration: 10m 29s) [14:33:41] T355478: The pattern is not working for mass delete (Nuke) - https://phabricator.wikimedia.org/T355478 [14:33:56] SRE, Infrastructure-Foundations, Puppet CI, Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (SLyngshede-WMF) Open→Resolved [14:33:58] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Nuke] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/992376 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:33:59] Puppet, SRE, SRE-tools, Infrastructure-Foundations, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (SLyngshede-WMF) [14:34:17] (Merged) jenkins-bot: Restore support for matching 'LIKE' patterns/wildcards [extensions/Nuke] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/992376 (https://phabricator.wikimedia.org/T355478) (owner: Bartosz Dziewoński) [14:34:23] (CR) Anzx: "i thought it was necessary every time logo change was done, removed it now" [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:34:37] RECOVERY - Check systemd state on ganeti1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:42] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992376|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] [14:36:36] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Backport for 
[[gerrit:992376|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:41] !log lucaswerkmeister-wmde@deploy2002 matmarex and lucaswerkmeister-wmde: Continuing with sync [14:36:47] no point in testing again on wmf.14 [14:37:00] (PS1) Marostegui: Revert "db1173 Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992378 [14:37:04] yup [14:38:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P55375 and previous config saved to /var/cache/conftool/dbconfig/20240123-143846-marostegui.json [14:39:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:09] (PS1) Kosta Harlan: ORES: Enable renamed revertrisklanguageagnostic model [mediawiki-config] - https://gerrit.wikimedia.org/r/992423 (https://phabricator.wikimedia.org/T348298) [14:42:10] awight: apparently your change has a merge conflict :S [14:42:15] can I add a config patch to the window? [14:42:18] with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/991606 earlier in the window I guess [14:42:33] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992376|Restore support for matching 'LIKE' patterns/wildcards (T355478)]] (duration: 07m 50s) [14:42:37] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/992426 [14:42:42] T355478: The pattern is not working for mass delete (Nuke) - https://phabricator.wikimedia.org/T355478 [14:42:44] seems like a busy one... 
[14:42:57] yup [14:43:24] (CR) Marostegui: [C: +2] Revert "db1173 Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/992378 (owner: Marostegui) [14:43:29] (CR) Lucas Werkmeister (WMDE): uzwiki: revert temporary logo for the 20th anniversary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:43:47] Lucas_WMDE: on it... [14:43:50] ok [14:43:52] kostajh: what’s your change? [14:43:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55376 and previous config saved to /var/cache/conftool/dbconfig/20240123-144356-root.json [14:44:04] Lucas_WMDE: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/992423 [14:44:44] (PS8) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:45:31] I guess I can squeeze that in now [14:45:40] (PS2) Awight: Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch) [14:45:51] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/992423 (https://phabricator.wikimedia.org/T348298) (owner: Kosta Harlan) [14:45:51] Lucas_WMDE: rebased. 
[14:46:05] alright, I’ll do that after kostajh then [14:46:07] Lucas_WMDE: thank you [14:46:08] It's fine if my config patch doesn't fit [14:46:15] kostajh: please add it to the calendar too just for the record [14:46:19] Lucas_WMDE: will do [14:46:40] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Jhancock.wm) [14:47:12] Lucas_WMDE: done [14:47:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1173.eqiad.wmnet with OS bookworm [14:47:37] thanks [14:47:37] (Merged) jenkins-bot: ORES: Enable renamed revertrisklanguageagnostic model [mediawiki-config] - https://gerrit.wikimedia.org/r/992423 (https://phabricator.wikimedia.org/T348298) (owner: Kosta Harlan) [14:48:00] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:992423|ORES: Enable renamed revertrisklanguageagnostic model (T348298)]] [14:48:04] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [14:49:13] (PS9) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:49:28] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kharlan: Backport for [[gerrit:992423|ORES: Enable renamed revertrisklanguageagnostic model (T348298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:49:37] (CR) Anzx: uzwiki: revert temporary logo for the 20th anniversary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:49:50] kostajh: please test [14:49:59] Lucas_WMDE: ack [14:52:45] jouncebot: next [14:52:45] In 1 hour(s) and 7 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1600) [14:53:05] alright, I’ll 
just do my own change outside the window [14:53:16] Lucas_WMDE: lgtm. [14:53:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and kharlan: Continuing with sync [14:53:21] ok, thanks [14:53:28] thank you! [14:53:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T354336)', diff saved to https://phabricator.wikimedia.org/P55377 and previous config saved to /var/cache/conftool/dbconfig/20240123-145353-marostegui.json [14:53:58] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:54:01] (PS10) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [14:54:21] (CR) Btullis: [C: +2] Update public_mirrors.html with new mirror info. [puppet] - https://gerrit.wikimedia.org/r/991794 (https://phabricator.wikimedia.org/T354679) (owner: Xcollazo) [14:55:24] (CR) Joal: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/991794 (https://phabricator.wikimedia.org/T354679) (owner: Xcollazo) [14:56:51] (PS3) Awight: Allow Cite events for reference previews baseline stats [mediawiki-config] - https://gerrit.wikimedia.org/r/992411 (https://phabricator.wikimedia.org/T353798) (owner: WMDE-Fisch) [14:58:03] (CR) Lucas Werkmeister (WMDE): uzwiki: revert temporary logo for the 20th anniversary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: Anzx) [14:59:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55378 and previous config saved to /var/cache/conftool/dbconfig/20240123-145901-root.json [14:59:11] awight: do you have a few more minutes? 
[14:59:15] then I think we could still do your change [14:59:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:992423|ORES: Enable renamed revertrisklanguageagnostic model (T348298)]] (duration: 11m 20s) [14:59:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:27] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [15:00:54] alright, let’s close the window then [15:00:59] !log UTC afternoon backport+config window done [15:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:10] sorry awight and anzx, there was too much to get through everything [15:02:37] (CR) Vgutierrez: hiera: add acls for heavy ratelimiting abusing ip from list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: Fabfur) [15:02:47] (CR) Dzahn: [C: +1] [phabricator] Remove public task dump task timer [puppet] - https://gerrit.wikimedia.org/r/992416 (https://phabricator.wikimedia.org/T355502) (owner: EoghanGaffney) [15:02:57] (PS11) Anzx: uzwiki: revert temporary logo for the 20th anniversary [mediawiki-config] - https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) [15:03:08] Lucas_WMDE: I understand--thanks for deploying! 
[15:03:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] termbox(test): update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992387 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:03:59] (03CR) 10Anzx: uzwiki: revert temporary logo for the 20th anniversary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [15:04:07] (03Merged) 10jenkins-bot: termbox(test): update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992387 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:04:29] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:52] * Lucas_WMDE tries lucaswerkmeister-wmde@deploy2002:/srv/deployment-charts/helmfile.d/services/termbox$ helmfile -e staging -i apply --context 5 [15:05:54] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [15:06:01] ah, it logs that anyway, good ^^ [15:06:20] nothing except the image version in the diff, that’s good [15:06:36] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55379 and previous config saved to /var/cache/conftool/dbconfig/20240123-150659-ladsgroup.json [15:07:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:08:10] seems to be working on https://test.m.wikidata.org/wiki/Q233590 [15:08:16] so I guess I’ll roll that out to codfw and eqiad as well now [15:08:21] (and it should still only affect testwikidata iiuc) [15:08:26] !log 
lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:08:30] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:08:35] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:08:37] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:08:42] okay, there was no diff there ^^ [15:09:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "As far as I can tell, this is working fine on Test Wikidata (brand-new item https://test.m.wikidata.org/wiki/Q233590 still got an SSR term" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992387 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:10:42] (03PS1) 10Lucas Werkmeister (WMDE): termbox: update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992446 (https://phabricator.wikimedia.org/T331403) [15:11:10] (03PS2) 10Lucas Werkmeister (WMDE): termbox: update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992446 (https://phabricator.wikimedia.org/T331403) [15:13:01] (03CR) 10Anzx: uzwiki: revert temporary logo for the 20th anniversary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992371 (https://phabricator.wikimedia.org/T353723) (owner: 10Anzx) [15:13:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance [15:13:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance [15:14:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55380 and previous config saved to /var/cache/conftool/dbconfig/20240123-151406-root.json [15:14:35] can I bother someone for another 
+1 on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/992446? ^^ [15:14:39] just another version bump [15:15:03] maybe hashar? 🥺 [15:15:53] yeah [15:16:17] wait [15:16:24] I thought I had +1ed it already?! [15:16:32] you +1ed values-test.yaml [15:16:34] OH [15:16:36] this is the same for values.yaml [15:16:36] THAT IS FOR PROD [15:16:41] (03CR) 10Hashar: [C: 03+2] termbox: update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992446 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:16:47] thanks \o/ [15:16:51] \o/ [15:17:03] * Lucas_WMDE tempted to bash all-caps THAT IS FOR PROD [15:17:28] I DON'T SEE WHY SINCE TODAY IS INTERNATIONAL CAPS LOCK DAY. SO THAT IS PRETTY NORMAL. [15:17:40] (03Merged) 10jenkins-bot: termbox: update to 2024-01-22-163619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/992446 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:17:43] I DON’T BELIEVE YOU [15:17:55] I think one day I have added a MediaWiki language `en_ALLCAPS` [15:18:00] which well [15:18:10] did some strtoupper() to every messages [15:18:34] and I had a `fr_rtl` locally [15:18:42] isn’t en_ALLCAPS just en-us [15:18:44] scnr [15:18:48] ahah [15:18:49] touché [15:18:57] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10Marostegui) [15:19:38] anyway, it got pulled, so now I’ll helmfile it to staging/eqiad/codfw again ig [15:19:39] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2104 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/992428 (https://phabricator.wikimedia.org/T355682) [15:19:44] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/992429 (https://phabricator.wikimedia.org/T355682) [15:19:44] !log 
lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [15:19:57] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10Marostegui) [15:20:00] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:20:28] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:21:16] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:21:17] ok, this one is taking a bit longer, so I guess there are more pods to restart or something [15:21:25] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:22:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55381 and previous config saved to /var/cache/conftool/dbconfig/20240123-152206-ladsgroup.json [15:22:08] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:22:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 30% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/992198 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:22:38] hmm, I’m looking at new items and seeing no SSR termbox… [15:22:41] (03CR) 10Vgutierrez: [C: 03+1] "LGTM! 
thanks for submitting this one" [puppet] - 10https://gerrit.wikimedia.org/r/991785 (https://phabricator.wikimedia.org/T351069) (owner: 10Ssingh) [15:23:06] (03PS1) 10Marostegui: pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/992451 (https://phabricator.wikimedia.org/T355683) [15:23:22] jouncebot: nowandnext [15:23:22] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [15:23:23] In 0 hour(s) and 36 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1600) [15:23:29] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 30% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/992198 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:23:36] claime: I’m doing stuff [15:23:42] * Lucas_WMDE sees a bunch of ECONNREFUSED in kubectl logs termbox-production-7dfff557f8-4p9m9 termbox-production [15:23:52] might have to roll that back after all [15:23:53] Lucas_WMDE: ack, won't deploy [15:24:11] it’s not super pants-on-fire but it looks like I just rolled out a bad version update [15:24:13] though idk why yet [15:24:15] (it's just a replica bump though, it won't terminate any mw-* pods) [15:24:27] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 30% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/992198 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:24:51] hm, “connect ECONNREFUSED ::1:6500” [15:24:56] is that an IPv6? 
[15:25:03] looks like it [15:25:05] localhost [15:25:07] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/992451 (https://phabricator.wikimedia.org/T355683) (owner: 10Marostegui) [15:25:22] (6500 is apparently the mw-api-int-async port) [15:25:40] (and/or mwapi-async) [15:26:24] yes it is [15:26:37] yeah I’m seeing some “TermboxRemoteRenderer: Problem requesting from the remote server” in logstash too [15:26:40] I think I should just roll back [15:26:48] it’s not like the upgrade was urgent, just node 16 → 18 [15:26:52] and then let you bump the replicas :) [15:27:39] Do you want me to take a look? [15:28:32] (03PS1) 10Lucas Werkmeister (WMDE): Revert "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) [15:28:42] doesn’t have to be right now [15:28:45] (03CR) 10Hashar: [C: 03+1] Revert "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:28:54] I’ll file a task later [15:29:11] it must be a DNS issue [15:29:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55382 and previous config saved to /var/cache/conftool/dbconfig/20240123-152911-root.json [15:29:12] :) [15:29:17] :) [15:29:21] at least you caught it and it is easy to rollback! 
[15:29:23] * Lucas_WMDE saves a copy of the kubectl logs just in case [15:29:37] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:30:30] (03Merged) 10jenkins-bot: Revert "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:31:06] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [15:31:15] alright, doing the three helmfile commands again [15:31:26] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:31:32] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [15:31:40] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] [phabricator] Remove public task dump task timer [puppet] - 10https://gerrit.wikimedia.org/r/992416 (https://phabricator.wikimedia.org/T355502) (owner: 10EoghanGaffney) [15:31:45] AH AND FOR REFERENCE HTTPS://CAPSLOCKDAY.ORG/ [15:31:49] hashar: the big question is whether the rollback will work ;) [15:32:04] TODAY IS NEITHER OF THOSE DAYS [15:32:27] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [15:32:30] such a pity the old page is only available via archive.org (which by the way should get international funding of some sort) [15:32:37] Hmm seeing a bunch of 500s in termbox-tls-proxy logs [15:32:40] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [15:32:56] but that's inbound [15:33:08] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [15:33:09] probably caused by the connrefused on the local listener [15:33:21] I’m tentatively done, now 
checking whether it fixed the issue on new items or not [15:33:52] yeah I’m seeing some termboxen on new items [15:34:00] so the rollback worked, hooray for container images and all that [15:34:15] claime: feel free to bump the replicas [15:34:18] (assuming the 500s go away) [15:34:18] ty [15:34:22] should I CC you in the phab task? [15:34:45] yeah, maybe a.kosiaris as well, since j.ayme is ooo [15:34:52] ok, thanks! [15:35:15] well done [15:35:23] !log Bumping mw-web replicas - T355532 [15:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:31] T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 [15:35:35] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:35:36] I am off [15:35:55] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:36:05] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:36:15] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:36:30] !log Bumping mw-api-ext replicas - T355532 [15:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55383 and previous config saved to /var/cache/conftool/dbconfig/20240123-153712-ladsgroup.json [15:37:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:37:30] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:37:38] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:38:57] (03CR) 10Jforrester: "Aha, this reminded me of a commit I saw two months ago – c92ba296bfc2e656865e4ef8b8695ec4e5df3da8" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:39:12] !log trafficserver: move 30% of traffic to mw on k8s - T355532 [15:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:17] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 30% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:40:09] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye [15:41:30] jhathaway, Amir1, topranks, heads-up on mw-on-k8s traffic increase [15:41:42] 😍 [15:41:47] I'm letting puppet do its work at its own pace, not forcing runs [15:42:01] claime: good stuff great to see it rolling along :) [15:42:08] <3 [15:42:32] gl claime! [15:43:17] (03PS1) 10Jforrester: wikifunctions: Hard-code 'localhost' as IPv4 127.0.0.1 for Node 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992455 (https://phabricator.wikimedia.org/T355592) [15:44:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "⇒ T355685" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992452 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [15:44:15] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:44:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55384 and previous config saved to /var/cache/conftool/dbconfig/20240123-154416-root.json [15:44:21] mw-on-k8s hype 🥳 [15:44:49] Whoop whoop, etc. [15:45:30] Lucas_WMDE: As soon as you mentioned the ECONNREFUSED it reminded me of the fix Language made back in November. Thanks for finding it. 
[15:45:40] thanks for remembering it! [15:46:03] Yeah, my thinking was along those lines as well James_F, thanks for saving me the debugging [15:46:05] Maybe we should fix the general security rules so that other people aren't caught out by it. [15:46:09] * Lucas_WMDE stress-tests browser by loading up the node 18 changelog [15:46:15] Lucas_WMDE: :-D [15:46:41] yes, we should, please drop us a task and we'll have a look [15:46:42] meh, not seeing anything super-relevant in there [15:46:47] claime: `git log | grep -B10 ::` was helpful. :-D [15:46:53] :D [15:46:53] if it fixes the issue that’s good enough for me, I don’t need to know the node commit ^^ [15:47:26] (for all I know it might also be in the underlying base image, new musl version or whatever) [15:49:33] claime: Filed as T355686 [15:49:34] T355686: Adjust general (mesh?) security rules to allow IPv6 localhost (::) as well as IPv4 (127.0.0.1) - https://phabricator.wikimedia.org/T355686 [15:49:42] James_F: Thanks :) [15:50:22] OK, I'll just slip my one out now. 
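Note on the ECONNREFUSED discussion above: the log shows termbox trying `::1:6500` after the Node 16 → 18 image bump, and the workaround that came up was to hard-code `localhost` as IPv4 `127.0.0.1`. A likely explanation, which is an assumption and not confirmed anywhere in this log, is that newer Node versions no longer sort IPv4 results first when resolving `localhost`, while the local listener only accepts IPv4. The minimal Python sketch below illustrates only the workaround idea (pinning a `localhost` URL to `127.0.0.1`); the helper name and the example URL path are hypothetical, and this is not the actual deployment-charts change.

```python
from urllib.parse import urlsplit


def pin_localhost_to_ipv4(url: str) -> str:
    """Return url with a 'localhost' host replaced by 127.0.0.1.

    Hypothetical helper for illustration only: it sidesteps resolvers
    that return ::1 first when the listener is bound to IPv4 only.
    URLs with any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != "localhost":
        return url

    # Rebuild the network location, keeping userinfo and port if present.
    netloc = "127.0.0.1"
    if parts.port is not None:
        netloc += f":{parts.port}"
    userinfo, at, _ = parts.netloc.rpartition("@")
    if at:
        netloc = f"{userinfo}@{netloc}"

    return parts._replace(netloc=netloc).geturl()


if __name__ == "__main__":
    # Port 6500 is taken from the log; the path is an assumed example.
    print(pin_localhost_to_ipv4("http://localhost:6500/w/api.php"))
```

Pinning the address in configuration avoids depending on resolver ordering, at the cost of hiding the underlying mismatch; allowing IPv6 loopback on the listener side, as proposed in T355686, would address the cause instead.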
[15:50:25] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Hard-code 'localhost' as IPv4 127.0.0.1 for Node 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992455 (https://phabricator.wikimedia.org/T355592) (owner: 10Jforrester) [15:51:19] (03Merged) 10jenkins-bot: wikifunctions: Hard-code 'localhost' as IPv4 127.0.0.1 for Node 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992455 (https://phabricator.wikimedia.org/T355592) (owner: 10Jforrester) [15:52:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55385 and previous config saved to /var/cache/conftool/dbconfig/20240123-155219-ladsgroup.json [15:52:27] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:52:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:52:58] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:53:22] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:54:30] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:54:45] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:55:54] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:59:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55386 and previous config saved to /var/cache/conftool/dbconfig/20240123-155921-root.json [15:59:32] !log disable puppet on A:lvs to merge CR 991785 [15:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] eoghan, jelto, and arnoldokoth: Time to snap out of that daydream and deploy SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1600). [16:00:08] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:lvs: set monitoring enabled for IPIP-related services [puppet] - 10https://gerrit.wikimedia.org/r/991785 (https://phabricator.wikimedia.org/T351069) (owner: 10Ssingh) [16:10:39] !log enable puppet on A:lvs to merge CR 991785 and run agent on all nodes [16:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:42] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Jhancock.wm) [16:14:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55387 and previous config saved to /var/cache/conftool/dbconfig/20240123-161426-root.json [16:14:42] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad [16:14:42] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad [16:23:12] (03PS1) 10Ayounsi: sre.network.tls: handle one more usecase [cookbooks] - 10https://gerrit.wikimedia.org/r/992457 [16:25:33] (03CR) 10Ssingh: "Some minor nits in between but looks good otherwise! Thanks for cleaning this up." [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:27:09] (03Abandoned) 10Ssingh: depool codfw: do not merge! 
emergency depool patch [dns] - 10https://gerrit.wikimedia.org/r/989534 (https://phabricator.wikimedia.org/T352758) (owner: 10Ssingh) [16:29:22] (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: handle one more usecase [cookbooks] - 10https://gerrit.wikimedia.org/r/992457 (owner: 10Ayounsi) [16:34:02] (03Merged) 10jenkins-bot: sre.network.tls: handle one more usecase [cookbooks] - 10https://gerrit.wikimedia.org/r/992457 (owner: 10Ayounsi) [16:36:42] (03PS30) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [16:36:45] (03PS1) 10Jcrespo: mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) [16:38:11] (03PS2) 10Jcrespo: mediabackups: Setup backup1011, backup2011 as new media storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) [16:39:03] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992459 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [16:39:36] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm [16:39:47] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [16:41:51] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Gehel) [16:45:26] (03CR) 10Fabfur: [V: 03+1] hiera: add acls for heavy ratelimiting abusing ip from list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [16:49:32] !log 
ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS bookworm [16:53:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:53:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:54:00] (03PS1) 10Ryan Kemper: webserver-misc-apps: new cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/992460 (https://phabricator.wikimedia.org/T355593) [16:54:06] (03PS1) 10Superpes15: [knwiki] Removing the temporary logo (already reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992461 (https://phabricator.wikimedia.org/T338136) [16:54:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:54:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:54:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1135 (T354336)', diff saved to https://phabricator.wikimedia.org/P55388 and previous config saved to /var/cache/conftool/dbconfig/20240123-165433-marostegui.json [16:54:37] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:56:00] (03CR) 10Dzahn: [C: 03+2] webserver-misc-apps: new cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/992460 (https://phabricator.wikimedia.org/T355593) (owner: 10Ryan Kemper) [16:56:24] (03CR) 10Dzahn: [V: 03+2 C: 03+2] webserver-misc-apps: new cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/992460 (https://phabricator.wikimedia.org/T355593) (owner: 10Ryan Kemper) [16:56:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T354336)', diff saved to https://phabricator.wikimedia.org/P55389 and previous config saved to 
/var/cache/conftool/dbconfig/20240123-165656-marostegui.json [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:20] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10klausman) This has been solved for now, though needs better docs and possibly simplification, as an extra step is needed: `lang=shell $ kube_env experime... [17:00:57] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs graph-split: enable microsite [puppet] - 10https://gerrit.wikimedia.org/r/992115 (https://phabricator.wikimedia.org/T354658) (owner: 10Ryan Kemper) [17:01:07] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:19] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:27] PROBLEM - Host aqs2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:17] (03PS1) 10Superpes15: [itwiki] Add the 'abusefilter-bypass-blocked-external-domains' right to botadmins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992466 (https://phabricator.wikimedia.org/T355694) [17:07:39] RECOVERY - Host aqs2003 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [17:08:55] PROBLEM - cassandra-b SSL 10.192.0.219:7000 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:10:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:10:19] PROBLEM - cassandra-a SSL 10.192.0.218:7000 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:10:53] PROBLEM - cassandra-b CQL 10.192.0.219:9042 on aqs2003 is CRITICAL: connect to address 10.192.0.219 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:11:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:12:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P55390 and previous config saved to /var/cache/conftool/dbconfig/20240123-171202-marostegui.json [17:12:21] RECOVERY - cassandra-b SSL 10.192.0.219:7000 on aqs2003 is OK: SSL OK - Certificate aqs2003-b valid until 2024-06-07 14:43:41 +0000 (expires in 135 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:12:39] RECOVERY - cassandra-a SSL 10.192.0.218:7000 on aqs2003 is OK: SSL OK - Certificate aqs2003-a valid until 2024-06-07 14:43:39 +0000 (expires in 135 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:12:53] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:13:15] RECOVERY - cassandra-b CQL 10.192.0.219:9042 on aqs2003 is OK: TCP OK - 0.030 second response time on 10.192.0.219 port 9042 https://phabricator.wikimedia.org/T93886 [17:14:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:36] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10Sannita) @MatthewVernon the affected files are https://commons.wikimedia.org/w/index.php?title=File:1an7ghzwafv8.s9fwlh.12187057.pdf, https://commons.wikimedia.org/w/index.php?title=Fil... [17:18:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:20:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.936 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:21:36] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Jhancock.wm) [17:21:49] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Jhancock.wm) Moved db2158 port because the port was already taken up. Used first available. Same for the three servers in A6. Moved up to the next available port. These cabling conve... 
[17:25:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] [knwiki] Removing the temporary logo (already reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992461 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [17:27:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P55391 and previous config saved to /var/cache/conftool/dbconfig/20240123-172709-marostegui.json [17:29:10] (03PS1) 10Superpes15: [enwiki] and [enwikibooks] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992471 (https://phabricator.wikimedia.org/T355695) [17:36:58] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10MatthewVernon) Right, those are all too far ago to still in the recent logs. Today's, however, I can find, and swift has done what was asked of it - that file was uploaded and subsequen... 
[17:42:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T354336)', diff saved to https://phabricator.wikimedia.org/P55392 and previous config saved to /var/cache/conftool/dbconfig/20240123-174215-marostegui.json [17:42:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:42:21] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:42:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:42:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:43:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:43:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [17:43:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [17:43:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55393 and previous config saved to /var/cache/conftool/dbconfig/20240123-174339-marostegui.json [17:46:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55394 and previous config saved to /var/cache/conftool/dbconfig/20240123-174600-marostegui.json [17:47:15] 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700 (10RobH) [17:47:39] 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - 
https://phabricator.wikimedia.org/T355700 (10RobH) [17:56:11] (03PS1) 10Cwhite: site: predefine logging-hd100[123] insetup role [puppet] - 10https://gerrit.wikimedia.org/r/992434 (https://phabricator.wikimedia.org/T354226) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T1800) [18:01:01] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10MikhasikRV) @MatthewVernon I just used Upload Wizard to upload the file. I did not see neither attempt to delete the file after upload. After progressbar reached the end, it just spit o... [18:01:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P55395 and previous config saved to /var/cache/conftool/dbconfig/20240123-180107-marostegui.json [18:06:35] (03PS5) 10Eevans: cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - 10https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469) [18:06:37] (03PS5) 10Eevans: cassandra-dev2001: canary dev version of Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469) [18:09:58] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:16:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P55396 and previous config saved to /var/cache/conftool/dbconfig/20240123-181613-marostegui.json [18:16:41] (03CR) 10Eevans: [C: 03+2] cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - 10https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:18:36] (03CR) 10Eevans: [C: 03+2] cassandra-dev2001: canary dev version of Cassandra 
[puppet] - 10https://gerrit.wikimedia.org/r/992261 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:20:51] (03CR) 10Ssingh: [C: 03+1] "herron, let us know if you want to merge this? (Just going through the backlog)" [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:22:21] (03CR) 10Ssingh: [C: 03+1] "(Any bug number?)" [puppet] - 10https://gerrit.wikimedia.org/r/916424 (owner: 10Majavah) [18:25:20] (03CR) 10Ssingh: [C: 03+1] fifo-log-demux: Update project homepage [puppet] - 10https://gerrit.wikimedia.org/r/973887 (https://phabricator.wikimedia.org/T347623) (owner: 10BCornwall) [18:31:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354336)', diff saved to https://phabricator.wikimedia.org/P55397 and previous config saved to /var/cache/conftool/dbconfig/20240123-183120-marostegui.json [18:31:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:31:34] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:31:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:31:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T354336)', diff saved to https://phabricator.wikimedia.org/P55398 and previous config saved to /var/cache/conftool/dbconfig/20240123-183141-marostegui.json [18:34:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T354336)', diff saved to https://phabricator.wikimedia.org/P55399 and previous config saved to /var/cache/conftool/dbconfig/20240123-183403-marostegui.json [18:35:27] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:35:57] !log pt1979@cumin2002 END (FAIL) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:36:09] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:37:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:37:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:42:52] (03PS1) 10Eevans: cassandra-dev: upgrade to 'dev' (cassandra_4.1.1-wmf1) [puppet] - 10https://gerrit.wikimedia.org/r/992480 (https://phabricator.wikimedia.org/T352469) [18:43:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [18:43:40] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:45:22] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992480 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:47:55] (03CR) 10Eevans: [C: 03+2] cassandra-dev: upgrade to 'dev' (cassandra_4.1.1-wmf1) [puppet] - 10https://gerrit.wikimedia.org/r/992480 (https://phabricator.wikimedia.org/T352469) (owner: 10Eevans) [18:48:24] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (10VRiley-WMF) [18:48:36] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (10VRiley-WMF) This has been completed [18:48:53] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): 
decommission an-master1002.eqiad.wmnet - https://phabricator.wikimedia.org/T355654 (10VRiley-WMF) 05Open→03Resolved [18:49:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P55400 and previous config saved to /var/cache/conftool/dbconfig/20240123-184909-marostegui.json [18:49:46] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [18:51:01] (03PS48) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [18:51:03] (03PS4) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [18:51:05] (03PS1) 10AOkoth: admin: add ccuifo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [18:51:24] (03PS2) 10AOkoth: admin: add ccuifo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [18:53:12] (03CR) 10Ebernhardson: [C: 03+2] Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197) (owner: 10Peter Fischer) [18:53:26] (03CR) 10Dzahn: "Are you sure the user name is ccuifo? I can't see it in LDAP with that uid." [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [18:54:08] 10SRE-swift-storage, 10UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (10C.Suthorn) I do get the same error with a webm file. I was able to upload the first 1GB of the file and do a revision upload of the first 2GB, but once I try to upload either the comple... 
[18:54:14] (03Merged) 10jenkins-bot: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197) (owner: 10Peter Fischer) [18:54:39] (03CR) 10Herron: [C: 03+1] thanos: add labels to thanos-rule blocks [puppet] - 10https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [18:55:31] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (10VRiley-WMF) This has been completed [18:55:39] (03CR) 10Dzahn: [C: 04-1] "looks likeit is "cciufo" (not ccuifo)" [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [18:55:52] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (10VRiley-WMF) [18:56:04] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): decommission an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T355653 (10VRiley-WMF) 05Open→03Resolved [18:58:43] (03PS3) 10AOkoth: admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [18:58:58] (03PS4) 10AOkoth: admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [19:02:13] (03PS5) 10AOkoth: admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [19:02:26] (03PS6) 10AOkoth: admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [19:04:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to 
https://phabricator.wikimedia.org/P55401 and previous config saved to /var/cache/conftool/dbconfig/20240123-190416-marostegui.json [19:04:22] (03CR) 10Dzahn: admin: add cciufo to LDAP users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [19:07:58] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T355345 (10VRiley-WMF) a:03VRiley-WMF [19:09:13] (03CR) 10AOkoth: admin: add cciufo to LDAP users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [19:12:14] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [19:19:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T354336)', diff saved to https://phabricator.wikimedia.org/P55402 and previous config saved to /var/cache/conftool/dbconfig/20240123-191922-marostegui.json [19:19:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:19:29] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:19:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:19:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T354336)', diff saved to https://phabricator.wikimedia.org/P55403 and previous config saved to /var/cache/conftool/dbconfig/20240123-191945-marostegui.json [19:22:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T354336)', diff saved to https://phabricator.wikimedia.org/P55404 and previous config saved to /var/cache/conftool/dbconfig/20240123-192207-marostegui.json [19:23:44] (03CR) 10Majavah: [C: 03+2] 
P:lvs::configuration: replace labsproject with wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/916424 (owner: 10Majavah) [19:25:23] (03CR) 10Majavah: [C: 03+2] "just general cleanup, I don't think we've been tracking this in phab" [puppet] - 10https://gerrit.wikimedia.org/r/916424 (owner: 10Majavah) [19:30:11] (03CR) 10AOkoth: [C: 03+2] admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) (owner: 10AOkoth) [19:30:21] (03PS7) 10AOkoth: admin: add cciufo to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/992482 (https://phabricator.wikimedia.org/T355595) [19:37:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P55405 and previous config saved to /var/cache/conftool/dbconfig/20240123-193713-marostegui.json [19:37:25] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10Arnoldokoth) 05Open→03In progress [19:38:22] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10Arnoldokoth) @CCiufo-WMF This should be good to go now. [19:44:37] (03PS1) 10Majavah: sre.hosts.decommission: Add flag to disable removing mgmt DNS name [cookbooks] - 10https://gerrit.wikimedia.org/r/992490 [19:45:16] !log phab1004 - /srv/phab/phabricator/bin/mail volume [19:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:27] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs[2024-2025].codfw.wmnet with reason: testing data xfter cookbook [19:45:35] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T355345 (10VRiley-WMF) Rebalanced power load. 
[19:45:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs[2024-2025].codfw.wmnet with reason: testing data xfter cookbook [19:45:48] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T355345 (10VRiley-WMF) 05Open→03Resolved [19:49:53] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards [19:49:58] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [19:52:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P55406 and previous config saved to /var/cache/conftool/dbconfig/20240123-195220-marostegui.json [19:56:03] (03PS20) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:57:36] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards [19:57:41] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [19:57:54] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling both afterwards [19:59:30] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10CCiufo-WMF) >>! In T355595#9482526, @Arnoldokoth wrote: > @CCiufo-WMF This should be good to go now. Thanks! Does it take a while for the access rights to propagate? I don't seem... 
[20:00:04] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [20:00:06] (03CR) 10Thcipriani: "One question inline." [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:01:24] (03PS6) 10Paladox: gerrit: Fix linking to hash url [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) [20:01:27] (03CR) 10Paladox: gerrit: Fix linking to hash url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:03:17] (03CR) 10Thcipriani: [C: 03+1] "LGTM now, thanks Paladox!" [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:06:09] (03CR) 10Dzahn: [C: 03+2] gerrit: Fix linking to hash url [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:07:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T354336)', diff saved to https://phabricator.wikimedia.org/P55407 and previous config saved to /var/cache/conftool/dbconfig/20240123-200726-marostegui.json [20:07:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:07:37] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:07:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:07:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:08:03] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:08:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T354336)', diff saved to https://phabricator.wikimedia.org/P55408 and previous config saved to /var/cache/conftool/dbconfig/20240123-200809-marostegui.json [20:08:56] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:09:00] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [20:09:27] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10Arnoldokoth) @CCiufo-WMF Hmm. Which username are you using? Also you can check out >> https://wikitech.wikimedia.org/wiki/Analytics/Data_access#LDAP_access [20:09:37] (03CR) 10Hashar: gerrit: Fix linking to hash url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:10:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T354336)', diff saved to https://phabricator.wikimedia.org/P55409 and previous config saved to /var/cache/conftool/dbconfig/20240123-201030-marostegui.json [20:11:45] (03CR) 10Paladox: gerrit: Fix linking to hash url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [20:12:08] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:22:14] (03PS3) 10Hashar: PreAuthenticationProvider: Allow blocking account creation based on IP reputation 
[extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [20:23:12] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T347624, test data xfer) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:23:25] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [20:23:39] RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [20:25:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P55410 and previous config saved to /var/cache/conftool/dbconfig/20240123-202536-marostegui.json [20:40:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P55411 and previous config saved to /var/cache/conftool/dbconfig/20240123-204043-marostegui.json [20:41:45] (03PS1) 10Kosta Harlan: revertrisk: Fix i18n message reference [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992506 (https://phabricator.wikimedia.org/T348298) [20:42:02] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for CCiufo - https://phabricator.wikimedia.org/T355595 (10CCiufo-WMF) I tried using both `CCiufo` and `cciufo`, but I get the following error message: {F41711400} [20:48:56] (03PS1) 10Kosta Harlan: revertrisk: Fix i18n messages [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992507 (https://phabricator.wikimedia.org/T348298) [20:55:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P55412 and previous config saved to /var/cache/conftool/dbconfig/20240123-205549-marostegui.json [20:55:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [20:55:58] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:56:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [20:56:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T354336)', diff saved to https://phabricator.wikimedia.org/P55413 and previous config saved to /var/cache/conftool/dbconfig/20240123-205611-marostegui.json [20:58:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T354336)', diff saved to https://phabricator.wikimedia.org/P55414 and previous config saved to /var/cache/conftool/dbconfig/20240123-205832-marostegui.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T2100). [21:00:05] Superpes and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:31] Hi! I’ll be here in some minutes! In the meantime you can process the task that not requires testing (only the itwiki patch should be tested) [21:00:54] Or you can start with kostajh if you want :) [21:01:01] I'm here [21:01:08] I'll start with mine [21:01:19] Oh lol [21:01:37] Wonderful! 
Will wait for your ping then :) [21:01:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992506 (https://phabricator.wikimedia.org/T348298) (owner: 10Kosta Harlan) [21:02:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992507 (https://phabricator.wikimedia.org/T348298) (owner: 10Kosta Harlan) [21:04:37] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:41] (03Merged) 10jenkins-bot: revertrisk: Fix i18n message reference [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992506 (https://phabricator.wikimedia.org/T348298) (owner: 10Kosta Harlan) [21:04:49] (03Merged) 10jenkins-bot: revertrisk: Fix i18n messages [extensions/ORES] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992507 (https://phabricator.wikimedia.org/T348298) (owner: 10Kosta Harlan) [21:05:12] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:992506|revertrisk: Fix i18n message reference (T348298)]], [[gerrit:992507|revertrisk: Fix i18n messages (T348298)]] [21:05:22] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [21:06:09] If any of the other deployers are around, another glance at the proposed config patches would be welcome. [21:13:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P55415 and previous config saved to /var/cache/conftool/dbconfig/20240123-211338-marostegui.json [21:16:22] Superpes: re T355694, is someone (besides you) asking for that? do you have a specific use case in mind? 
[21:16:22] T355694: Adding the 'abusefilter-bypass-blocked-external-domains' flag to botadmin usergroup - https://phabricator.wikimedia.org/T355694 [21:18:10] kostajh Not a specific one but it could surely be used (since we use botadmins a lot)! Also other projects did this :) [21:18:13] I see we've done the same thing for eswiki in T342484 [21:18:13] T342484: Add botadmin group on eswiki - https://phabricator.wikimedia.org/T342484 [21:18:22] but that's the only other instance I find [21:18:59] Well, maybe not every project uses botadmins :) [21:19:45] Instead, it seems reasonable to give this flag to every botadmin group (it makes no sense if bots have it and botadmins don't) [21:21:43] Superpes: but does bot use it? I am missing that in the config. [21:22:00] (sorry, I am not very familiar with that permission or bot/botadmin permissions.) [21:22:15] Yep bot has this by default [21:22:18] Will find you the task [21:22:59] https://phabricator.wikimedia.org/rEABF0acfe0525171bff2d7a90dc02b41ac8bc4bb40ab [21:23:39] I see [21:23:57] thx [21:24:04] nearly done syncing the wmf.15 patches [21:24:17] I hope 😅 [21:25:09] Lol no problem! 
I'm here, just studying, so I'm in no hurry :D [21:25:59] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:992506|revertrisk: Fix i18n message reference (T348298)]], [[gerrit:992507|revertrisk: Fix i18n messages (T348298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:05] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [21:26:44] !log kharlan@deploy2002 kharlan: Continuing with sync [21:28:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P55416 and previous config saved to /var/cache/conftool/dbconfig/20240123-212845-marostegui.json [21:36:04] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:992506|revertrisk: Fix i18n message reference (T348298)]], [[gerrit:992507|revertrisk: Fix i18n messages (T348298)]] (duration: 30m 51s) [21:36:10] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [21:36:11] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 2373 MB (2% inode=83%): /tmp 2373 MB (2% inode=83%): /var/tmp 2373 MB (2% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [21:36:27] Superpes: ok, on to your patches [21:36:52] Thanks :) Only the itwiki patch requires testing [21:36:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992461 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [21:37:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992466 (https://phabricator.wikimedia.org/T355694) (owner: 10Superpes15) [21:37:02] (03CR) 10TrainBranchBot: 
[C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992471 (https://phabricator.wikimedia.org/T355695) (owner: 10Superpes15) [21:38:30] Also maybe you should run mwscript resetAuthenticationThrottle.php after deploying (see https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold ) [21:38:47] yea, was just looking at that [21:38:48] will do [21:38:53] Thanks :D [21:43:01] (03Merged) 10jenkins-bot: [knwiki] Removing the temporary logo (already reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992461 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [21:43:05] (03Merged) 10jenkins-bot: [itwiki] Add the 'abusefilter-bypass-blocked-external-domains' right to botadmins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992466 (https://phabricator.wikimedia.org/T355694) (owner: 10Superpes15) [21:43:10] (03Merged) 10jenkins-bot: [enwiki] and [enwikibooks] Throttle exemption for event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992471 (https://phabricator.wikimedia.org/T355695) (owner: 10Superpes15) [21:43:36] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:992461|[knwiki] Removing the temporary logo (already reverted) (T338136)]], [[gerrit:992466|[itwiki] Add the 'abusefilter-bypass-blocked-external-domains' right to botadmins (T355694)]], [[gerrit:992471|[enwiki] and [enwikibooks] Throttle exemption for event (T355695)]] [21:43:44] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136 [21:43:44] T355694: Adding the 'abusefilter-bypass-blocked-external-domains' flag to botadmin usergroup - https://phabricator.wikimedia.org/T355694 [21:43:45] T355695: Requesting temporary lift of IP cap for event on 2024-01-26 - https://phabricator.wikimedia.org/T355695 [21:43:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P55417 and previous config saved to /var/cache/conftool/dbconfig/20240123-214351-marostegui.json [21:43:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [21:43:58] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [21:44:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance [21:44:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T354336)', diff saved to https://phabricator.wikimedia.org/P55418 and previous config saved to /var/cache/conftool/dbconfig/20240123-214413-marostegui.json [21:45:08] !log kharlan@deploy2002 superpes and kharlan: Backport for [[gerrit:992461|[knwiki] Removing the temporary logo (already reverted) (T338136)]], [[gerrit:992466|[itwiki] Add the 'abusefilter-bypass-blocked-external-domains' right to botadmins (T355694)]], [[gerrit:992471|[enwiki] and [enwikibooks] Throttle exemption for event (T355695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:45:21] Superpes: ok, please test [21:45:33] Yep looking [21:46:01] Ok it's fine thanks :D kostajh [21:46:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T354336)', diff saved to https://phabricator.wikimedia.org/P55419 and previous config saved to /var/cache/conftool/dbconfig/20240123-214633-marostegui.json [21:46:52] ok [21:48:29] Superpes: I see a parser error for AbuseFilter in the logs, related to your test on itwiki [21:48:36] (03CR) 10DannyS712: [C: 03+1] "LGTM but no deployment rights" [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992367 (owner: 10Kosta Harlan) [21:48:37] not sure if that is relevant, though. 
[21:48:48] Uhm will take a look
[21:52:30] Ok I'll move forward with the sync
[21:53:15] !log kharlan@deploy2002 superpes and kharlan: Continuing with sync
[21:53:48] Thanks :)
[21:59:09] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:992461|[knwiki] Removing the temporary logo (already reverted) (T338136)]], [[gerrit:992466|[itwiki] Add the 'abusefilter-bypass-blocked-external-domains' right to botadmins (T355694)]], [[gerrit:992471|[enwiki] and [enwikibooks] Throttle exemption for event (T355695)]] (duration: 15m 33s)
[21:59:17] T338136: Requesting temporary logo change for kn.wikipedia.org - https://phabricator.wikimedia.org/T338136
[21:59:17] T355694: Adding the 'abusefilter-bypass-blocked-external-domains' flag to botadmin usergroup - https://phabricator.wikimedia.org/T355694
[21:59:18] T355695: Requesting temporary lift of IP cap for event on 2024-01-26 - https://phabricator.wikimedia.org/T355695
[22:00:13] (PS1) Bking: cloudelastic: promote new hosts to master-eligible [puppet] - https://gerrit.wikimedia.org/r/992538 (https://phabricator.wikimedia.org/T351354)
[22:00:33] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992538 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[22:01:04] !log T355695 running mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 62.232.9.14
[22:01:08] SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Wilfredor) I think the simplest way to correct this error is to lower the maximum upload limit to 1 GB for validation.
[22:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P55420 and previous config saved to /var/cache/conftool/dbconfig/20240123-220140-marostegui.json
[22:01:55] !log T355695 running mwscript resetAuthenticationThrottle.php --wiki=enwiki --signup --ip 195.70.81.86
[22:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:20] !log T355695 running mwscript resetAuthenticationThrottle.php --wiki=enwikibooks --signup --ip 62.232.9.14
[22:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:40] !log T355695 running mwscript resetAuthenticationThrottle.php --wiki=enwikibooks --signup --ip 195.70.81.86
[22:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:48] Wonderful! :)
[22:02:51] Superpes: all done
[22:03:02] Thanks for your time kostajh :3
[22:03:10] thank you, Superpes. see you next time!
[22:03:24] !log UTC late deploys done
[22:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:44] (PS1) Eevans: restbase1019: canary 'dev' version (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992539 (https://phabricator.wikimedia.org/T352469)
[22:07:32] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992539 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[22:10:43] (CR) Ebernhardson: [C: +1] "seems reasonable. Will want to restart the new hosts before the old hosts, to make sure there are always enough master capable instances."
[puppet] - https://gerrit.wikimedia.org/r/992538 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[22:12:54] (CR) Eevans: [C: +2] restbase1019: canary 'dev' version (4.1.1-wmf1) [puppet] - https://gerrit.wikimedia.org/r/992539 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[22:16:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P55421 and previous config saved to /var/cache/conftool/dbconfig/20240123-221646-marostegui.json
[22:17:33] (PS1) Clare Ming: Update Android Metrics Platform stream configs [mediawiki-config] - https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360)
[22:21:47] (CR) Clare Ming: "If someone wouldn't mind +1'ing, I'll take care of deploying the config change once I push a new lib release and pending merge of incoming" [mediawiki-config] - https://gerrit.wikimedia.org/r/992541 (https://phabricator.wikimedia.org/T355360) (owner: Clare Ming)
[22:31:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T354336)', diff saved to https://phabricator.wikimedia.org/P55422 and previous config saved to /var/cache/conftool/dbconfig/20240123-223153-marostegui.json
[22:31:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[22:32:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[22:32:12] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[22:32:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T354336)', diff saved to https://phabricator.wikimedia.org/P55423 and previous config saved to /var/cache/conftool/dbconfig/20240123-223215-marostegui.json
[22:34:40] !log marostegui@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db1218 (T354336)', diff saved to https://phabricator.wikimedia.org/P55424 and previous config saved to /var/cache/conftool/dbconfig/20240123-223439-marostegui.json
[22:37:54] (CR) Andrea Denisse: "Hi Moritz, I read this on the Wikitech page of reserved UIDs: https://wikitech.wikimedia.org/wiki/UID" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[22:49:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P55425 and previous config saved to /var/cache/conftool/dbconfig/20240123-224946-marostegui.json
[22:51:25] (PS1) Andrew Bogott: nova policy: add awareness of 'unmanaged' role [puppet] - https://gerrit.wikimedia.org/r/992543 (https://phabricator.wikimedia.org/T326818)
[22:54:12] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/992415 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[22:54:57] (CR) Cwhite: [C: +2] site: predefine logging-hd100[123] insetup role [puppet] - https://gerrit.wikimedia.org/r/992434 (https://phabricator.wikimedia.org/T354226) (owner: Cwhite)
[23:04:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P55426 and previous config saved to /var/cache/conftool/dbconfig/20240123-230453-marostegui.json
[23:20:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T354336)', diff saved to https://phabricator.wikimedia.org/P55427 and previous config saved to /var/cache/conftool/dbconfig/20240123-231959-marostegui.json
[23:20:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[23:20:12] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log -
https://phabricator.wikimedia.org/T354336
[23:20:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[23:20:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T354336)', diff saved to https://phabricator.wikimedia.org/P55428 and previous config saved to /var/cache/conftool/dbconfig/20240123-232021-marostegui.json
[23:22:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T354336)', diff saved to https://phabricator.wikimedia.org/P55429 and previous config saved to /var/cache/conftool/dbconfig/20240123-232242-marostegui.json
[23:37:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P55430 and previous config saved to /var/cache/conftool/dbconfig/20240123-233749-marostegui.json
[23:47:52] (PS1) Bking: cloudelastic: Configure post-migration TLS [puppet] - https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617)
[23:48:19] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[23:48:56] (PS2) Bking: cloudelastic: Configure post-migration TLS [puppet] - https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617)
[23:49:59] SRE, Infrastructure-Foundations, Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (cmooney) I've been looking into these settings a little bit. The man for //ipfrag_high_thresh// states: ` Maximum memory used to reassemble IP fragments. When ipfrag_high...
[23:52:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P55432 and previous config saved to /var/cache/conftool/dbconfig/20240123-235255-marostegui.json
[23:53:16] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992547 (https://phabricator.wikimedia.org/T355617) (owner: Bking)