[00:20:51] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-04-25 00:00:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:33:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [00:39:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811 [00:39:23] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811 (owner: 10TrainBranchBot) [00:41:47] (03PS1) 10RLazarus: thanos: Migrate from 100-scale to unit-scale SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615) [00:42:10] (03PS1) 10RLazarus: Migrate from 100-scale to unit-scale SLO recording rules [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914946 (https://phabricator.wikimedia.org/T289615) [00:51:25] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2023-05-02 15:17:24 (4378 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:55:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811 (owner: 10TrainBranchBot) [02:07:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [02:22:54] (JobUnavailable) 
resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [03:36:29] (03PS1) 10Jdrewniak: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) [03:36:31] (03PS1) 10Jdrewniak: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) [03:36:33] (03PS1) 10Jdrewniak: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) [04:10:55] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:12:23] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:29:52] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 50 hosts with reason: Rolling reboot of eqiad for T335835 [04:30:26] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 50 hosts with reason: Rolling reboot of eqiad for T335835 [04:33:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [04:36:15] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835 [04:36:58] !log [Elastic] Beginning rolling reboot of eqiad elastic, 3 nodes at a time, `ryankemper@cumin1001` tmux session `reboot_eqiad` [04:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:25] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835 [04:38:22] !log [Elastic] Reboot operation failed w/ (likely transient) read timeouts, will try again in 10 mins [04:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:51] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling reboot T335835 [04:39:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling reboot T335835 [04:43:14] (03PS1) 10Ryan Kemper: elastic: 
remove redundant usage [cookbooks] - 10https://gerrit.wikimedia.org/r/915092 [04:45:57] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T335835 [04:47:29] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:38] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Rolling reboot for T335835 [04:47:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Rolling reboot for T335835 [04:49:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:15] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot [04:51:26] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835 [04:52:29] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [04:53:57] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 182, active_shards: 364, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [04:53:57] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [04:54:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:54:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T335835 [05:00:39] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:15] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:07] PROBLEM - Check systemd state on elastic1091 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:01] PROBLEM - Check systemd state on elastic1096 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:39] PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:31] PROBLEM - Check systemd state on elastic1099 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:22:47] Unfortunately these `Check systemd state` alerts don't seem to be suppressed by the icinga downtime. Sorry for the noise [05:25:49] RECOVERY - Check systemd state on elastic1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:53] RECOVERY - Check systemd state on elastic1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:27:37] PROBLEM - Check systemd state on elastic1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:13] RECOVERY - Check systemd state on elastic1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:17] RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[05:38:39] PROBLEM - Check systemd state on elastic1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:13] RECOVERY - Check systemd state on elastic1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:41:09] (03PS1) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 [05:44:33] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:41] PROBLEM - Check systemd state on elastic1084 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:49:39] PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:13] RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [05:51:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm [05:53:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5003.wikimedia.org [05:54:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835 [05:56:31] RECOVERY - Check systemd state on elastic1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host bast5003.wikimedia.org [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600). 
[06:01:53] !log slyngshede@cumin1001 START - Cookbook sre.hosts.decommission for hosts test-reimage2001.codfw.wmnet [06:05:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [06:05:52] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox [06:07:00] (03PS1) 10Slyngshede: site.pp: decommision test-reimage2001 [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835) [06:07:57] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test-reimage2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1001" [06:08:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [06:10:37] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test-reimage2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1001" [06:10:37] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:10:38] !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts test-reimage2001.codfw.wmnet [06:15:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [06:18:00] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [06:22:06] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.07 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [06:22:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835) (owner: 10Slyngshede) [06:23:11] (03CR) 
10Slyngshede: [C: 03+2] site.pp: decommision test-reimage2001 [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835) (owner: 10Slyngshede) [06:24:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4004.wikimedia.org [06:25:30] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [06:26:00] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) [06:26:30] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915151 (https://phabricator.wikimedia.org/T334722) [06:27:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [06:27:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [06:27:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm completed: - sretest1002 (**PAS... 
[06:27:51] (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915151 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui) [06:30:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4004.wikimedia.org [06:30:38] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.3 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [06:31:03] (03PS1) 10KartikMistry: Update MinT to 2023-05-04-054118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915215 [06:31:47] (03CR) 10Muehlenhoff: Django 3.2 support (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede) [06:33:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS bookworm [06:33:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast2003.wikimedia.org with OS bookworm [06:35:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui) [06:36:31] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.82 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [06:40:57] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-04-054118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915215 (owner: 10KartikMistry) [06:43:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [06:43:02] (03Merged) 10jenkins-bot: Update MinT to 2023-05-04-054118-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/915215 (owner: 10KartikMistry) [06:44:12] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:44:22] (03CR) 10Filippo Giunchedi: [C: 03+1] kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [06:44:26] kart_: can I deploy MW? [06:46:26] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:46:28] (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915357 [06:46:54] (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915357 (owner: 10Marostegui) [06:46:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [06:48:05] marostegui: Yes yes. This was quick staging one. [06:48:10] cool! thanks! [06:48:18] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui) [06:48:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [06:49:01] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui) [06:49:54] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]] [06:49:57] T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 [06:51:16] RECOVERY - Check systemd state on krb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:51:29] !log marostegui@deploy1002 marostegui: Backport 
for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [06:52:11] !log Promote pc2014 as pc1 master codfw dbmaint - T334722 [06:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [06:54:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [06:54:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [06:54:46] Good morning apergos et al: do we have trainees for the morning window? I plan to deploy a bunch of stuff (I'll put them to calendar ASAP), so that's why I'm asking. [06:55:10] we don't have any trainees signed up, no [06:55:29] your patches would be the only ones on the calendar [06:55:45] Ack, thanks. I'll take over the window then. 
[06:56:15] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [06:56:21] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [06:56:25] (03CR) 10Urbanecm: [C: 03+2] EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914302 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [06:56:31] (03CR) 10Urbanecm: [C: 03+2] ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914303 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [06:56:37] (03CR) 10Urbanecm: [C: 03+2] EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914304 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [06:56:37] ok, enjoy! 
[06:56:43] (03CR) 10Urbanecm: [C: 03+2] ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [06:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [06:57:17] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]] (duration: 07m 23s) [06:57:21] T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 [06:57:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage [06:58:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on pc2011.codfw.wmnet with reason: Onsite maintenance T334722 [06:58:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org [06:58:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on pc2011.codfw.wmnet with reason: Onsite maintenance T334722 [06:58:53] (03PS1) 10Majavah: hieradata: swap dumps_dist_active_* params [puppet] - 10https://gerrit.wikimedia.org/r/915362 [07:00:05] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0700). [07:00:57] no trainees are signed up for the window and urbanecm has several patches to be scheduled for deployment, self-deploying I assume, so that's how this morning's window will go. 
[07:01:40] Yep yep, waiting on CI atm. [07:01:56] 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) @Jhancock.wm pc2011 is now OFF, so you can work on it whenever you want. [07:02:35] ok, enjoy this morning's episode of Zuul Watch! [07:02:57] Thanks! :D [07:07:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org [07:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:11:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org [07:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS bookworm [07:13:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast2003.wikimedia.org with OS bookworm completed: - bast2003 (**WARN**) - D... [07:13:53] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add new LVS host lvs2011 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/914871 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [07:15:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org [07:16:20] (03CR) 10CI reject: [V: 04-1] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:16:44] this episode of Zuul Watch ended with selenium failure. let's restart! 
[07:16:51] (03CR) 10Urbanecm: [C: 03+2] "rerun..." [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:18:06] (03PS1) 10Muehlenhoff: Assign bastion role to bast2003 [puppet] - 10https://gerrit.wikimedia.org/r/915363 (https://phabricator.wikimedia.org/T334287) [07:21:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org [07:22:52] perhaps if you whack the tv on its side the reception will come back in again ;-) [07:24:06] (03PS1) 10Alexandros Kosiaris: machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505) [07:27:24] (03CR) 10David Caro: [C: 03+2] hieradata: swap dumps_dist_active_* params [puppet] - 10https://gerrit.wikimedia.org/r/915362 (owner: 10Majavah) [07:28:27] (03CR) 10Ayounsi: "Overall it looks good to me, but before approving it could you split this patch in 2: One for the bird/anycast.pp change and one for the c" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [07:29:07] (03Merged) 10jenkins-bot: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:29:12] (03Merged) 10jenkins-bot: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914302 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [07:29:16] (03Merged) 10jenkins-bot: ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914303 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [07:29:18] 10SRE-OnFire, 10Traffic, 10conftool, 
10serviceops, 10Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10Joe) I'm frankly not sure how checking appserver.svc.eaqiad.wmnet:9090 from... [07:29:24] (03Merged) 10jenkins-bot: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914304 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [07:29:31] finally something [07:29:57] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]] [07:30:02] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [07:30:02] T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337 [07:31:28] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [07:33:41] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 293 [07:33:46] patches work, continuing [07:33:55] and (still) waiting CI on the remaining few patches [07:34:04] hopefully will be faster [07:34:25] PROBLEM - Persistent high iowait on 
clouddumps1001 is CRITICAL: 10.43 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [07:34:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 293 [07:35:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 134823 [07:37:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 134823 [07:37:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]] (duration: 07m 58s) [07:37:59] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [07:37:59] T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337 [07:38:35] (Access port speed <= 100Mbps) firing: (2) Alert for device asw-a-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [07:45:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [07:45:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:46:50] (03Merged) 10jenkins-bot: 
ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm) [07:47:16] looks i might be able to finish all patches in time after all :) [07:48:42] (03Merged) 10jenkins-bot: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:49:08] (03CR) 10Ayounsi: "At first glance it looks good to me, but someone from traffic (or who knows more about DNS) needs to review it." [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [07:49:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]] [07:49:19] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [07:49:20] T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337 [07:49:26] (03PS5) 10Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) [07:49:32] (03CR) 10Urbanecm: [C: 03+2] [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:49:39] (03PS5) 10Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) [07:50:18] (03Merged) 10jenkins-bot: [Growth] Remove GEMentorDashboardDeploymentMode 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:50:20] (03CR) 10Urbanecm: [C: 03+2] [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:50:50] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [07:51:06] (03Merged) 10jenkins-bot: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm) [07:56:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]] (duration: 07m 08s) [07:56:28] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [07:56:29] T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337 [07:56:40] 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Marostegui) I am working with MariaDB foundation to see if we can find more information about this. For now I am running `mariadb-check --check --extended --database enwiki` on both hosts and it is sh... [07:56:47] ahh! 
scap backport still has issues with dependencies :-/ [07:56:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1132.eqiad.wmnet with reason: Onsite maintenance T334722 [07:56:51] T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 [07:57:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1132.eqiad.wmnet with reason: Onsite maintenance T334722 [07:57:06] fortunately, "continue with unexpected commits" works :) [07:57:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]] [07:57:41] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.32 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [07:58:52] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [08:01:44] (03CR) 10Elukey: [C: 03+2] admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - 10https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [08:01:45] okay, not exactly in time, but still almost :) [08:01:48] last sync in progress [08:04:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]] (duration: 07m 24s) [08:04:51] T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630 [08:04:53] done [08:07:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:07:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[08:08:37] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [08:11:22] (03PS1) 10Elukey: admin_ng: remove tls hostname override for ores-legacy-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/915416 (https://phabricator.wikimedia.org/T335756) [08:13:58] (03CR) 10JMeybohm: [C: 04-1] "I would suggest to rename this to cronjob instead of job as there are plain job objects in k8s as well, so the name is a bit misleading." [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto) [08:14:11] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [08:14:51] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.33 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [08:15:58] (03CR) 10Jelto: "lgtm, but what about profile::query_service::gui_url?" 
[puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn) [08:16:55] looks like there's a long break before the next window so it's all good [08:18:22] i overran just four minutes :) [08:18:57] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:19:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [08:20:20] (03PS1) 10ArielGlenn: Add dump user subdirectories to support testing of new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232) [08:20:22] (03Merged) 10jenkins-bot: machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [08:22:07] (03CR) 10Elukey: [C: 03+1] "Looks really nice and simpler than what we have, I am in favor to proceed :) Since it is a new thing, I guess that we could let it bake fo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [08:25:36] (03CR) 10Stevemunene: [C: 03+2] Add a postgresql database and user for airflow_analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/911296 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene) [08:30:35] (03PS1) 10Volans: decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 [08:31:13] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:37] (03CR) 10Volans: "Thanks for finding the error and sending this patch. I proposed a slightly different one with additional tests to catch this use case in I" [software/spicerack] - 10https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) (owner: 10EoghanGaffney) [08:32:10] (03PS1) 10ArielGlenn: add nfs tester to dumps worker (snapshot) testbed role [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) [08:34:46] (03CR) 10Elukey: "Checked all the kubernetes.yaml config (IPs, etc..) and everything checks out (also way tidier than before!)" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:35:01] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [08:37:54] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes1006.eqiad.wmnet [08:37:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes1006.eqiad.wmnet [08:37:56] (03PS2) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [08:38:54] (03PS3) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [08:40:22] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [08:40:25] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [08:40:40] (03CR) 10ArielGlenn: 
"https://puppet-compiler.wmflabs.org/output/915437/41033/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [08:41:23] (03PS1) 10ArielGlenn: create custom db list files for testing of nfs shares for xml dumps [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) [08:43:58] (03CR) 10Elukey: "Did a quick pass and added two comments, but the work is really good and we should really push to deploy it before it gets stale." [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [08:46:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [08:46:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [08:47:01] (03CR) 10Elukey: [C: 03+2] admin_ng: remove tls hostname override for ores-legacy-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/915416 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [08:47:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:47:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:47:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:47:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:47:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T335838)', diff saved to 
https://phabricator.wikimedia.org/P47481 and previous config saved to /var/cache/conftool/dbconfig/20230504-084741-ladsgroup.json [08:47:59] PROBLEM - Host wcqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:49:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:49:44] (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915447/41035/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [08:49:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [08:50:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [08:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47482 and previous config saved to /var/cache/conftool/dbconfig/20230504-085008-ladsgroup.json [08:50:35] (03PS1) 10ArielGlenn: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) [08:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T335838)', diff saved to https://phabricator.wikimedia.org/P47483 and previous config saved to /var/cache/conftool/dbconfig/20230504-085151-ladsgroup.json [08:52:21] (03PS1) 10JMeybohm: Copy configuration_1.1.1 to configuration_1.1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [08:52:23] (03PS1) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [08:52:39] 
(03PS1) 10KartikMistry: Update MinT to 2023-05-04-084420-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) [08:54:56] (03CR) 10Muehlenhoff: [C: 03+2] Assign bastion role to bast2003 [puppet] - 10https://gerrit.wikimedia.org/r/915363 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [08:56:11] (03PS2) 10Volans: decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855) [08:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47484 and previous config saved to /var/cache/conftool/dbconfig/20230504-085637-ladsgroup.json [08:56:44] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [08:56:55] (03Abandoned) 10KartikMistry: machinetranslation: Fix gunicorn workers setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/912298 (owner: 10KartikMistry) [08:59:01] (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915455/41036/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [08:59:51] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.85 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [08:59:53] (03PS2) 10KartikMistry: Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) [09:01:37] Deploying to MinT (staging only) ^^ [09:02:15] (03PS1) 10ArielGlenn: add a custom xml dumps config file for testing new nfs shares [puppet] - 
10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) [09:02:21] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) (owner: 10KartikMistry) [09:03:16] (03Merged) 10jenkins-bot: Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) (owner: 10KartikMistry) [09:04:30] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [09:05:01] (03CR) 10EoghanGaffney: [C: 03+1] "This is great, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855) (owner: 10Volans) [09:05:10] (03Abandoned) 10EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) (owner: 10EoghanGaffney) [09:05:58] 10SRE-tools, 10Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (10Volans) [09:06:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (10Volans) [09:06:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:41] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [09:06:43] (03CR) 10ArielGlenn: 
"https://puppet-compiler.wmflabs.org/output/915463/41038/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P47485 and previous config saved to /var/cache/conftool/dbconfig/20230504-090657-ladsgroup.json [09:07:34] (03PS18) 10ArielGlenn: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) [09:07:39] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.24 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [09:07:57] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:08:03] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet [09:11:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P47486 and previous config saved to /var/cache/conftool/dbconfig/20230504-091143-ladsgroup.json [09:12:09] !log jmm@cumin2002 START - 
Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [09:12:25] (03PS3) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) [09:13:38] (03PS1) 10Elukey: admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465 [09:14:23] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10jijiki) [09:14:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet [09:17:05] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.31 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [09:17:28] (03CR) 10Volans: "Thanks for the patch, but with PCC it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:18:27] (03PS6) 10Arturo Borrero Gonzalez: profile::bird::anycast: allow setting the BGP IP address from the profile [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) [09:20:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [09:21:12] (03CR) 10Arturo Borrero Gonzalez: profile::bird::anycast: allow setting the BGP IP address from the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [09:21:23] (03PS4) 10Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461 [09:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:21:41] (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/913164/41041/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [09:22:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P47487 and previous config saved to /var/cache/conftool/dbconfig/20230504-092203-ladsgroup.json [09:22:09] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.24 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [09:22:17] (03PS4) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 [09:23:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41044/console" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:24:11] (03PS5) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 [09:25:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41046/console" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:25:48] (03CR) 10Hnowlan: [C: 03+2] admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:25:55] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:26:28] (03PS6) 10Jbond: install_server: improve readability of netmask logic [puppet] - 
10https://gerrit.wikimedia.org/r/914878 [09:26:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:26:47] (03CR) 10Jbond: install_server: improve readability of netmask logic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P47488 and previous config saved to /var/cache/conftool/dbconfig/20230504-092649-ladsgroup.json [09:26:59] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.51 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [09:27:22] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/915476 (https://phabricator.wikimedia.org/T335760) [09:28:10] (03Merged) 10jenkins-bot: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:30:34] (03CR) 10Volans: [C: 03+1] "LGTM, thanks a lot!" 
[puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1114 T335837', diff saved to https://phabricator.wikimedia.org/P47490 and previous config saved to /var/cache/conftool/dbconfig/20230504-093419-ladsgroup.json [09:34:23] T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 [09:35:59] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 [09:36:17] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:36:35] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:37:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T335838)', diff saved to https://phabricator.wikimedia.org/P47491 and previous config saved to /var/cache/conftool/dbconfig/20230504-093710-ladsgroup.json [09:37:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [09:37:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [09:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47492 and previous config saved to /var/cache/conftool/dbconfig/20230504-093733-ladsgroup.json [09:37:58] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for cloudbackup1001-dev.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001 [09:38:09] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond) [09:38:36] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cloudbackup1001-dev.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001 [09:38:41] !log 
hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [09:38:45] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [09:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47493 and previous config saved to /var/cache/conftool/dbconfig/20230504-094156-ladsgroup.json [09:42:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:42:09] (03PS1) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 [09:42:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:42:17] (03PS1) 10Ladsgroup: instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837) [09:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47494 and previous config saved to /var/cache/conftool/dbconfig/20230504-094221-ladsgroup.json [09:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47495 and previous config saved to /var/cache/conftool/dbconfig/20230504-094253-ladsgroup.json [09:42:58] (03PS1) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) [09:44:17] (03CR) 10Jbond: [C: 03+1] cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/915476 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [09:44:24] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 4.892 
https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [09:45:51] (03CR) 10Jbond: [C: 03+2] install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond) [09:45:53] (03PS1) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) [09:47:16] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [09:47:46] (03PS2) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) [09:47:51] (03PS2) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 [09:47:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet [09:48:15] (03PS2) 10Ladsgroup: instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837) [09:48:31] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup) [09:48:33] (03PS3) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) [09:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47496 and previous config saved to /var/cache/conftool/dbconfig/20230504-094850-ladsgroup.json [09:49:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1114 from dbctl T335837', diff saved to https://phabricator.wikimedia.org/P47497 and previous config saved to 
/var/cache/conftool/dbconfig/20230504-094945-ladsgroup.json [09:49:48] T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 [09:52:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet [09:53:00] (03PS1) 10Volans: Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 [09:53:24] (03CR) 10CI reject: [V: 04-1] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans) [09:53:37] (03PS2) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) [09:53:39] (03PS1) 10Alexandros Kosiaris: Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505) [09:55:36] (03PS2) 10Volans: Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 [09:55:54] (03CR) 10Ayounsi: [C: 03+1] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans) [09:56:22] (03CR) 10Jelto: "Thanks for the addition! Looks mostly good, two suggestions in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [09:58:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P47498 and previous config saved to /var/cache/conftool/dbconfig/20230504-095800-ladsgroup.json [10:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1000). 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1000) [10:00:30] (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837) [10:01:34] (03CR) 10Volans: "Answer/question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney) [10:01:37] (03CR) 10Ayounsi: netbox: run the rqworker command as netbox user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:02:15] (03PS3) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 [10:02:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [10:02:27] (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:03:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P47499 and previous config saved to /var/cache/conftool/dbconfig/20230504-100357-ladsgroup.json [10:04:06] (03CR) 10Volans: [C: 03+2] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans) [10:05:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [10:05:59] (03PS2) 10Elukey: admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465 [10:06:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [10:06:44] (03CR) 10Elukey: "Not the best approach of the world but I think it should work fine for the 
moment, since inference-staging is the only "custom" ingress se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465 (owner: 10Elukey) [10:06:52] (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:06:56] (03Merged) 10jenkins-bot: Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [10:07:08] (03PS2) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837) [10:07:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup) [10:07:50] (03PS1) 10Ladsgroup: Revert "mariadb: Remove puppet entries for db1114" [puppet] - 10https://gerrit.wikimedia.org/r/915379 [10:07:57] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "mariadb: Remove puppet entries for db1114" [puppet] - 10https://gerrit.wikimedia.org/r/915379 (owner: 10Ladsgroup) [10:08:02] (03PS3) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) [10:08:20] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:53] (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915380 (https://phabricator.wikimedia.org/T335837) [10:09:17] BGP alert expected [10:10:44] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [10:10:49] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 
10https://gerrit.wikimedia.org/r/914866 [10:11:11] (03PS4) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 [10:11:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1114.eqiad.wmnet [10:11:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [10:12:14] (03Merged) 10jenkins-bot: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris) [10:12:34] (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P47500 and previous config saved to /var/cache/conftool/dbconfig/20230504-101306-ladsgroup.json [10:15:17] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder) [10:16:04] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet [10:16:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:16:20] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [10:16:23] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [10:16:59] !log elukey@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:17:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet [10:17:40] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:19:04] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P47501 and previous config saved to /var/cache/conftool/dbconfig/20230504-101903-ladsgroup.json [10:19:35] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1114.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [10:20:48] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:20:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1114.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [10:20:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:20:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1114.eqiad.wmnet [10:21:11] (03CR) 10Volans: "reply/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:21:27] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915380 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup) [10:21:45] (03PS5) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) [10:23:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41049/console" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [10:23:35] !log Removing db1114 from zarcillo 
T335837 [10:23:36] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup100X-dev: set missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/915516 [10:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 [10:24:12] (03CR) 10Jbond: "i think it would be better if someone more familiar with dumps deployed this change so that they can confirm everything is deployed as exp" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [10:25:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/915516 (owner: 10Arturo Borrero Gonzalez) [10:26:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup100X-dev: set missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/915516 (owner: 10Arturo Borrero Gonzalez) [10:26:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [10:27:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [10:27:20] (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:27:55] 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [10:28:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47502 and previous config saved to /var/cache/conftool/dbconfig/20230504-102812-ladsgroup.json [10:28:15] (03CR) 10Elukey: [C: 03+2] admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/915465 (owner: 10Elukey) [10:28:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [10:28:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [10:28:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47503 and previous config saved to /var/cache/conftool/dbconfig/20230504-102835-ladsgroup.json [10:28:38] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet [10:29:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:30:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:30:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:34:01] 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Michael) I'm not fully sure who would be reviewing and deploying these changes. Maybe someone from the #sre team? 
[10:34:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47504 and previous config saved to /var/cache/conftool/dbconfig/20230504-103409-ladsgroup.json [10:34:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:34:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:34:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47505 and previous config saved to /var/cache/conftool/dbconfig/20230504-103434-ladsgroup.json [10:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47506 and previous config saved to /var/cache/conftool/dbconfig/20230504-103459-ladsgroup.json [10:35:06] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond) [10:35:35] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [10:35:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org [10:36:51] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10darthmon_wmde) [10:37:10] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10darthmon_wmde) [10:39:49] (03CR) 10Filippo Giunchedi: "LGTM, nice! Thank you for looking into this." 
[alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [10:40:01] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet [10:40:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org [10:40:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3002.wikimedia.org [10:40:28] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet [10:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47507 and previous config saved to /var/cache/conftool/dbconfig/20230504-104107-ladsgroup.json [10:42:31] (03PS1) 10Alexandros Kosiaris: machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522 [10:43:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [10:43:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522 (owner: 10Alexandros Kosiaris) [10:43:58] (03PS5) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 [10:44:27] (03Merged) 10jenkins-bot: machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522 (owner: 10Alexandros Kosiaris) [10:44:41] (03CR) 10Volans: "skipped PrivateTmp" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:44:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3002.wikimedia.org [10:47:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:47:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org [10:48:38] !log eoghan@cumin1001 START - Cookbook sre.ganeti.reimage for host aphlict2001.codfw.wmnet with OS bullseye [10:48:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet [10:50:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P47508 and previous config saved to /var/cache/conftool/dbconfig/20230504-105005-ladsgroup.json [10:50:26] (03CR) 10Volans: [C: 03+2] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans) [10:51:48] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [10:52:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org [10:52:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:41] (03CR) 10Filippo Giunchedi: "Something else to note: on transient/spike errors the alert will auto-resolve once the spike is gone, mentioning it in case this is a prob" [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [10:52:53] (03PS1) 10Elukey: Rakefile: fix git branch check 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 [10:53:05] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [10:53:35] PROBLEM - Kerberos KDC daemon on krb2002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [10:54:05] ^krb2002 is me, WIP [10:54:51] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112 [10:55:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112 [10:55:31] (03CR) 10Elukey: [C: 03+2] conftool-data: add config for the k8s ingress for ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [10:56:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P47509 and previous config saved to /var/cache/conftool/dbconfig/20230504-105613-ladsgroup.json [10:56:35] RECOVERY - Kerberos KDC daemon on krb2002 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [10:57:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:59:19] (03CR) 10Elukey: [C: 03+2] Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [11:01:34] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage 
[11:03:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5713 [11:04:44] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage [11:04:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5713 [11:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P47510 and previous config saved to /var/cache/conftool/dbconfig/20230504-110511-ladsgroup.json [11:07:39] jouncebot: now [11:07:39] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [11:07:58] (03PS7) 10Ayounsi: profile::bird::anycast: allow setting the BGP IP address from the profile [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [11:08:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [11:08:22] I’d like to deploy some backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/914298, and probably also for wmf.6) – shouldn’t have an effect yet but will then let us do a config change in the window later without having to wait for the backports first [11:08:30] if no one objects to that :) [11:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:11:20] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P47511 and previous config saved to /var/cache/conftool/dbconfig/20230504-111119-ladsgroup.json [11:11:39] (03CR) 10Muehlenhoff: [C: 03+2] Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [11:13:45] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:13:47] (03PS1) 10Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458) [11:14:24] !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aphlict2001.codfw.wmnet with OS bullseye [11:15:18] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001 [11:15:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:47] (03PS7) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782) [11:16:25] alright, I’ll get started then [11:16:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [11:18:52] (03PS1) 10Muehlenhoff: Make krb2002 available to Kerberos client [puppet] - 10https://gerrit.wikimedia.org/r/915569 (https://phabricator.wikimedia.org/T331695) [11:20:06] (03PS1) 10Jbond: 
ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) [11:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47512 and previous config saved to /var/cache/conftool/dbconfig/20230504-112017-ladsgroup.json [11:20:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [11:20:30] (03CR) 10CI reject: [V: 04-1] ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond) [11:20:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [11:20:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47513 and previous config saved to /var/cache/conftool/dbconfig/20230504-112041-ladsgroup.json [11:23:38] (03PS2) 10Jbond: ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) [11:24:29] * kart_ updating cxserver [11:24:43] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-03-044244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry) [11:25:28] (03Merged) 10jenkins-bot: Update cxserver to 2023-05-03-044244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry) [11:25:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41052/console" [puppet] - 
10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond) [11:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47514 and previous config saved to /var/cache/conftool/dbconfig/20230504-112625-ladsgroup.json [11:26:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47515 and previous config saved to /var/cache/conftool/dbconfig/20230504-112650-ladsgroup.json [11:27:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47516 and previous config saved to /var/cache/conftool/dbconfig/20230504-112705-ladsgroup.json [11:27:23] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:27:43] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:30:47] !log installing curl security updates [11:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:54] !log installing curl security updates (on buster) [11:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:01] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:31:05] (03PS2) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [11:31:07] (03PS2) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [11:31:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:33:31] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:34:18] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:35:10] (03PS8) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782) [11:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47518 and previous config saved to /var/cache/conftool/dbconfig/20230504-113529-ladsgroup.json [11:35:38] (03Merged) 10jenkins-bot: Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große) [11:36:08] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]] [11:36:10] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [11:38:02] !log Updated cxserver to 2023-05-03-044244-production (T333835, T335019, T331505) [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:08] T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505 [11:38:08] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [11:38:09] T333835: Disable machine translation for Cantonese - https://phabricator.wikimedia.org/T333835 [11:38:09] T335019: 
Post-creation work for fatwiki - https://phabricator.wikimedia.org/T335019 [11:38:35] (Access port speed <= 100Mbps) firing: (2) Alert for device asw-a-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [11:38:45] no change on https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11 yet (good), syncing [11:40:19] (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 [11:40:32] (03CR) 10CI reject: [V: 04-1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (owner: 10Arturo Borrero Gonzalez) [11:40:43] (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) [11:40:55] (03CR) 10CI reject: [V: 04-1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [11:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P47519 and previous config saved to /var/cache/conftool/dbconfig/20230504-114211-ladsgroup.json [11:44:32] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]] (duration: 08m 24s) [11:44:35] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [11:45:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:46:04] jouncebot: next [11:46:04] In 1 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300) [11:46:05] In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300) [11:46:24] then I’ll go ahead and do the wmf.6 backport too [11:46:26] (should be another noop) [11:46:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [11:49:05] (03PS1) 10Ayounsi: set rq==1.13.0 to workaround bug [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591 [11:50:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P47520 and previous config saved to /var/cache/conftool/dbconfig/20230504-115035-ladsgroup.json [11:56:08] (03PS1) 10Slyngshede: signup: allow blocking of username with regex [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806) [11:56:32] (03CR) 10Muehlenhoff: [C: 03+2] Make krb2002 available to Kerberos client [puppet] - 10https://gerrit.wikimedia.org/r/915569 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [11:56:36] (03PS3) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [11:56:38] (03PS3) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [11:56:40] (03PS1) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [11:57:10] (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:57:12] (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:57:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P47521 and previous config saved to /var/cache/conftool/dbconfig/20230504-115717-ladsgroup.json [11:57:22] (03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm) [12:02:44] (03Merged) 10jenkins-bot: Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [12:03:15] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]] [12:03:18] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [12:03:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2023-05-03-044244-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry) [12:04:44] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [12:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P47522 and previous config saved to /var/cache/conftool/dbconfig/20230504-120542-ladsgroup.json [12:06:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add MinT support to cxserver (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782) (owner: 10KartikMistry) [12:08:28] !log installing libdatetime-timezone-perl updates [12:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM minus a small piece of ruby style, but feel free to ignore" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey) [12:10:58] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]] (duration: 07m 43s) [12:11:01] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [12:11:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:12:01] * Lucas_WMDE done [12:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47523 and previous config 
saved to /var/cache/conftool/dbconfig/20230504-121224-ladsgroup.json [12:12:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [12:12:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [12:12:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47524 and previous config saved to /var/cache/conftool/dbconfig/20230504-121247-ladsgroup.json [12:16:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47525 and previous config saved to /var/cache/conftool/dbconfig/20230504-122048-ladsgroup.json [12:20:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:21:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47526 and previous config saved to /var/cache/conftool/dbconfig/20230504-122114-ladsgroup.json [12:22:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47527 and previous config saved to /var/cache/conftool/dbconfig/20230504-122237-ladsgroup.json [12:22:47] (03PS3) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set 
openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) [12:25:20] (03PS4) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) [12:27:05] (03CR) 10Majavah: [C: 03+1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [12:27:11] (03CR) 10David Caro: [C: 03+1] ":fingers_crossed:" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [12:27:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez) [12:30:49] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591 (owner: 10Ayounsi) [12:31:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47528 and previous config saved to /var/cache/conftool/dbconfig/20230504-123103-ladsgroup.json [12:31:13] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:38] (03PS9) 10KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [12:34:05] (03CR) 10KartikMistry: WIP: Add MinT support to cxserver (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry) [12:34:27] (03PS10) 10KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 [12:34:43] (03PS1) 10Muehlenhoff: Add bast2003 to Bastion hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287) [12:36:14] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] set rq==1.13.0 to workaround bug [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591 (owner: 10Ayounsi) [12:38:03] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - ayounsi@cumin1001 [12:39:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - ayounsi@cumin1001 [12:41:35] (03PS1) 10Muehlenhoff: wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287) [12:41:37] (03PS1) 10Muehlenhoff: Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640 [12:46:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47529 and previous config saved to /var/cache/conftool/dbconfig/20230504-124609-ladsgroup.json [12:48:47] !log installing ruby-rack security updates [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org [12:50:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:52:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:52:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:52:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47530 and previous config saved to /var/cache/conftool/dbconfig/20230504-125250-ladsgroup.json [12:53:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47531 and previous config saved to /var/cache/conftool/dbconfig/20230504-125309-ladsgroup.json [12:54:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [12:54:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org [12:54:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [12:55:07] (03PS1) 10Btullis: Fail back hive services to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/915657 [12:56:19] (03PS2) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [12:56:20] (03PS4) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [12:56:22] (03PS4) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [12:56:42] (03CR) 10CI reject: [V: 04-1] Run envoy config 
validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:56:45] (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:56:48] (03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm) [12:57:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [12:57:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [12:59:40] (03PS1) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) [13:00:07] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300) [13:00:07] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300). Please do the needful. [13:00:07] jan_drewniak and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] o/ [13:00:46] jan_drewniak: do you want to self-service or should I deploy? 
[13:01:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47532 and previous config saved to /var/cache/conftool/dbconfig/20230504-130115-ladsgroup.json [13:01:42] Lucas_WMDE: yes, I can self-service :) this first patch might take 10min or so. We want to deploy Vector 2022 to eswiki, but make sure all is good with traffic after we do that [13:01:49] ok! [13:02:03] only one of my three changes is left btw, I did the backports earlier already [13:02:04] go ahead :) [13:03:48] Lucas_WMDE: I got a strange message "Aborting: This scap command is disabled on this host". Which host do you use for deployments? [13:03:57] deployment.eqiad.wmnet [13:04:02] which I think is currently an alias for deploy1002 [13:04:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:04:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47533 and previous config saved to /var/cache/conftool/dbconfig/20230504-130432-ladsgroup.json [13:05:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:05:04] yay [13:05:07] Lucas_WMDE: k, thanks :) [13:05:27] (03PS2) 10Jdrewniak: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) [13:05:38] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) 
(owner: 10Jdrewniak) [13:06:12] Amir1: Just FYI, we're starting the eswiki deployment [13:06:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47534 and previous config saved to /var/cache/conftool/dbconfig/20230504-130616-ladsgroup.json [13:06:25] thanks [13:06:27] (03Merged) 10jenkins-bot: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:06:33] I'm around for a bit [13:06:54] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]] [13:06:57] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:07:19] Amir1: thanks, we're doing 10% then 100%, 10% is going out now. [13:07:23] (03PS3) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [13:07:25] awesome [13:07:25] (03PS5) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [13:07:27] (03PS5) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [13:07:49] (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [13:07:54] (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [13:07:56] 
(03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm) [13:08:27] Gosh I can't wait for the zebra to be freed https://en.wikipedia.org/wiki/Polar_bear?VectorZebraDesign=1 [13:08:51] Amir1: thanks! It's almost there! [13:09:02] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:10:13] (03CR) 10Herron: [C: 03+1] sre.hosts.reimage: improve failed first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) (owner: 10Volans) [13:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47535 and previous config saved to /var/cache/conftool/dbconfig/20230504-131054-ladsgroup.json [13:11:25] <3 [13:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47536 and previous config saved to /var/cache/conftool/dbconfig/20230504-131302-ladsgroup.json [13:15:09] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]] (duration: 08m 15s) [13:15:13] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:15:24] Amir1: ok we're at 10% on eswiki [13:15:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:43] checking [13:15:51] it's in s2 I think 
[13:16:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47537 and previous config saved to /var/cache/conftool/dbconfig/20230504-131621-ladsgroup.json [13:17:07] already got one [13:17:09] nice [13:17:30] traffic in mysql is going back to normal [13:17:38] I suggest you can continue to 100% [13:17:48] Amir1: awesome! [13:18:16] (03PS2) 10Jdrewniak: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) [13:18:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:20:18] 10SRE, 10Infrastructure-Foundations: Display meta.wikimedia.org username, if authenticated, before linking - https://phabricator.wikimedia.org/T335955 (10SLyngshede-WMF) [13:20:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:36] 10SRE, 10Infrastructure-Foundations: Display meta.wikimedia.org username, if authenticated, before linking - https://phabricator.wikimedia.org/T335955 (10SLyngshede-WMF) p:05Triage→03Low [13:20:52] (03CR) 10Jdrewniak: [C: 03+2] Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:21:19] (03CR) 10Marostegui: "Did you get security to approve this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:21:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47538 and previous config saved to /var/cache/conftool/dbconfig/20230504-132122-ladsgroup.json [13:21:26] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [13:21:50] (03Merged) 10jenkins-bot: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:22:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2003.codfw.wmnet [13:22:17] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]] [13:22:21] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:22:22] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:22:24] (03CR) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:23:02] (03CR) 10ArielGlenn: dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [13:23:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:23:43] (03CR) 10Marostegui: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:23:47] 
(03CR) 10Marostegui: [C: 03+1] mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:23:48] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:24:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47539 and previous config saved to /var/cache/conftool/dbconfig/20230504-132439-ladsgroup.json [13:26:00] (03CR) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:26:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P47540 and previous config saved to /var/cache/conftool/dbconfig/20230504-132600-ladsgroup.json [13:26:05] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup) [13:26:07] (03PS2) 10Elukey: Rakefile: fix git branch check [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 [13:26:10] (03CR) 10Elukey: Rakefile: fix git branch check (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey) [13:26:14] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2003.codfw.wmnet [13:26:32] (03CR) 10Elukey: [C: 03+2] Rakefile: fix git branch check [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey) 
[13:26:51] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335775 (10Jclark-ctr) 05Open→03Resolved Reseated power supply [13:27:15] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10Jclark-ctr) 05Open→03Resolved Reseated power supply [13:27:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2002.codfw.wmnet [13:28:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P47541 and previous config saved to /var/cache/conftool/dbconfig/20230504-132809-ladsgroup.json [13:30:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2002.codfw.wmnet [13:30:19] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]] (duration: 08m 01s) [13:30:22] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:31:01] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2001.codfw.wmnet [13:31:21] Alright that's eswiki! [13:31:47] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10herron) >>! In T334733#8823968, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (... 
[13:32:07] (03PS2) 10Jdrewniak: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) [13:32:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:33:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2001.codfw.wmnet [13:33:29] (03Merged) 10jenkins-bot: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak) [13:33:29] Lucas_WMDE: almost done, on the last patch... [13:33:33] ok! [13:33:42] (03CR) 10Ssingh: [C: 03+1] "Looks good, with the separate bird change as Arzhel suggested!" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez) [13:33:57] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]] [13:34:00] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1006.eqiad.wmnet [13:35:34] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond) [13:35:36] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:35:39] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:36:05] Amir1: thanks for checking traffic for us! 
We're live on eswiki 🎉 [13:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47542 and previous config saved to /var/cache/conftool/dbconfig/20230504-133628-ladsgroup.json [13:36:35] (03PS1) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) [13:37:33] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:37:37] (03CR) 10CI reject: [V: 04-1] mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [13:37:43] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10elukey) ` elukey@kafka-logging1001:~$ kafka acls --list kafka-acls --authorizer-properties... 
[13:37:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1006.eqiad.wmnet [13:37:58] !log revert "Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka logging clusters - T334733" [13:37:58] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond) [13:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:01] T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 [13:38:13] wohoo [13:38:20] (03PS1) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) [13:38:43] 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Ladsgroup) [13:38:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:39:05] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1005.eqiad.wmnet [13:39:44] (03PS6) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) [13:39:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P47543 and previous config saved to /var/cache/conftool/dbconfig/20230504-133945-ladsgroup.json [13:40:27] (03CR) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [13:41:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41053/console" [puppet] 
- 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [13:41:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P47544 and previous config saved to /var/cache/conftool/dbconfig/20230504-134106-ladsgroup.json [13:41:44] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]] (duration: 07m 47s) [13:41:47] T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686 [13:41:57] Lucas_WMDE: ok, finally done :) [13:42:04] \o/ [13:42:58] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1005.eqiad.wmnet [13:42:58] (03PS3) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) [13:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P47545 and previous config saved to /var/cache/conftool/dbconfig/20230504-134315-ladsgroup.json [13:43:29] grmbl, scap backport gets confused by the Depends-On, as expected [13:43:36] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1004.eqiad.wmnet [13:43:41] (03PS4) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) [13:43:43] (03CR) 10Jbond: [V: 03+1] dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond) [13:43:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:44:38] (03Merged) 10jenkins-bot: Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE)) [13:45:04] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]] [13:45:07] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [13:45:14] (03PS1) 10Eevans: aqs: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383) [13:46:46] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [13:47:05] (03CR) 10Jelto: "Hi 👋 As discussed some time ago in T300171#8259774 this change adds a second static miscweb release. Does this make sense to you? 
Is it ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [13:47:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jgreen) [13:47:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:47:27] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:47:30] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1004.eqiad.wmnet [13:48:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jgreen) a:05Jclark-ctr→03None [13:48:19] https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11 and https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11&format=xmlfm look good on mwdebug, syncing [13:48:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2006.codfw.wmnet [13:48:26] !log switching to bullseye kafka monitoring hosts T335424 [13:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:29] T335424: kafkamon: upgrade to bullseye - https://phabricator.wikimedia.org/T335424 [13:48:43] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) 
a:05Jgreen→03None [13:49:03] (03CR) 10Herron: [C: 03+2] kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [13:49:25] 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jgreen) a:05Jgreen→03Dwisehaupt [13:50:29] (03CR) 10Nikerabbit: [C: 03+1] WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry) [13:51:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47546 and previous config saved to /var/cache/conftool/dbconfig/20230504-135135-ladsgroup.json [13:52:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2006.codfw.wmnet [13:52:20] jouncebot: nowandnext [13:52:20] For the next 0 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300) [13:52:20] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300) [13:52:20] In 0 hour(s) and 7 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400) [13:53:28] sukhe: I’m close to done, php-fpm-restart at 542% [13:53:30] *52% [13:53:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2005.codfw.wmnet [13:53:45] Lucas_WMDE: all good, not in a rush today! 
(no dc-ops on site waiting for me :) [13:53:52] I will start once you are done [13:53:52] ok ^^ [13:54:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P47547 and previous config saved to /var/cache/conftool/dbconfig/20230504-135452-ladsgroup.json [13:54:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]] (duration: 09m 52s) [13:54:59] !log UTC afternoon backport+config window done [13:54:59] T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458 [13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47548 and previous config saved to /var/cache/conftool/dbconfig/20230504-135551-ladsgroup.json [13:55:55] 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Jhancock.wm) @Marostegui Thanks for that. Draining the flea power worked. the mgmt port is now active and I can login to the idrac remotely. You can bring it back online now. 
[13:56:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2005.codfw.wmnet [13:56:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47549 and previous config saved to /var/cache/conftool/dbconfig/20230504-135612-ladsgroup.json [13:56:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [13:56:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [13:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47550 and previous config saved to /var/cache/conftool/dbconfig/20230504-135637-ladsgroup.json [13:56:52] 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) Thanks! Can you power on the host for me? [13:57:39] sukhe: you’re good to go as far as I’m concerned [13:57:47] thanks Lucas_WMDE! 
[13:57:51] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2004.codfw.wmnet [13:58:20] !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 [13:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47551 and previous config saved to /var/cache/conftool/dbconfig/20230504-135821-ladsgroup.json [13:58:23] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [13:58:25] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1361 is CRITICAL: etcd last index (1915005) is outdated compared to the master one (1915008) https://wikitech.wikimedia.org/wiki/Etcd [13:58:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:58:27] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1379 is CRITICAL: etcd last index (1915005) is outdated compared to the master one (1915008) https://wikitech.wikimedia.org/wiki/Etcd [13:58:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:58:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:58:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47552 and previous config saved to /var/cache/conftool/dbconfig/20230504-135845-ladsgroup.json [13:59:59] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1361 is OK: etcd last index (1915011) matches the master one (1915011) https://wikitech.wikimedia.org/wiki/Etcd [13:59:59] 
RECOVERY - MediaWiki EtcdConfig up-to-date on mw1379 is OK: etcd last index (1915011) matches the master one (1915011) https://wikitech.wikimedia.org/wiki/Etcd [14:00:05] sukhe: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for LVS maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400). [14:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47553 and previous config saved to /var/cache/conftool/dbconfig/20230504-140012-ladsgroup.json [14:01:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2004.codfw.wmnet [14:03:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47554 and previous config saved to /var/cache/conftool/dbconfig/20230504-140308-ladsgroup.json [14:03:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd1006.eqiad.wmnet [14:03:31] (Access port speed <= 100Mbps) firing: (2) Device asw-a-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [14:04:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [14:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47555 and previous config saved to /var/cache/conftool/dbconfig/20230504-140634-ladsgroup.json [14:07:00] (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/915657 (owner: 10Btullis) [14:07:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1006.eqiad.wmnet [14:08:28] !log cgoubert@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host kubetcd1005.eqiad.wmnet [14:08:50] (03CR) 10Eevans: [C: 03+2] aqs: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [14:09:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet [14:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47556 and previous config saved to /var/cache/conftool/dbconfig/20230504-140958-ladsgroup.json [14:10:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:10:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [14:10:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47557 and previous config saved to /var/cache/conftool/dbconfig/20230504-141024-ladsgroup.json [14:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P47558 and previous config saved to /var/cache/conftool/dbconfig/20230504-141057-ladsgroup.json [14:11:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [14:12:20] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [14:12:23] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [14:12:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1005.eqiad.wmnet [14:13:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
kubetcd1004.eqiad.wmnet [14:15:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet [14:17:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1004.eqiad.wmnet [14:17:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47559 and previous config saved to /var/cache/conftool/dbconfig/20230504-141749-ladsgroup.json [14:18:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P47560 and previous config saved to /var/cache/conftool/dbconfig/20230504-141814-ladsgroup.json [14:18:44] 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10herron) 05Open→03Resolved [14:20:02] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10phaultfinder) [14:20:32] (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus) [14:21:22] (03CR) 10Ssingh: [C: 03+2] lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [14:21:33] (03CR) 10Herron: [C: 03+1] "LGTM but let's let the new recording rule settle and give this query one last test before deploying" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914946 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus) [14:21:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [14:21:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P47561 and previous config saved to 
/var/cache/conftool/dbconfig/20230504-142140-ladsgroup.json [14:21:43] urandom: ok to merge your change? [14:21:46] Eevans: aqs: upgrade cluster to Cassandra 3.11.14 (e3566e48aa) [14:24:46] (03PS1) 10Elukey: custom_deploy.d: fix istio config for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/915685 (https://phabricator.wikimedia.org/T335756) [14:25:49] 10SRE-Access-Requests: Update access permissions - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF) [14:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P47562 and previous config saved to /var/cache/conftool/dbconfig/20230504-142604-ladsgroup.json [14:27:56] (03PS1) 10David Caro: k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) [14:28:06] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) @Eevans How can we help move this along? 
[14:28:09] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: fix istio config for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/915685 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [14:28:26] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [14:28:54] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [14:29:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:31:30] (03CR) 10David Caro: "Tested with a venv on tools:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro) [14:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P47563 and previous config saved to /var/cache/conftool/dbconfig/20230504-143255-ladsgroup.json [14:33:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P47564 and previous config saved to /var/cache/conftool/dbconfig/20230504-143320-ladsgroup.json [14:34:27] !log eevans@cumin1001 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [14:34:30] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [14:34:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [14:34:58] 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite 
Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata) Yikes, thank you, yes let's delete ACLs for kafka logging. I'm guessing that by... [14:35:19] (03PS1) 10SBassett: Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) [14:35:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [14:35:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [14:36:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro) [14:36:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P47565 and previous config saved to /var/cache/conftool/dbconfig/20230504-143647-ladsgroup.json [14:36:47] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [14:38:21] 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) 05Open→03Resolved Host back up and the idrac is indeed up too! Thanks @Jhancock.wm! 
[14:38:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [14:39:34] (03CR) 10Gergő Tisza: [C: 03+1] Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett) [14:40:12] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2011.codfw.wmnet with OS bullseye [14:40:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w... [14:40:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [14:41:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [14:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47566 and previous config saved to /var/cache/conftool/dbconfig/20230504-144110-ladsgroup.json [14:41:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:41:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:45:48] (03PS1) 10Herron: kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) [14:45:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1161.eqiad.wmnet with reason: Maintenance [14:46:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:46:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:46:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:46:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47567 and previous config saved to /var/cache/conftool/dbconfig/20230504-144625-ladsgroup.json [14:46:52] jouncebot: next [14:46:52] In 1 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600) [14:47:10] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [14:47:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w... 
[14:47:35] 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10fgiunchedi) [14:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P47568 and previous config saved to /var/cache/conftool/dbconfig/20230504-144801-ladsgroup.json [14:48:08] PROBLEM - puppet last run on prometheus6002 is CRITICAL: CRITICAL: Puppet has been disabled for 604863 seconds, message: Prometheus instances in drmrs dont have a replica label set, causing Thanos to ingest duplicate data - T335406 - denisse, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:48:09] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [14:48:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47569 and previous config saved to /var/cache/conftool/dbconfig/20230504-144827-ladsgroup.json [14:48:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:48:33] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder) [14:48:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [14:48:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47570 and previous config saved to /var/cache/conftool/dbconfig/20230504-144852-ladsgroup.json [14:49:07] (03CR) 10Jdlrobson: [C: 03+1] Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett) [14:49:21] (03CR) 10David Caro: [C: 03+2] k8s: 
don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro) [14:50:09] (03Merged) 10jenkins-bot: k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro) [14:50:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond) [14:51:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47571 and previous config saved to /var/cache/conftool/dbconfig/20230504-145153-ladsgroup.json [14:52:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47572 and previous config saved to /var/cache/conftool/dbconfig/20230504-145251-ladsgroup.json [14:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47573 and previous config saved to /var/cache/conftool/dbconfig/20230504-145307-ladsgroup.json [14:53:34] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder) [14:54:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [14:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47574 and previous config saved to /var/cache/conftool/dbconfig/20230504-145627-ladsgroup.json [14:59:06] PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:57] (03PS1) 10Marostegui: wmnet: 
Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695 [15:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47575 and previous config saved to /var/cache/conftool/dbconfig/20230504-150307-ladsgroup.json [15:03:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [15:03:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet [15:03:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [15:03:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:03:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [15:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47576 and previous config saved to /var/cache/conftool/dbconfig/20230504-150336-ladsgroup.json [15:03:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host lvs2011.codfw.wmnet [15:03:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet [15:03:59] (03CR) 10Jbond: [C: 03+1] wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [15:04:12] (03CR) 10Jbond: [C: 03+1] Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640 (owner: 10Muehlenhoff) [15:07:24] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [15:07:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1161', diff saved to https://phabricator.wikimedia.org/P47578 and previous config saved to /var/cache/conftool/dbconfig/20230504-150758-ladsgroup.json [15:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47579 and previous config saved to /var/cache/conftool/dbconfig/20230504-150813-ladsgroup.json [15:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47580 and previous config saved to /var/cache/conftool/dbconfig/20230504-151000-ladsgroup.json [15:11:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P47581 and previous config saved to /var/cache/conftool/dbconfig/20230504-151133-ladsgroup.json [15:13:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [15:16:12] (03PS1) 10Dzahn: add project language 'gpe', Ghanaian Pidgin [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969) [15:18:11] (03CR) 10Dzahn: [C: 03+2] "approved by langcom" [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969) (owner: 10Dzahn) [15:18:15] (03PS2) 10Dzahn: add project language 'gpe', Ghanaian Pidgin [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969) [15:19:57] (03PS1) 10Filippo Giunchedi: pontoon: install dhcp client [puppet] - 10https://gerrit.wikimedia.org/r/915697 [15:19:59] (03PS1) 10Filippo Giunchedi: pontoon: mark git repo dirs as safe. 
[puppet] - 10https://gerrit.wikimedia.org/r/915698 [15:21:08] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: mark git repo dirs as safe. [puppet] - 10https://gerrit.wikimedia.org/r/915698 (owner: 10Filippo Giunchedi) [15:21:13] !log adding new project language 'gpe' - https://en.wikipedia.org/wiki/Ghanaian_Pidgin_English [15:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:09] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: install dhcp client [puppet] - 10https://gerrit.wikimedia.org/r/915697 (owner: 10Filippo Giunchedi) [15:22:59] (03CR) 10Jcrespo: [C: 03+1] "I checked and both proxies point to the same hosts: db1164/db1217" [dns] - 10https://gerrit.wikimedia.org/r/915695 (owner: 10Marostegui) [15:23:01] (03CR) 10Muehlenhoff: [C: 03+2] Add bast2003 to Bastion hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [15:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P47582 and previous config saved to /var/cache/conftool/dbconfig/20230504-152304-ladsgroup.json [15:23:12] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695 (owner: 10Marostegui) [15:23:16] (03PS2) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695 [15:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47583 and previous config saved to /var/cache/conftool/dbconfig/20230504-152319-ladsgroup.json [15:24:40] !log Failover m1-master from dbproxy1012 to dbproxy1014 [15:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:06] (03CR) 10Dzahn: "good catch, I wonder how I managed to do this since I did a search/replace globally in my editor :p" [puppet] - 
10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn) [15:25:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P47584 and previous config saved to /var/cache/conftool/dbconfig/20230504-152506-ladsgroup.json [15:26:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P47585 and previous config saved to /var/cache/conftool/dbconfig/20230504-152640-ladsgroup.json [15:26:53] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [15:27:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [15:29:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:20] (03CR) 10Muehlenhoff: "One note inline, rest looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:29:28] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640 (owner: 10Muehlenhoff) [15:29:42] jouncebot: next [15:29:42] In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600) [15:30:01] (03CR) 10Filippo Giunchedi: [C: 03+1] kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:31:38] RECOVERY - Host mc2040 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms 
[15:32:24] (03PS4) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [15:32:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on moscovium.eqiad.wmnet with reason: reboot [15:32:44] RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:46] (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [15:32:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on moscovium.eqiad.wmnet with reason: reboot [15:33:08] !log moscovium (https://rt.wikimedia.org) - rebooting [15:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [15:34:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [15:38:02] !log doc2002 - rebooting [15:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47586 and previous config saved to /var/cache/conftool/dbconfig/20230504-153810-ladsgroup.json [15:38:15] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [15:38:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47587 and previous config saved to /var/cache/conftool/dbconfig/20230504-153825-ladsgroup.json [15:38:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [15:38:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47588 and previous config saved to /var/cache/conftool/dbconfig/20230504-153834-ladsgroup.json [15:38:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47589 and previous config saved to /var/cache/conftool/dbconfig/20230504-153850-ladsgroup.json [15:40:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P47590 and previous config saved to /var/cache/conftool/dbconfig/20230504-154012-ladsgroup.json [15:40:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47591 and previous config saved to /var/cache/conftool/dbconfig/20230504-154021-ladsgroup.json [15:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47592 and previous config saved to /var/cache/conftool/dbconfig/20230504-154146-ladsgroup.json 
[15:41:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [15:42:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [15:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47593 and previous config saved to /var/cache/conftool/dbconfig/20230504-154211-ladsgroup.json [15:43:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47594 and previous config saved to /var/cache/conftool/dbconfig/20230504-154344-ladsgroup.json [15:43:54] (03PS1) 10Chad: WIP: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/915701 (https://phabricator.wikimedia.org/T320390) [15:43:55] Hey all - I’d like to deploy a quick config backport if I can: https://gerrit.wikimedia.org/r/914823. Please let me know if I should wait. [15:45:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [15:46:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47595 and previous config saved to /var/cache/conftool/dbconfig/20230504-154630-ladsgroup.json [15:47:04] sbassett: yes please, deploys are currently locked [15:47:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [15:47:25] we were trying to do an LVS reimage and it is stalled. 
so let me revert the patch and I will let you know when it's done [15:47:42] jouncebot: nowandnext [15:47:42] For the next 0 hour(s) and 12 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400) [15:47:42] In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600) [15:47:57] sbassett: is it urgent? [15:48:17] (03PS2) 10Herron: kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) [15:48:20] the puppet request window is also empty, so it can be maintenance for that hour too [15:48:57] mutante: yeah that's true though I am not sure how much progress we will make + the additional time for reimaging [15:49:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org [15:49:16] sukhe: the calendar is wide open, no worries [15:49:18] so if sbassett's patch is urgent, I will remove the lock [15:49:41] (03PS1) 10Elukey: ml-services: add env variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414) [15:50:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:50:33] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add env variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [15:50:41] (03CR) 10Elukey: [C: 03+2] ml-services: add env 
variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey) [15:50:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47596 and previous config saved to /var/cache/conftool/dbconfig/20230504-155041-ladsgroup.json [15:50:48] (03CR) 10Herron: kafkamon: cleanup buster classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:50:55] sukhe: not urgent no [15:51:04] But would like to deploy before train [15:51:58] sbassett: thanks! and yes, I will make sure that I lift it before that [15:52:01] I will ping you [15:52:02] thanks for checking [15:52:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [15:53:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [15:53:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org [15:54:10] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:54:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[15:54:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org [15:55:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47597 and previous config saved to /var/cache/conftool/dbconfig/20230504-155518-ladsgroup.json [15:55:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:55:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47598 and previous config saved to /var/cache/conftool/dbconfig/20230504-155544-ladsgroup.json [15:57:19] (03CR) 10Herron: [C: 03+2] kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron) [15:57:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [15:57:35] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [15:58:14] (03CR) 10Dzahn: "maybe it should be a separate change, with reviewer Ryan Kemper, because we are touching config of production WDQS and WCWS with that part" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn) [15:58:29] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11-21].eqiad.wmnet: 
Upgrade Cassandra — T335383 - eevans@cumin1001 [15:58:48] (03CR) 10Dzahn: "this change could be merged first or second, doesn't matter until we want to remove the old name" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn) [15:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47599 and previous config saved to /var/cache/conftool/dbconfig/20230504-155850-ladsgroup.json [15:59:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org [16:00:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [16:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P47600 and previous config saved to /var/cache/conftool/dbconfig/20230504-160136-ladsgroup.json [16:01:53] (03PS1) 10Dzahn: wdqs/wcqs: change discovery name of backends for GUIs [puppet] - 10https://gerrit.wikimedia.org/r/915737 [16:01:58] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 3516 MB (3% inode=84%): /tmp 3516 MB (3% inode=84%): /var/tmp 3516 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [16:02:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) I think this is superseded by https://phabricator.wikimedia.org/T335941 but will look now into changing my email association. I was thinking I'd keep lorenjohnson@gmail.com for WikiTech and... [16:02:31] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [16:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47601 and previous config saved to /var/cache/conftool/dbconfig/20230504-160307-ladsgroup.json [16:03:13] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [16:04:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [16:05:31] (03CR) 10Dzahn: "Jelto, I am not even sure if I like my own change, haha. 
It kind of makes a switch-over for WDQS/WCQS more complex than before, and for th" [puppet] - 10https://gerrit.wikimedia.org/r/915737 (owner: 10Dzahn) [16:05:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P47602 and previous config saved to /var/cache/conftool/dbconfig/20230504-160547-ladsgroup.json [16:06:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [16:07:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [16:09:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [16:10:25] !log doc1002 (https://doc.wikimedia.org) - reboot, <1 min downtime [16:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on doc1002.eqiad.wmnet with reason: reboot [16:10:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1002.eqiad.wmnet with reason: reboot [16:10:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but just FYI a re-parse can be achieved with a GET request as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [16:11:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [16:12:35] !log doc1003 - rebooting [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet [16:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47603 and previous config saved to /var/cache/conftool/dbconfig/20230504-161356-ladsgroup.json [16:14:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - 10https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [16:14:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2002.codfw.wmnet [16:15:11] (03PS1) 10Muehlenhoff: Failover urldownloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/915741 [16:16:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P47604 and previous config saved to /var/cache/conftool/dbconfig/20230504-161643-ladsgroup.json [16:17:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) Ok, I've now updated my Phabricator address here to loren.johnson@wikimedia.de and aligned my WMDE mediawiki.org account to the same address. 
[16:18:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P47605 and previous config saved to /var/cache/conftool/dbconfig/20230504-161813-ladsgroup.json [16:19:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [16:20:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P47606 and previous config saved to /var/cache/conftool/dbconfig/20230504-162055-ladsgroup.json [16:23:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2002.codfw.wmnet [16:24:17] (03PS1) 10Ssingh: Revert "lvs2011: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/915708 [16:26:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [16:26:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [16:27:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit1003.wikimedia.org with reason: reboot [16:27:37] (03CR) 10Ssingh: [C: 03+2] Revert "lvs2011: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/915708 (owner: 10Ssingh) [16:27:55] !log gerrit1003 (gerrit-new.wikimedia.org) - rebooting [16:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit1003.wikimedia.org with reason: reboot [16:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47607 and previous config saved to /var/cache/conftool/dbconfig/20230504-162902-ladsgroup.json [16:29:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [16:29:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [16:29:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47608 and previous config saved to /var/cache/conftool/dbconfig/20230504-162926-ladsgroup.json [16:30:44] !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 152m 23s) [16:30:48] T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 [16:30:59] sbassett: please feel free to deploy! [16:31:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 211.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [16:31:13] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:43] ==> deploys are now unblocked [16:31:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47609 and previous config saved to /var/cache/conftool/dbconfig/20230504-163149-ladsgroup.json [16:32:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [16:33:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P47610 and previous config saved to /var/cache/conftool/dbconfig/20230504-163319-ladsgroup.json [16:33:23] (03PS5) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [16:33:25] (03PS6) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) [16:33:27] (03PS6) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) [16:33:29] (03PS2) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) [16:33:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on etherpad1003.eqiad.wmnet with reason: reboot [16:34:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1003.eqiad.wmnet with reason: reboot [16:34:04] !log etherpad1003 (https://etherpad.wikimedia.org) rebooting, 1 min downtime [16:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:13] hnowlan: is 
that expected? ^^ [16:34:41] got the page [16:35:19] herron: nope, looking [16:36:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47611 and previous config saved to /var/cache/conftool/dbconfig/20230504-163601-ladsgroup.json [16:36:02] it feels like the alert is not new but the part that it p.ages is? [16:36:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [16:36:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [16:36:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [16:36:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [16:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T335845)', diff saved to https://phabricator.wikimedia.org/P47612 and previous config saved to /var/cache/conftool/dbconfig/20230504-163626-ladsgroup.json [16:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47613 and previous config saved to /var/cache/conftool/dbconfig/20230504-163646-ladsgroup.json [16:37:02] (03PS2) 10JMeybohm: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus) [16:37:03] (03PS4) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 
10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [16:37:05] (03PS1) 10JMeybohm: Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) [16:38:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:24] (03CR) 10JMeybohm: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus) [16:39:23] !log extending logical volume of backup1003, backup2003 for backup storage [16:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47614 and previous config saved to /var/cache/conftool/dbconfig/20230504-164004-ladsgroup.json [16:41:23] cwhite, herron: sorry for the noise - looks like a spike in traffic. I'll add back in some capacity to stop it happening for the rest of the day [16:41:35] hnowlan: sounds good thanks! 
[16:42:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T335845)', diff saved to https://phabricator.wikimedia.org/P47615 and previous config saved to /var/cache/conftool/dbconfig/20230504-164247-ladsgroup.json [16:42:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [16:43:02] sukhe: thanks! [16:43:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sbassett@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett) [16:44:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet [16:44:30] (03Merged) 10jenkins-bot: Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett) [16:44:59] !log sbassett@deploy1002 Started scap: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]] [16:45:03] T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940 [16:45:06] !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet [16:46:00] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [16:46:00] 10SRE-Access-Requests, 10Phabricator, 10serviceops-collab: let Eoghan see security tickets in Phabricator - https://phabricator.wikimedia.org/T335981 (10Dzahn) [16:46:08] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:18] 10SRE-Access-Requests, 10Phabricator, 10serviceops-collab: let Eoghan see security tickets in Phabricator - https://phabricator.wikimedia.org/T335981 (10Dzahn) 
[16:46:26] !log sbassett@deploy1002 sbassett: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [16:48:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47616 and previous config saved to /var/cache/conftool/dbconfig/20230504-164826-ladsgroup.json [16:48:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:48:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47617 and previous config saved to /var/cache/conftool/dbconfig/20230504-164850-ladsgroup.json [16:51:36] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [16:51:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47618 and previous config saved to /var/cache/conftool/dbconfig/20230504-165152-ladsgroup.json [16:52:03] !log sbassett@deploy1002 Finished scap: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]] (duration: 07m 04s) [16:52:06] T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940 [16:52:53] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [16:53:58] (03Abandoned) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe) 
[16:55:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P47619 and previous config saved to /var/cache/conftool/dbconfig/20230504-165511-ladsgroup.json [16:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47620 and previous config saved to /var/cache/conftool/dbconfig/20230504-165521-ladsgroup.json [16:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P47621 and previous config saved to /var/cache/conftool/dbconfig/20230504-165753-ladsgroup.json [16:58:48] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [16:58:54] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:06] brennen and mutante: OwO what's this, a deployment window?? Phabricator update window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700). nyaa~ [17:00:06] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700). 
[17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700) [17:00:17] o/ [17:00:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host lvs2011.codfw.wmnet [17:00:50] jouncebot: now [17:00:50] For the next 0 hour(s) and 59 minute(s): Phabricator update window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700) [17:00:51] For the next 0 hour(s) and 59 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700) [17:00:51] For the next 0 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700) [17:01:07] !log Phabricator upgrade - maintenance incoming [17:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:17] Good luck! [17:01:36] a minute or two while i juggle deployment repo state and then i'm ready to run scap [17:01:41] ty:) [17:01:55] $deityspeed mutante! 
[17:02:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: maintenance upgrade [17:02:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: maintenance upgrade [17:03:03] (03PS2) 10JMeybohm: Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) [17:03:05] (03PS5) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus) [17:03:10] thanks, brennen does the actual work:) [17:03:29] downtimed [17:03:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: maintenance upgrade [17:03:49] brennen: phab2002 will go first? [17:03:51] mutante: cool, will do phab2002 then the prod box. [17:03:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: maintenance upgrade [17:03:54] :) [17:04:15] Nothing for me to deploy in the Technical Engagement slot this week. 
[17:04:31] we should also reboot aphlict if we have time for it [17:04:33] in the window [17:04:34] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) User has been contacted for verification [17:04:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet [17:05:00] thx bd808 [17:05:11] (sorry to step on window) [17:05:15] !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab2002 [17:05:23] brennen: it's all good :) [17:05:27] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) [17:05:40] (03CR) 10JMeybohm: "After this we can merge https://gerrit.wikimedia.org/r/c/integration/config/+/914785" [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) (owner: 10JMeybohm) [17:05:47] there is still that ONE Icinga check left, that is for phab.wmfusercontent.org [17:05:53] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab2002 (duration: 00m 37s) [17:05:58] and cookbook won't find that host [17:06:04] because it's a service name [17:06:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47622 and previous config saved to /var/cache/conftool/dbconfig/20230504-170658-ladsgroup.json [17:07:33] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) uid: sg912 uidNumber: 41194 [17:07:56] !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab1004 [17:08:31] !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab1004 (duration: 00m 34s) [17:08:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single 
for host mc2046.codfw.wmnet [17:09:42] !log phab1004 deployed and restarted, phab up, MR widget still seems to work [17:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:59] :) [17:10:04] wfm [17:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P47623 and previous config saved to /var/cache/conftool/dbconfig/20230504-171017-ladsgroup.json [17:10:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P47624 and previous config saved to /var/cache/conftool/dbconfig/20230504-171028-ladsgroup.json [17:11:26] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet [17:12:52] mutante: i'm good, no objections here to an aphlict restart if needed [17:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P47625 and previous config saved to /var/cache/conftool/dbconfig/20230504-171300-ladsgroup.json [17:13:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:13:32] brennen: great! 
thank you, one moment [17:15:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [17:16:20] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:43] !log aphlict2001 - not active, rebooting [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:19:43] (03PS1) 10Raymond Ndibe: toolforge: add tekton metrics to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/915771 (https://phabricator.wikimedia.org/T325163) [17:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47626 and previous config saved to /var/cache/conftool/dbconfig/20230504-172204-ladsgroup.json [17:22:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:22:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:22:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001 [17:22:28] T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383 [17:22:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47627 and previous config saved to 
/var/cache/conftool/dbconfig/20230504-172228-ladsgroup.json [17:24:48] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [17:24:49] (03PS3) 10Ebernhardson: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) [17:25:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47628 and previous config saved to /var/cache/conftool/dbconfig/20230504-172523-ladsgroup.json [17:25:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [17:25:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [17:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P47629 and previous config saved to /var/cache/conftool/dbconfig/20230504-172534-ladsgroup.json [17:25:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [17:25:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47630 and previous config saved to /var/cache/conftool/dbconfig/20230504-172546-ladsgroup.json [17:25:49] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:26:38] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict2001.codfw.wmnet [17:27:13] (03PS1) 10Effie Mouzeli: data.yaml: Add Hasan Akgün (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101) [17:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P47631 and previous config saved to /var/cache/conftool/dbconfig/20230504-172806-ladsgroup.json [17:28:08] (03CR) 10RLazarus: [C: 03+2] thanos: Migrate from 100-scale to unit-scale SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus) [17:28:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [17:28:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [17:28:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:28:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:28:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47632 and previous config saved to /var/cache/conftool/dbconfig/20230504-172835-ladsgroup.json [17:28:38] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:04] (03PS2) 10Effie Mouzeli: data.yaml: Add Hasan Akgün (WMDE) to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101) [17:29:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47633 and previous config saved to /var/cache/conftool/dbconfig/20230504-172932-ladsgroup.json [17:30:46] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict2001.codfw.wmnet [17:30:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 
people1003.eqiad.wmnet with reason: maintenance upgrade [17:31:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on people1003.eqiad.wmnet with reason: maintenance upgrade [17:31:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [17:31:27] !log people1003 - rebooting [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:43] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict1002.eqiad.wmnet [17:32:09] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [17:32:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [17:32:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... 
[17:32:32] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:32:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [17:33:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47634 and previous config saved to /var/cache/conftool/dbconfig/20230504-173309-ladsgroup.json [17:35:38] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1002.eqiad.wmnet [17:35:46] (03PS1) 10Effie Mouzeli: data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) [17:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47635 and previous config saved to /var/cache/conftool/dbconfig/20230504-173555-ladsgroup.json [17:37:10] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [17:37:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [17:38:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10jijiki) [17:40:06] (03CR) 10Krinkle: "@Marostegui We're ready for it. The puppet patches are ready from my side. 
We might make minor tweaks still but this and prod equiv can go" [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [17:40:14] (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed LDAP and has approval from Tyler" [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) (owner: 10Effie Mouzeli) [17:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47637 and previous config saved to /var/cache/conftool/dbconfig/20230504-174040-ladsgroup.json [17:41:15] 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Dzahn) [17:42:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [17:42:57] (03PS2) 10Effie Mouzeli: data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) [17:42:59] (03PS1) 10Effie Mouzeli: data.yaml: Add Julia Kieserman to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) [17:44:38] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [17:44:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47638 and previous config saved to /var/cache/conftool/dbconfig/20230504-174438-ladsgroup.json [17:45:10] (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed LDAP and has group approval" [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) (owner: 10Effie Mouzeli) [17:47:06] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:16] PROBLEM - 
puppet last run on prometheus5002 is CRITICAL: CRITICAL: Puppet has been disabled for 604811 seconds, message: Disabling Puppet and Thanos sidecar as part of the migration of Prometheus hosts to Bullseye - T309979 - denisse, last run 6 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:48:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P47639 and previous config saved to /var/cache/conftool/dbconfig/20230504-174815-ladsgroup.json [17:48:34] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:48:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [17:48:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [17:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P47640 and previous config saved to /var/cache/conftool/dbconfig/20230504-175102-ladsgroup.json [17:51:16] (03PS1) 10Dwisehaupt: Direct frbast.wm.o at the new frbast1002 host [dns] - 10https://gerrit.wikimedia.org/r/915811 (https://phabricator.wikimedia.org/T319460) [17:51:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [17:53:14] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [17:53:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [17:54:41] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:54:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) for the record: It's perfectly fine to have 2 accounts, one with work email and one with volunteer/personal email, if you really prefer that. some people do this, others don't [17:54:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [17:58:52] (03CR) 10Jgreen: [C: 03+2] Direct frbast.wm.o at the new frbast1002 host [dns] - 10https://gerrit.wikimedia.org/r/915811 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt) [17:59:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [17:59:32] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47641 and previous config saved to /var/cache/conftool/dbconfig/20230504-175945-ladsgroup.json [18:00:05] brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1800). 
[18:01:22] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) @odimitrijevic or @Ottomata can you please approve this request for the group analytics-privatedata-users ? [18:02:01] o/ [18:03:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P47642 and previous config saved to /var/cache/conftool/dbconfig/20230504-180322-ladsgroup.json [18:03:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:03:44] checking out a couple of log messages before rolling train. [18:04:42] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [18:04:48] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [18:04:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [18:04:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... 
[18:05:54] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [18:06:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [18:06:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P47643 and previous config saved to /var/cache/conftool/dbconfig/20230504-180608-ladsgroup.json [18:08:11] !log train 1.41.0-wmf.7 (T330213): logs fairly quiet and no current blockers, rolling to group2 [18:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:15] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:08:45] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213) [18:08:47] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [18:09:53] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [18:11:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [18:11:16] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [18:11:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [18:12:11] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [18:12:30] !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification provided) [18:12:59] !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 28s) [18:13:05] !log cmooney@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011'] [18:14:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) [18:14:43] !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011'] [18:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47644 and previous config saved to /var/cache/conftool/dbconfig/20230504-181451-ladsgroup.json [18:14:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:14:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [18:15:04] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10Ottomata) Approved [18:15:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:15:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47645 and previous config saved to /var/cache/conftool/dbconfig/20230504-181516-ladsgroup.json [18:15:39] !log cmooney@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011'] [18:16:35] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.7 refs T330213 [18:16:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47646 and previous config saved to /var/cache/conftool/dbconfig/20230504-181636-ladsgroup.json [18:16:40] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:16:56] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:05] !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [18:17:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [18:17:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47647 and previous config saved to /var/cache/conftool/dbconfig/20230504-181828-ladsgroup.json [18:18:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [18:18:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance 
[18:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T335838)', diff saved to https://phabricator.wikimedia.org/P47648 and previous config saved to /var/cache/conftool/dbconfig/20230504-181851-ladsgroup.json [18:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47649 and previous config saved to /var/cache/conftool/dbconfig/20230504-182114-ladsgroup.json [18:21:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [18:21:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [18:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47650 and previous config saved to /var/cache/conftool/dbconfig/20230504-182139-ladsgroup.json [18:21:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [18:22:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:22:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47651 and previous config saved to /var/cache/conftool/dbconfig/20230504-182238-ladsgroup.json [18:23:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47652 and previous config saved to /var/cache/conftool/dbconfig/20230504-182301-ladsgroup.json [18:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1148 (T335838)', diff saved to https://phabricator.wikimedia.org/P47653 and previous config saved to /var/cache/conftool/dbconfig/20230504-182418-ladsgroup.json [18:24:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [18:28:02] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:45] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [18:28:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [18:29:31] (03PS1) 10Ottomata: page_content_change - bump to v1.15.0-dev2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915817 (https://phabricator.wikimedia.org/T332948) [18:29:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [18:30:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47654 and previous config saved to /var/cache/conftool/dbconfig/20230504-183010-ladsgroup.json [18:31:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [18:31:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [18:37:05] !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification 
provided) [18:37:11] (03CR) 10Ottomata: [C: 03+2] page_content_change - bump to v1.15.0-dev2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915817 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata) [18:37:15] !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 09s) [18:37:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P47655 and previous config saved to /var/cache/conftool/dbconfig/20230504-183744-ladsgroup.json [18:38:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [18:39:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P47656 and previous config saved to /var/cache/conftool/dbconfig/20230504-183925-ladsgroup.json [18:44:30] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [18:45:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P47657 and previous config saved to /var/cache/conftool/dbconfig/20230504-184516-ladsgroup.json [18:46:40] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [18:50:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [18:51:05] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Aklapper) [general comment] For staff and contractors of legal entities I strongly recommend using an account for paid work that's clearly identifiable as a 
work account for the sake of transparency... [18:52:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P47658 and previous config saved to /var/cache/conftool/dbconfig/20230504-185250-ladsgroup.json [18:52:54] (03PS1) 10Jdlrobson: Fix file page integration [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997) [18:54:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [18:54:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P47659 and previous config saved to /var/cache/conftool/dbconfig/20230504-185431-ladsgroup.json [18:55:08] what's ipmiseld.service ? [18:55:42] it got resolved, nevermind [18:57:44] 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper) @FJoseph-WMF: Hi, please follow the docs at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Turnilo#Access and see the corresponding Phabricator form which has a... 
[18:59:06] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:54] RECOVERY - Check systemd state on elastic1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P47660 and previous config saved to /var/cache/conftool/dbconfig/20230504-190022-ladsgroup.json [19:01:13] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:32] !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification provided) [19:02:36] !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 03s) [19:04:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [19:04:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [19:07:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47661 and previous config saved to /var/cache/conftool/dbconfig/20230504-190757-ladsgroup.json [19:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 
(T335838)', diff saved to https://phabricator.wikimedia.org/P47662 and previous config saved to /var/cache/conftool/dbconfig/20230504-190937-ladsgroup.json [19:09:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:09:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [19:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47663 and previous config saved to /var/cache/conftool/dbconfig/20230504-191001-ladsgroup.json [19:10:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [19:11:17] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [19:15:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47664 and previous config saved to /var/cache/conftool/dbconfig/20230504-191528-ladsgroup.json [19:16:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47665 and previous config saved to /var/cache/conftool/dbconfig/20230504-191612-ladsgroup.json [19:16:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47666 and previous config saved to /var/cache/conftool/dbconfig/20230504-191623-ladsgroup.json [19:21:20] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [19:23:34] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [19:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T335845)', 
diff saved to https://phabricator.wikimedia.org/P47667 and previous config saved to /var/cache/conftool/dbconfig/20230504-192747-ladsgroup.json [19:28:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [19:29:09] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet [19:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P47668 and previous config saved to /var/cache/conftool/dbconfig/20230504-193118-ladsgroup.json [19:31:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P47669 and previous config saved to /var/cache/conftool/dbconfig/20230504-193129-ladsgroup.json [19:32:19] (03PS1) 10Dzahn: gerrit: disable replication from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) [19:38:08] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [19:38:16] (03CR) 10Dzahn: "So, if $replication = lookup('profile::gerrit::replication'),is not set then in modules/gerrit/templates/replication.config.erb the "remot" [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: 10Dzahn) [19:38:41] (03PS2) 10Dzahn: gerrit: disable replication from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) [19:42:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [19:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P47670 and previous config saved to /var/cache/conftool/dbconfig/20230504-194254-ladsgroup.json [19:43:52] (03CR) 10Dzahn: [V: 03+1] "actual compiler diff: 
https://puppet-compiler.wmflabs.org/output/915830/41055/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: 10Dzahn) [19:44:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (10phaultfinder) [19:45:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [19:46:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P47671 and previous config saved to /var/cache/conftool/dbconfig/20230504-194624-ladsgroup.json [19:46:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P47672 and previous config saved to /var/cache/conftool/dbconfig/20230504-194635-ladsgroup.json [19:47:08] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [19:52:38] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10Jclark-ctr) a:03Jclark-ctr [19:56:37] (03PS1) 10Ottomata: page_content_change - bump t0 v1.15.0-dev3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915832 [19:57:52] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P47673 and previous config saved to /var/cache/conftool/dbconfig/20230504-195800-ladsgroup.json [19:58:04] (03CR) 10Ottomata: [C: 03+2] page_content_change - bump t0 v1.15.0-dev3 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/915832 (owner: 10Ottomata) [20:00:05] brennen and TheresNoTime: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:50] !log people2002 (people.wikimedia.org) reboot, <1 min downtime [20:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on people2002.codfw.wmnet with reason: maintenance upgrade [20:01:18] present [20:01:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on people2002.codfw.wmnet with reason: maintenance upgrade [20:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47674 and previous config saved to /var/cache/conftool/dbconfig/20230504-200131-ladsgroup.json [20:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47675 and previous config saved to /var/cache/conftool/dbconfig/20230504-200141-ladsgroup.json [20:01:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:01:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:03:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on miscweb1003.eqiad.wmnet with reason: reboot [20:03:24] brennen: around? I think Sammy is out today. 
[20:03:30] Jdlrobson: yeah, i can sling that out [20:03:35] thanks :) [20:03:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on miscweb1003.eqiad.wmnet with reason: reboot [20:03:53] (03PS1) 10Ottomata: page_content_change - remove python.fn-execution.bundle.size setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/915834 (https://phabricator.wikimedia.org/T332948) [20:03:58] !log miscweb1003 - rebooting [20:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997) (owner: 10Jdlrobson) [20:05:27] (03CR) 10Ottomata: [C: 03+2] page_content_change - remove python.fn-execution.bundle.size setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/915834 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata) [20:06:15] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:06:18] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:06:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [20:06:30] (03Merged) 10jenkins-bot: Fix file page integration [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997) (owner: 10Jdlrobson) [20:06:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [20:06:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47676 and previous config saved 
to /var/cache/conftool/dbconfig/20230504-200644-ladsgroup.json [20:06:48] !log brennen@deploy1002 Started scap: Backport for [[gerrit:915709|Fix file page integration (T335997)]] [20:06:51] T335997: MMV broken on file page (TypeError: Cannot read properties of undefined (reading 'then') / TypeError: undefined is not an object (evaluating 'bs.openImage(this,title).then')) - https://phabricator.wikimedia.org/T335997 [20:08:14] !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:915709|Fix file page integration (T335997)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:08:30] Jdlrobson: lemme know when to proceed [20:09:01] brennen: looking now [20:09:50] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10Jclark-ctr) 05Open→03Resolved reseated power supply [20:10:23] LGTM brennen please sync [20:11:50] (03CR) 10RLazarus: [C: 03+2] mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - 10https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [20:12:07] goin' [20:12:08] (03CR) 10Ryan Kemper: [C: 03+2] elastic: remove redundant usage [cookbooks] - 10https://gerrit.wikimedia.org/r/915092 (owner: 10Ryan Kemper) [20:12:34] (03PS1) 10Ryan Kemper: elastic: extend downtime for operations [cookbooks] - 10https://gerrit.wikimedia.org/r/915836 [20:13:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47677 and previous config saved to /var/cache/conftool/dbconfig/20230504-201306-ladsgroup.json [20:13:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [20:13:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [20:13:29] (03CR) 
10Ryan Kemper: [V: 03+2 C: 03+2] elastic: extend downtime for operations [cookbooks] - 10https://gerrit.wikimedia.org/r/915836 (owner: 10Ryan Kemper) [20:13:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47678 and previous config saved to /var/cache/conftool/dbconfig/20230504-201332-ladsgroup.json [20:14:45] (03PS1) 10Ottomata: page_content_change - v1.15.0-dev4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915837 [20:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47679 and previous config saved to /var/cache/conftool/dbconfig/20230504-201514-ladsgroup.json [20:16:26] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:38] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:915709|Fix file page integration (T335997)]] (duration: 10m 50s) [20:17:42] T335997: MMV broken on file page (TypeError: Cannot read properties of undefined (reading 'then') / TypeError: undefined is not an object (evaluating 'bs.openImage(this,title).then')) - https://phabricator.wikimedia.org/T335997 [20:18:23] (03CR) 10Ottomata: [C: 03+2] page_content_change - v1.15.0-dev4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915837 (owner: 10Ottomata) [20:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:10] (03PS5) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) [20:19:39] !log otto@deploy1002 helmfile [dse-k8s-eqiad] 
START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:19:42] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:19:48] (03CR) 10RLazarus: [C: 03+2] mediawiki-cache-warmup: Add POSTs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus) [20:19:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47680 and previous config saved to /var/cache/conftool/dbconfig/20230504-201955-ladsgroup.json [20:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:27:57] 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF) @Aklapper I'm new to the foundation. I reached out to ITS they said to open a phab ticket and tag SRE. There was an autocomplete option for SRE access request - it seemed t... 
[20:28:44] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:06] RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [20:29:51] (03CR) 10Xcollazo: "PPC is happy with the changes https://puppet-compiler.wmflabs.org/output/914928/41056/" [puppet] - 10https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: 10Xcollazo) [20:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P47681 and previous config saved to /var/cache/conftool/dbconfig/20230504-203021-ladsgroup.json [20:30:32] gonna do a train rollback here. [20:31:10] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213) [20:31:12] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [20:32:01] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot) [20:33:37] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10RLazarus) 892570 is merged now, and I think we'll be in better shape for the next one. @Clement_Goubert I'm tempted to resolve this, and reopen if we... 
[20:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P47682 and previous config saved to /var/cache/conftool/dbconfig/20230504-203501-ladsgroup.json [20:35:11] 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF) This ticket can be closed out. [20:35:14] 10SRE, 10LDAP-Access-Requests: Grant Access to fjoseph for Fjoseph - https://phabricator.wikimedia.org/T336009 (10FJoseph-WMF) [20:40:16] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:45:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P47683 and previous config saved to /var/cache/conftool/dbconfig/20230504-204527-ladsgroup.json [20:46:55] (03CR) 10Xcollazo: "This is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: 10Xcollazo) [20:47:28] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P47684 and previous config saved to /var/cache/conftool/dbconfig/20230504-205007-ladsgroup.json [20:51:13] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.7 refs T330213 [20:51:16] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [20:52:06] 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper) >>! In T335967#8828179, @FJoseph-WMF wrote: > they said to open a phab ticket and tag SRE Ah, thanks a lot, that's useful to know. I asked as I'm always curious how to improv... 
[20:52:15] 10SRE, 10LDAP-Access-Requests: Grant Access to fjoseph for Fjoseph - https://phabricator.wikimedia.org/T336009 (10Aklapper) [20:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:38] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf and Turnilo for Fjoseph - https://phabricator.wikimedia.org/T336009 (10Aklapper) [20:52:40] 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper) [20:57:16] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.7 refs T330213 (duration: 06m 02s) [20:57:20] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [20:57:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:58:16] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47685 and previous config saved to /var/cache/conftool/dbconfig/20230504-210033-ladsgroup.json [21:00:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [21:00:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [21:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47686 and previous config saved to /var/cache/conftool/dbconfig/20230504-210057-ladsgroup.json [21:02:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47687 and previous config saved to /var/cache/conftool/dbconfig/20230504-210513-ladsgroup.json [21:05:57] (03CR) 10Btullis: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [21:09:05] (03CR) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [21:09:16] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:09:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47688 and previous config saved to /var/cache/conftool/dbconfig/20230504-210928-ladsgroup.json [21:14:00] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:16:58] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:30] (03PS56) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 
10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [21:18:33] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41058/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [21:21:18] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:22:58] (03PS19) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [21:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P47689 and previous config saved to /var/cache/conftool/dbconfig/20230504-212434-ladsgroup.json [21:26:05] (03PS1) 10Eevans: deployment-prep: upgrade Cassandra (restbase) to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915846 (https://phabricator.wikimedia.org/T335383) [21:28:04] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:52] (03CR) 10Eevans: [C: 03+2] deployment-prep: upgrade Cassandra (restbase) to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915846 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans) [21:36:26] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P47690 and previous config saved to 
/var/cache/conftool/dbconfig/20230504-213941-ladsgroup.json [21:42:32] (03PS1) 10Brennen Bearnes: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) [21:44:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:44:45] (03PS1) 10Barakat Ajadi: CentralNoticeTiming: Remove enablement of the topic for legacy eventlogging and refinery [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) [21:45:30] (03CR) 10Ladsgroup: [C: 03+2] api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes) [21:45:50] brennen: I'm deploying the fix. Wanna roll forward afterwards? [21:46:52] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:59] (03CR) 10CI reject: [V: 04-1] CentralNoticeTiming: Remove enablement of the topic for legacy eventlogging and refinery [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [21:47:38] Amir1: from discussion at https://phabricator.wikimedia.org/T336008#8828350 i'm not sure this will reduce the error rate? [21:50:42] ah, yeah, it's mostly for logging [21:51:08] let's see if it surfaces any useful logging at group1? [21:51:45] i'm kind of fried here, i think it might be more sensible of me to pause train for the day. 
[21:52:15] i have already tempted the deployment gods enough for one day [21:54:12] (03CR) 10Ladsgroup: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes) [21:54:25] okay, I removed my +2 from the backport [21:54:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47691 and previous config saved to /var/cache/conftool/dbconfig/20230504-215447-ladsgroup.json [21:54:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [21:55:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [21:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47692 and previous config saved to /var/cache/conftool/dbconfig/20230504-215511-ladsgroup.json [21:58:09] Amir1: i'll go ahead with the backport, but leave wmf.7 at group1. hopefully that gets something useful on the bug. [21:58:25] (03CR) 10Ladsgroup: [C: 03+2] api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes) [21:58:32] okay :D [21:58:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes) [21:58:51] in the morning perhaps i will be smarter. 
:D [21:59:10] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:47] well, technically it's morning here now [22:01:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47693 and previous config saved to /var/cache/conftool/dbconfig/20230504-220127-ladsgroup.json [22:02:35] (03Merged) 10jenkins-bot: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes) [22:03:03] !log brennen@deploy1002 Started scap: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]] [22:03:07] T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008 [22:03:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [22:04:32] !log brennen@deploy1002 brennen: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:12:11] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]] (duration: 09m 07s) [22:12:18] T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008 [22:12:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST 
blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P47694 and previous config saved to /var/cache/conftool/dbconfig/20230504-221633-ladsgroup.json [22:17:34] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:22:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet [22:28:22] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:31:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P47695 and previous config saved to /var/cache/conftool/dbconfig/20230504-223139-ladsgroup.json [22:46:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47696 and previous config saved to 
/var/cache/conftool/dbconfig/20230504-224646-ladsgroup.json [22:46:48] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:47:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:49:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:49:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:49:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [22:50:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [22:50:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47697 and previous config saved to /var/cache/conftool/dbconfig/20230504-225013-ladsgroup.json [22:53:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:53:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:53:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47698 and previous config saved to /var/cache/conftool/dbconfig/20230504-225336-ladsgroup.json [22:57:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47699 and previous config saved to /var/cache/conftool/dbconfig/20230504-225747-ladsgroup.json [22:59:12] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47700 and previous config saved to /var/cache/conftool/dbconfig/20230504-230001-ladsgroup.json [23:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:12:32] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10RLazarus) 05Open→03Resolved Boldly resolving -- last I heard from Haroon, everyone's satisfied with this explanation. Feel free to reopen if there are... 
[23:12:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P47701 and previous config saved to /var/cache/conftool/dbconfig/20230504-231254-ladsgroup.json [23:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47702 and previous config saved to /var/cache/conftool/dbconfig/20230504-231507-ladsgroup.json [23:17:58] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P47703 and previous config saved to /var/cache/conftool/dbconfig/20230504-232800-ladsgroup.json [23:29:06] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47704 and previous config saved to /var/cache/conftool/dbconfig/20230504-233013-ladsgroup.json [23:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47705 and previous config saved to /var/cache/conftool/dbconfig/20230504-234306-ladsgroup.json [23:43:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:43:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T335845)', diff saved to 
https://phabricator.wikimedia.org/P47706 and previous config saved to /var/cache/conftool/dbconfig/20230504-234330-ladsgroup.json [23:44:44] 10SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (10HShaikh) Sorry I thought I had already responded but it seems I forgot to hit submit on the ticket. Reuven is correct we are fine with the current explanat... [23:45:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47707 and previous config saved to /var/cache/conftool/dbconfig/20230504-234520-ladsgroup.json [23:45:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [23:45:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [23:45:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T335845)', diff saved to https://phabricator.wikimedia.org/P47708 and previous config saved to /var/cache/conftool/dbconfig/20230504-234544-ladsgroup.json [23:46:32] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T335845)', diff saved to https://phabricator.wikimedia.org/P47709 and previous config saved to /var/cache/conftool/dbconfig/20230504-234840-ladsgroup.json [23:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T335845)', diff saved to https://phabricator.wikimedia.org/P47710 and previous config saved to /var/cache/conftool/dbconfig/20230504-235326-ladsgroup.json [23:59:18] RECOVERY - Check systemd state on db2180 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state