[00:20:51] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-04-25 00:00:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[00:39:19] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811
[00:39:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811 (owner: 10TrainBranchBot)
[00:41:47] <wikibugs>	 (03PS1) 10RLazarus: thanos: Migrate from 100-scale to unit-scale SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615)
[00:42:10] <wikibugs>	 (03PS1) 10RLazarus: Migrate from 100-scale to unit-scale SLO recording rules [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914946 (https://phabricator.wikimedia.org/T289615)
[00:51:25] <icinga-wm>	 RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2023-05-02 15:17:24 (4378 GiB, +0.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:55:43] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/914811 (owner: 10TrainBranchBot)
[02:07:54] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:17] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder)
[02:22:54] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[03:36:29] <wikibugs>	 (03PS1) 10Jdrewniak: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686)
[03:36:31] <wikibugs>	 (03PS1) 10Jdrewniak: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686)
[03:36:33] <wikibugs>	 (03PS1) 10Jdrewniak: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686)
[04:10:55] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[04:12:23] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[04:29:52] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 50 hosts with reason: Rolling reboot of eqiad for T335835
[04:30:26] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 50 hosts with reason: Rolling reboot of eqiad for T335835
[04:33:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[04:36:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:36:58] <ryankemper>	 !log [Elastic] Beginning rolling reboot of eqiad elastic, 3 nodes at a time, `ryankemper@cumin1001` tmux session `reboot_eqiad`
[04:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:25] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:38:22] <ryankemper>	 !log [Elastic] Reboot operation failed w/ (likely transient) read timeouts, will try again in 10 mins
[04:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:51] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling reboot T335835
[04:39:15] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on relforge[1003-1004].eqiad.wmnet with reason: Rolling reboot T335835
[04:43:14] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: remove redundant usage [cookbooks] - 10https://gerrit.wikimedia.org/r/915092
[04:45:57] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T335835
[04:47:29] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:38] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Rolling reboot for T335835
[04:47:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Rolling reboot for T335835
[04:49:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:50:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot
[04:51:26] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:52:29] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:53:57] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 182, active_shards: 364, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:54:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:08] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T335835
[05:00:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:15] <icinga-wm>	 RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:07] <icinga-wm>	 PROBLEM - Check systemd state on elastic1091 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service,wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:01] <icinga-wm>	 PROBLEM - Check systemd state on elastic1096 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic1101 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:31] <icinga-wm>	 PROBLEM - Check systemd state on elastic1099 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:47] <ryankemper>	 Unfortunately these `Check systemd state` alerts don't seem to be suppressed by the icinga downtime. Sorry for the noise
[05:25:49] <icinga-wm>	 RECOVERY - Check systemd state on elastic1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:53] <icinga-wm>	 RECOVERY - Check systemd state on elastic1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:37] <icinga-wm>	 PROBLEM - Check systemd state on elastic1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:17] <icinga-wm>	 RECOVERY - Check systemd state on elastic1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127
[05:44:33] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:41] <icinga-wm>	 PROBLEM - Check systemd state on elastic1084 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:49:39] <icinga-wm>	 PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[05:51:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[05:53:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5003.wikimedia.org
[05:54:08] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[05:56:31] <icinga-wm>	 RECOVERY - Check systemd state on elastic1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host bast5003.wikimedia.org
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0600).
[06:01:53] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.decommission for hosts test-reimage2001.codfw.wmnet
[06:05:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[06:05:52] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox
[06:07:00] <wikibugs>	 (03PS1) 10Slyngshede: site.pp: decommision test-reimage2001 [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835)
[06:07:57] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test-reimage2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1001"
[06:08:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[06:10:37] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test-reimage2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1001"
[06:10:37] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:10:38] <logmsgbot>	 !log slyngshede@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts test-reimage2001.codfw.wmnet
[06:15:02] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder)
[06:18:00] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[06:22:06] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.07 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:22:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835) (owner: 10Slyngshede)
[06:23:11] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] site.pp: decommision test-reimage2001 [puppet] - 10https://gerrit.wikimedia.org/r/915144 (https://phabricator.wikimedia.org/T335835) (owner: 10Slyngshede)
[06:24:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4004.wikimedia.org
[06:25:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[06:26:00] <wikibugs>	 (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722)
[06:26:30] <wikibugs>	 (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915151 (https://phabricator.wikimedia.org/T334722)
[06:27:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[06:27:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm
[06:27:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm completed: - sretest1002 (**PAS...
[06:27:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915151 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui)
[06:30:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4004.wikimedia.org
[06:30:38] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.3 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:31:03] <wikibugs>	 (03PS1) 10KartikMistry: Update MinT to 2023-05-04-054118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915215
[06:31:47] <wikibugs>	 (03CR) 10Muehlenhoff: Django 3.2 support (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/905158 (owner: 10Slyngshede)
[06:33:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast2003.wikimedia.org with OS bookworm
[06:33:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast2003.wikimedia.org with OS bookworm
[06:35:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui)
[06:36:31] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.82 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:40:57] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-04-054118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915215 (owner: 10KartikMistry)
[06:43:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org
[06:43:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update MinT to 2023-05-04-054118-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915215 (owner: 10KartikMistry)
[06:44:12] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:44:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[06:44:26] <marostegui>	 kart_: can I deploy MW?
[06:46:26] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:46:28] <wikibugs>	 (03PS1) 10Marostegui: pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915357
[06:46:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/915357 (owner: 10Marostegui)
[06:46:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org
[06:48:05] <kart_>	 marostegui: Yes yes. This was a quick staging one.
[06:48:10] <marostegui>	 cool! thanks!
[06:48:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui)
[06:48:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[06:49:01] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915150 (https://phabricator.wikimedia.org/T334722) (owner: 10Marostegui)
[06:49:54] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]]
[06:49:57] <stashbot>	 T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722
[06:51:16] <icinga-wm>	 RECOVERY - Check systemd state on krb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:29] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[06:52:11] <marostegui>	 !log Promote pc2014 as pc1 master codfw dbmaint - T334722
[06:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[06:54:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org
[06:54:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[06:54:46] <urbanecm>	 Good morning apergos et al: do we have trainees for the morning window? I plan to deploy a bunch of stuff (I'll put them on the calendar ASAP), so that's why I'm asking.
[06:55:10] <apergos>	 we don't have any trainees signed up, no
[06:55:29] <apergos>	 your patches would be the only ones on the calendar
[06:55:45] <urbanecm>	 Ack, thanks. I'll take over the window then. 
[06:56:15] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[06:56:21] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[06:56:25] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914302 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[06:56:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914303 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[06:56:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914304 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[06:56:37] <apergos>	 ok, enjoy!
[06:56:43] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[06:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[06:57:17] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:915150|ProductionServices.php: Promote pc2014 to pc1 master (T334722)]] (duration: 07m 23s)
[06:57:21] <stashbot>	 T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722
[06:57:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[06:58:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on pc2011.codfw.wmnet with reason: Onsite maintenance T334722
[06:58:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org
[06:58:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on pc2011.codfw.wmnet with reason: Onsite maintenance T334722
[06:58:53] <wikibugs>	 (03PS1) 10Majavah: hieradata: swap dumps_dist_active_* params [puppet] - 10https://gerrit.wikimedia.org/r/915362
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T0700).
[07:00:57] <apergos>	 no trainees are signed up for the window and urbanecm has several patches to be scheduled for deployment, self-deploying I assume, so that's how this morning's window will go.
[07:01:40] <urbanecm>	 Yep yep, waiting on CI atm. 
[07:01:56] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) @Jhancock.wm pc2011 is now OFF, so you can work on it whenever you want.
[07:02:35] <apergos>	 ok, enjoy this morning's episode of Zuul Watch!
[07:02:57] <urbanecm>	 Thanks! :D
[07:07:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org
[07:08:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:11:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org
[07:13:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2003.wikimedia.org with OS bookworm
[07:13:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install bast2003 - https://phabricator.wikimedia.org/T334287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast2003.wikimedia.org with OS bookworm completed: - bast2003 (**WARN**)   - D...
[07:13:53] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add new LVS host lvs2011 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/914871 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh)
[07:15:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org
[07:16:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:16:44] <urbanecm>	 this episode of Zuul Watch ended with selenium failure. let's restart!
[07:16:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "rerun..." [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:18:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Assign bastion role to bast2003 [puppet] - 10https://gerrit.wikimedia.org/r/915363 (https://phabricator.wikimedia.org/T334287)
[07:21:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org
[07:22:52] <apergos>	 perhaps if you whack the tv on its side the reception will come back in again ;-)
[07:24:06] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505)
[07:27:24] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] hieradata: swap dumps_dist_active_* params [puppet] - 10https://gerrit.wikimedia.org/r/915362 (owner: 10Majavah)
[07:28:27] <wikibugs>	 (03CR) 10Ayounsi: "Overall it looks good to me, but before approving it could you split this patch in 2: One for the bird/anycast.pp change and one for the c" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[07:29:07] <wikibugs>	 (03Merged) 10jenkins-bot: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914836 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:29:12] <wikibugs>	 (03Merged) 10jenkins-bot: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914302 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[07:29:16] <wikibugs>	 (03Merged) 10jenkins-bot: ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914303 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[07:29:18] <wikibugs>	 10SRE-OnFire, 10Traffic, 10conftool, 10serviceops, 10Sustainability (Incident Followup): Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10Joe) I'm frankly not sure how checking appserver.svc.eaqiad.wmnet:9090 from...
[07:29:24] <wikibugs>	 (03Merged) 10jenkins-bot: EditPage: Support preloading from i18n messages [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914304 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[07:29:31] <urbanecm>	 finally something
[07:29:57] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]]
[07:30:02] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[07:30:02] <stashbot>	 T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337
[07:31:28] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:33:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 293
[07:33:46] <urbanecm>	 patches work, continuing
[07:33:55] <urbanecm>	 and (still) waiting for CI on the remaining few patches
[07:34:04] <urbanecm>	 hopefully will be faster
[07:34:25] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.43 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[07:34:32] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 293
[07:35:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 134823
[07:37:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 134823
[07:37:55] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914836|Mentor dashboard: Move away from alpha/beta/stable (T334630)]], [[gerrit:914302|EditPage: Support preloading from i18n messages (T330337)]], [[gerrit:914303|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914304|EditPage: Support preloading from i18n messages (T330337)]] (duration: 07m 58s)
[07:37:59] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[07:37:59] <stashbot>	 T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337
[07:38:35] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: (2) Alert for device asw-a-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[07:45:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[07:45:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:46:50] <wikibugs>	 (03Merged) 10jenkins-bot: ApiVisualEditor: Support preloading from i18n messages [extensions/VisualEditor] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914305 (https://phabricator.wikimedia.org/T330337) (owner: 10Urbanecm)
[07:47:16] <urbanecm>	 looks like i might be able to finish all patches in time after all :)
[07:48:42] <wikibugs>	 (03Merged) 10jenkins-bot: Mentor dashboard: Move away from alpha/beta/stable [extensions/GrowthExperiments] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/914837 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:49:08] <wikibugs>	 (03CR) 10Ayounsi: "At first glance it looks good to me, but someone from traffic (or who knows more about DNS) needs to review it." [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez)
[07:49:15] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]]
[07:49:19] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[07:49:20] <stashbot>	 T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337
[07:49:26] <wikibugs>	 (03PS5) 10Urbanecm: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630)
[07:49:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:49:39] <wikibugs>	 (03PS5) 10Urbanecm: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630)
[07:50:18] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Remove GEMentorDashboardDeploymentMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914375 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:50:20] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:50:50] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:51:06] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] Deploy Personalized praise to AR, BN, CS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914393 (https://phabricator.wikimedia.org/T334630) (owner: 10Urbanecm)
[07:56:24] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914305|ApiVisualEditor: Support preloading from i18n messages (T330337)]], [[gerrit:914837|Mentor dashboard: Move away from alpha/beta/stable (T334630)]] (duration: 07m 08s)
[07:56:28] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[07:56:29] <stashbot>	 T330337: MediaWiki preloading query parameter do not allow to preload from i18n messages - https://phabricator.wikimedia.org/T330337
[07:56:40] <wikibugs>	 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Marostegui) I am working with MariaDB foundation to see if we can find more information about this. For now I am running `mariadb-check --check --extended --database enwiki` on both hosts and it is sh...
[07:56:47] <urbanecm>	 ahh! scap backport still has issues with dependencies :-/
[07:56:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1132.eqiad.wmnet with reason: Onsite maintenance T334722
[07:56:51] <stashbot>	 T334722: ManagementSSHDown - https://phabricator.wikimedia.org/T334722
[07:57:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1132.eqiad.wmnet with reason: Onsite maintenance T334722
[07:57:06] <urbanecm>	 fortunately, "continue with unexpected commits" works :)
[07:57:23] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]]
[07:57:41] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.32 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[07:58:52] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[08:01:44] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: add ml-staging among helmfile_namespace_certs's options [deployment-charts] - 10https://gerrit.wikimedia.org/r/914859 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[08:01:45] <urbanecm>	 okay, not exactly in time, but still almost :)
[08:01:48] <urbanecm>	 last sync in progress
[08:04:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:914393|[Growth] Deploy Personalized praise to AR, BN, CS (T334630)]] (duration: 07m 24s)
[08:04:51] <stashbot>	 T334630: Personalized praise: Deployment of the new mentor dashboard module to Growth Pilot Wikis - https://phabricator.wikimedia.org/T334630
[08:04:53] <urbanecm>	 done
[08:07:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:07:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:08:37] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:11:22] <wikibugs>	 (03PS1) 10Elukey: admin_ng: remove tls hostname override for ores-legacy-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/915416 (https://phabricator.wikimedia.org/T335756)
[08:13:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I would suggest to rename this to cronjob instead of job as there are plain job objects in k8s as well, so the name is a bit misleading." [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto)
[08:14:11] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[08:14:51] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.33 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:15:58] <wikibugs>	 (03CR) 10Jelto: "lgtm, but what about profile::query_service::gui_url?" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn)
[08:16:55] <apergos>	 looks like there's a long break before the next window so it's all good
[08:18:22] <urbanecm>	 i overran by just four minutes :)
[08:18:57] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:19:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[08:20:20] <wikibugs>	 (03PS1) 10ArielGlenn: Add dump user subdirectories to support testing of new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232)
[08:20:22] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: networkpolicy for metrics-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/915364 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[08:22:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Looks really nice and simpler than what we have, I am in favor to proceed :) Since it is a new thing, I guess that we could let it bake fo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm)
[08:25:36] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Add a postgresql database and user for airflow_analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/911296 (https://phabricator.wikimedia.org/T333000) (owner: 10Stevemunene)
[08:30:35] <wikibugs>	 (03PS1) 10Volans: decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434
[08:31:13] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:31:37] <wikibugs>	 (03CR) 10Volans: "Thanks for finding the error and sending this patch. I proposed a slightly different one with additional tests to catch this use case in I" [software/spicerack] - 10https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) (owner: 10EoghanGaffney)
[08:32:10] <wikibugs>	 (03PS1) 10ArielGlenn: add nfs tester to dumps worker (snapshot) testbed role [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232)
[08:34:46] <wikibugs>	 (03CR) 10Elukey: "Checked all the kubernetes.yaml config (IPs, etc..) and everything checks out (also way tidier than before!)" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[08:35:01] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:37:54] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes1006.eqiad.wmnet
[08:37:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes1006.eqiad.wmnet
[08:37:56] <wikibugs>	 (03PS2) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[08:38:54] <wikibugs>	 (03PS3) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[08:40:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[08:40:25] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[08:40:40] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915437/41033/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[08:41:23] <wikibugs>	 (03PS1) 10ArielGlenn: create custom db list files for testing of nfs shares for xml dumps [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232)
[08:43:58] <wikibugs>	 (03CR) 10Elukey: "Did a quick pass and added two comments, but the work is really good and we should really push to deploy it before it gets stale." [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[08:46:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:46:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:47:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: remove tls hostname override for ores-legacy-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/915416 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[08:47:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:47:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[08:47:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:47:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:47:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T335838)', diff saved to https://phabricator.wikimedia.org/P47481 and previous config saved to /var/cache/conftool/dbconfig/20230504-084741-ladsgroup.json
[08:47:59] <icinga-wm>	 PROBLEM - Host wcqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:49:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:49:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:49:44] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915447/41035/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[08:49:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[08:50:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47482 and previous config saved to /var/cache/conftool/dbconfig/20230504-085008-ladsgroup.json
[08:50:35] <wikibugs>	 (03PS1) 10ArielGlenn: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232)
[08:51:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T335838)', diff saved to https://phabricator.wikimedia.org/P47483 and previous config saved to /var/cache/conftool/dbconfig/20230504-085151-ladsgroup.json
[08:52:21] <wikibugs>	 (03PS1) 10JMeybohm: Copy configuration_1.1.1 to configuration_1.1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[08:52:23] <wikibugs>	 (03PS1) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[08:52:39] <wikibugs>	 (03PS1) 10KartikMistry: Update MinT to 2023-05-04-084420-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472)
[08:54:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Assign bastion role to bast2003 [puppet] - 10https://gerrit.wikimedia.org/r/915363 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[08:56:11] <wikibugs>	 (03PS2) 10Volans: decorators: fix dry_run detection [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855)
[08:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47484 and previous config saved to /var/cache/conftool/dbconfig/20230504-085637-ladsgroup.json
[08:56:44] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:56:55] <wikibugs>	 (03Abandoned) 10KartikMistry: machinetranslation: Fix gunicorn workers setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/912298 (owner: 10KartikMistry)
[08:59:01] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915455/41036/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[08:59:51] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.85 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:59:53] <wikibugs>	 (03PS2) 10KartikMistry: Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472)
[09:01:37] <kart_>	 Deploying to MinT (staging only) ^^
[09:02:15] <wikibugs>	 (03PS1) 10ArielGlenn: add a custom xml dumps config file for testing new nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232)
[09:02:21] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) (owner: 10KartikMistry)
[09:03:16] <wikibugs>	 (03Merged) 10jenkins-bot: Update MinT to 2023-05-04-085722-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/915459 (https://phabricator.wikimedia.org/T335472) (owner: 10KartikMistry)
[09:04:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[09:05:01] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] "This is great, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/915434 (https://phabricator.wikimedia.org/T335855) (owner: 10Volans)
[09:05:10] <wikibugs>	 (03Abandoned) 10EoghanGaffney: [spicerack/decorators] Don't miss dry_run if it's disabled in kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/914923 (https://phabricator.wikimedia.org/T335855) (owner: 10EoghanGaffney)
[09:05:58] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: cookbooks.sre.ganeti.reimage: failure reported when first puppet run succeeds after a retry - https://phabricator.wikimedia.org/T335863 (10Volans)
[09:06:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (10Volans)
[09:06:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:41] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[09:06:43] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/915463/41038/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/915463 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:06:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P47485 and previous config saved to /var/cache/conftool/dbconfig/20230504-090657-ladsgroup.json
[09:07:34] <wikibugs>	 (03PS18) 10ArielGlenn: add documentation on commands to run for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232)
[09:07:39] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.24 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[09:07:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:08:03] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet
[09:11:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:11:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P47486 and previous config saved to /var/cache/conftool/dbconfig/20230504-091143-ladsgroup.json
[09:12:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw
[09:12:25] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759)
[09:13:38] <wikibugs>	 (03PS1) 10Elukey: admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465
[09:14:23] <wikibugs>	 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10jijiki)
[09:14:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet
[09:17:05] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.31 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[09:17:28] <wikibugs>	 (03CR) 10Volans: "Thanks for the patch, but with PCC it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:18:27] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: profile::bird::anycast: allow setting the BGP IP address from the profile [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760)
[09:20:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw
[09:21:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: profile::bird::anycast: allow setting the BGP IP address from the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[09:21:23] <wikibugs>	 (03PS4) 10Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - 10https://gerrit.wikimedia.org/r/915461
[09:21:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:21:41] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/913164/41041/snapshot1009.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/913164 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:22:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P47487 and previous config saved to /var/cache/conftool/dbconfig/20230504-092203-ladsgroup.json
[09:22:09] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.24 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[09:22:17] <wikibugs>	 (03PS4) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[09:23:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41044/console" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:24:11] <wikibugs>	 (03PS5) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[09:25:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41046/console" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:25:48] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:25:55] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:26:28] <wikibugs>	 (03PS6) 10Jbond: install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878
[09:26:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:26:47] <wikibugs>	 (03CR) 10Jbond: install_server: improve readability of netmask logic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:26:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P47488 and previous config saved to /var/cache/conftool/dbconfig/20230504-092649-ladsgroup.json
[09:26:59] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.51 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[09:27:22] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/915476 (https://phabricator.wikimedia.org/T335760)
[09:28:10] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: increase thumbor resource limits, eqiad replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/914737 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:30:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:34:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1114 T335837', diff saved to https://phabricator.wikimedia.org/P47490 and previous config saved to /var/cache/conftool/dbconfig/20230504-093419-ladsgroup.json
[09:34:23] <stashbot>	 T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837
[09:35:59] <wikibugs>	 (03PS2) 10Jbond: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866
[09:36:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:36:35] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:37:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T335838)', diff saved to https://phabricator.wikimedia.org/P47491 and previous config saved to /var/cache/conftool/dbconfig/20230504-093710-ladsgroup.json
[09:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[09:37:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[09:37:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47492 and previous config saved to /var/cache/conftool/dbconfig/20230504-093733-ladsgroup.json
[09:37:58] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for cloudbackup1001-dev.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001
[09:38:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond)
[09:38:36] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cloudbackup1001-dev.eqiad.wmnet: Renew puppet certificate - jbond@cumin1001
[09:38:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[09:38:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[09:41:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T335838)', diff saved to https://phabricator.wikimedia.org/P47493 and previous config saved to /var/cache/conftool/dbconfig/20230504-094156-ladsgroup.json
[09:42:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:42:09] <wikibugs>	 (03PS1) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486
[09:42:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:42:17] <wikibugs>	 (03PS1) 10Ladsgroup: instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837)
[09:42:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47494 and previous config saved to /var/cache/conftool/dbconfig/20230504-094221-ladsgroup.json
[09:42:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47495 and previous config saved to /var/cache/conftool/dbconfig/20230504-094253-ladsgroup.json
[09:42:58] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505)
[09:44:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/915476 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[09:44:24] <icinga-wm>	 RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 4.892 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[09:45:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] install_server: improve readability of netmask logic [puppet] - 10https://gerrit.wikimedia.org/r/914878 (owner: 10Jbond)
[09:45:53] <wikibugs>	 (03PS1) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350)
[09:47:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[09:47:46] <wikibugs>	 (03PS2) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350)
[09:47:51] <wikibugs>	 (03PS2) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486
[09:47:56] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet
[09:48:15] <wikibugs>	 (03PS2) 10Ladsgroup: instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837)
[09:48:31] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] instances.yaml: Remove db1114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/915487 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup)
[09:48:33] <wikibugs>	 (03PS3) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350)
[09:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47496 and previous config saved to /var/cache/conftool/dbconfig/20230504-094850-ladsgroup.json
[09:49:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1114 from dbctl T335837', diff saved to https://phabricator.wikimedia.org/P47497 and previous config saved to /var/cache/conftool/dbconfig/20230504-094945-ladsgroup.json
[09:49:48] <stashbot>	 T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837
[09:52:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[09:53:00] <wikibugs>	 (03PS1) 10Volans: Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378
[09:53:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans)
[09:53:37] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505)
[09:53:39] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505)
[09:55:36] <wikibugs>	 (03PS2) 10Volans: Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378
[09:55:54] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans)
[09:56:22] <wikibugs>	 (03CR) 10Jelto: "Thanks for the addition! Looks mostly good, two suggestions in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney)
[09:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P47498 and previous config saved to /var/cache/conftool/dbconfig/20230504-095800-ladsgroup.json
[10:00:05] <jouncebot>	 mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1000)
[10:00:30] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837)
[10:01:34] <wikibugs>	 (03CR) 10Volans: "Answer/question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/914748 (owner: 10EoghanGaffney)
[10:01:37] <wikibugs>	 (03CR) 10Ayounsi: netbox: run the rqworker command as netbox user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:02:15] <wikibugs>	 (03PS3) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486
[10:02:19] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[10:02:27] <wikibugs>	 (03CR) 10Volans: "addressed comment" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:03:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P47499 and previous config saved to /var/cache/conftool/dbconfig/20230504-100357-ladsgroup.json
[10:04:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Revert "python_deploy: set the setgid bit on the git clone" [puppet] - 10https://gerrit.wikimedia.org/r/915378 (owner: 10Volans)
[10:05:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet
[10:05:59] <wikibugs>	 (03PS2) 10Elukey: admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465
[10:06:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[10:06:44] <wikibugs>	 (03CR) 10Elukey: "Not the best approach of the world but I think it should work fine for the moment, since inference-staging is the only "custom" ingress se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465 (owner: 10Elukey)
[10:06:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:06:56] <wikibugs>	 (03Merged) 10jenkins-bot: Ship a prometheus-statsd-export configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/915493 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[10:07:08] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837)
[10:07:12] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915494 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup)
[10:07:50] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "mariadb: Remove puppet entries for db1114" [puppet] - 10https://gerrit.wikimedia.org/r/915379
[10:07:57] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "mariadb: Remove puppet entries for db1114" [puppet] - 10https://gerrit.wikimedia.org/r/915379 (owner: 10Ladsgroup)
[10:08:02] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505)
[10:08:20] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:08:53] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915380 (https://phabricator.wikimedia.org/T335837)
[10:09:17] <claime>	 BGP alert expected
[10:10:44] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[10:10:49] <wikibugs>	 (03PS3) 10Jbond: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866
[10:11:11] <wikibugs>	 (03PS4) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486
[10:11:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1114.eqiad.wmnet
[10:11:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[10:12:14] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: Remove args, document env vars [deployment-charts] - 10https://gerrit.wikimedia.org/r/915488 (https://phabricator.wikimedia.org/T331505) (owner: 10Alexandros Kosiaris)
[10:12:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:13:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P47500 and previous config saved to /var/cache/conftool/dbconfig/20230504-101306-ladsgroup.json
[10:15:17] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10phaultfinder)
[10:16:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet
[10:16:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:16:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[10:16:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[10:16:59] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:17:34] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet
[10:17:40] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[10:19:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P47501 and previous config saved to /var/cache/conftool/dbconfig/20230504-101903-ladsgroup.json
[10:19:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1114.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:20:48] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:20:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1114.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:20:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:20:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1114.eqiad.wmnet
[10:21:11] <wikibugs>	 (03CR) 10Volans: "reply/question inline" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:21:27] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Remove puppet entries for db1114 [puppet] - 10https://gerrit.wikimedia.org/r/915380 (https://phabricator.wikimedia.org/T335837) (owner: 10Ladsgroup)
[10:21:45] <wikibugs>	 (03PS5) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324)
[10:23:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41049/console" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond)
[10:23:35] <Amir1>	 !log Removing db1114 from zarcillo T335837
[10:23:36] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudbackup100X-dev: set missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/915516
[10:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:38] <stashbot>	 T335837: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837
[10:24:12] <wikibugs>	 (03CR) 10Jbond: "i think it would be better if someone more familiar with dumps deployed this change so that they can confirm everything is deployed as exp" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond)
[10:25:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/915516 (owner: 10Arturo Borrero Gonzalez)
[10:26:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup100X-dev: set missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/915516 (owner: 10Arturo Borrero Gonzalez)
[10:26:22] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet
[10:27:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[10:27:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox: run the rqworker command as netbox user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:27:55] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Ladsgroup) a:05Ladsgroup→03wiki_willy
[10:28:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T335838)', diff saved to https://phabricator.wikimedia.org/P47502 and previous config saved to /var/cache/conftool/dbconfig/20230504-102812-ladsgroup.json
[10:28:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: complete ml-staging support in helmfile_namespace_certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915465 (owner: 10Elukey)
[10:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[10:28:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[10:28:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47503 and previous config saved to /var/cache/conftool/dbconfig/20230504-102835-ladsgroup.json
[10:28:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[10:29:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:30:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:30:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:34:01] <wikibugs>	 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Michael) I'm not fully sure who would be reviewing and deploying these changes. Maybe someone from the #sre team?
[10:34:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T335838)', diff saved to https://phabricator.wikimedia.org/P47504 and previous config saved to /var/cache/conftool/dbconfig/20230504-103409-ladsgroup.json
[10:34:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:34:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:34:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47505 and previous config saved to /var/cache/conftool/dbconfig/20230504-103434-ladsgroup.json
[10:34:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47506 and previous config saved to /var/cache/conftool/dbconfig/20230504-103459-ladsgroup.json
[10:35:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond)
[10:35:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
[10:35:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org
[10:36:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10darthmon_wmde)
[10:37:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10darthmon_wmde)
[10:39:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, nice! Thank you for looking into this." [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney)
[10:40:01] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet
[10:40:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org
[10:40:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3002.wikimedia.org
[10:40:28] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[10:41:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47507 and previous config saved to /var/cache/conftool/dbconfig/20230504-104107-ladsgroup.json
[10:42:31] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522
[10:43:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
[10:43:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522 (owner: 10Alexandros Kosiaris)
[10:43:58] <wikibugs>	 (03PS5) 10Volans: netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486
[10:44:27] <wikibugs>	 (03Merged) 10jenkins-bot: machinetranslation: Fix some prometheus-stats mappings [deployment-charts] - 10https://gerrit.wikimedia.org/r/915522 (owner: 10Alexandros Kosiaris)
[10:44:41] <wikibugs>	 (03CR) 10Volans: "skipped PrivateTmp" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:44:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3002.wikimedia.org
[10:47:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:47:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org
[10:48:38] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.ganeti.reimage for host aphlict2001.codfw.wmnet with OS bullseye
[10:48:48] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet
[10:50:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P47508 and previous config saved to /var/cache/conftool/dbconfig/20230504-105005-ladsgroup.json
[10:50:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] netbox: run the rqworker command as netbox user [puppet] - 10https://gerrit.wikimedia.org/r/915486 (owner: 10Volans)
[10:51:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[10:52:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org
[10:52:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:52:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Something else to note: on transient/spike errors the alert will auto-resolve once the spike is gone, mentioning it in case this is a prob" [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney)
[10:52:53] <wikibugs>	 (03PS1) 10Elukey: Rakefile: fix git branch check [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540
[10:53:05] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[10:53:35] <icinga-wm>	 PROBLEM - Kerberos KDC daemon on krb2002 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[10:54:05] <moritzm>	 ^krb2002 is me, WIP
[10:54:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112
[10:55:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112
[10:55:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] conftool-data: add config for the k8s ingress for ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914728 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[10:56:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P47509 and previous config saved to /var/cache/conftool/dbconfig/20230504-105613-ladsgroup.json
[10:56:35] <icinga-wm>	 RECOVERY - Kerberos KDC daemon on krb2002 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[10:57:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:59:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add conftool and service config for k8s-ingress-ml-staging [puppet] - 10https://gerrit.wikimedia.org/r/914735 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[11:01:34] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[11:03:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5713
[11:04:44] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[11:04:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5713
[11:05:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P47510 and previous config saved to /var/cache/conftool/dbconfig/20230504-110511-ladsgroup.json
[11:07:39] <Lucas_WMDE>	 jouncebot: now
[11:07:39] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 52 minute(s)
[11:07:58] <wikibugs>	 (03PS7) 10Ayounsi: profile::bird::anycast: allow setting the BGP IP address from the profile [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[11:08:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:19] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[11:08:22] <Lucas_WMDE>	 I’d like to deploy some backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/914298, and probably also for wmf.6) – shouldn’t have an effect yet but will then let us do a config change in the window later without having to wait for the backports first
[11:08:30] <Lucas_WMDE>	 if no one objects to that :)
[11:08:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:11:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P47511 and previous config saved to /var/cache/conftool/dbconfig/20230504-111119-ladsgroup.json
[11:11:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[11:13:45] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:13:47] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458)
[11:14:24] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aphlict2001.codfw.wmnet with OS bullseye
[11:15:18] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - volans@cumin1001
[11:15:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:47] <wikibugs>	 (03PS7) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782)
[11:16:25] <Lucas_WMDE>	 alright, I’ll get started then
[11:16:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große)
[11:18:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Make krb2002 available to Kerberos client [puppet] - 10https://gerrit.wikimedia.org/r/915569 (https://phabricator.wikimedia.org/T331695)
[11:20:06] <wikibugs>	 (03PS1) 10Jbond: ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722)
[11:20:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T335838)', diff saved to https://phabricator.wikimedia.org/P47512 and previous config saved to /var/cache/conftool/dbconfig/20230504-112017-ladsgroup.json
[11:20:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[11:20:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond)
[11:20:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[11:20:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47513 and previous config saved to /var/cache/conftool/dbconfig/20230504-112041-ladsgroup.json
[11:23:38] <wikibugs>	 (03PS2) 10Jbond: ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722)
[11:24:29] * kart_ updating cxserver
[11:24:43] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-05-03-044244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry)
[11:25:28] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2023-05-03-044244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry)
[11:25:45] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41052/console" [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond)
[11:26:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T335838)', diff saved to https://phabricator.wikimedia.org/P47514 and previous config saved to /var/cache/conftool/dbconfig/20230504-112625-ladsgroup.json
[11:26:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:26:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:26:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47515 and previous config saved to /var/cache/conftool/dbconfig/20230504-112650-ladsgroup.json
[11:27:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47516 and previous config saved to /var/cache/conftool/dbconfig/20230504-112705-ladsgroup.json
[11:27:23] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:27:43] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:30:47] <moritzm>	 !log installing curl security updates
[11:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:54] <moritzm>	 !log installing curl security updates (on buster)
[11:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:01] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[11:31:05] <wikibugs>	 (03PS2) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[11:31:07] <wikibugs>	 (03PS2) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[11:31:37] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[11:33:31] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:34:18] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[11:35:10] <wikibugs>	 (03PS8) 10KartikMistry: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782)
[11:35:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47518 and previous config saved to /var/cache/conftool/dbconfig/20230504-113529-ladsgroup.json
[11:35:38] <wikibugs>	 (03Merged) 10jenkins-bot: Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/914298 (https://phabricator.wikimedia.org/T300458) (owner: 10Michael Große)
[11:36:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]]
[11:36:10] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[11:38:02] <kart_>	 !log Updated cxserver to 2023-05-03-044244-production (T333835, T335019, T331505)
[11:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:08] <stashbot>	 T331505: Self hosted machine translation service - https://phabricator.wikimedia.org/T331505
[11:38:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[11:38:09] <stashbot>	 T333835: Disable machine translation for Cantonese - https://phabricator.wikimedia.org/T333835
[11:38:09] <stashbot>	 T335019: Post-creation work for fatwiki - https://phabricator.wikimedia.org/T335019
[11:38:35] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: (2) Alert for device asw-a-codfw.mgmt.codfw.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[11:38:45] <Lucas_WMDE>	 no change on https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11 yet (good), syncing
[11:40:19] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385
[11:40:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (owner: 10Arturo Borrero Gonzalez)
[11:40:43] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943)
[11:40:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez)
[11:42:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P47519 and previous config saved to /var/cache/conftool/dbconfig/20230504-114211-ladsgroup.json
[11:44:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914298|Fix output path of list=wbsubscribers API (T300458)]] (duration: 08m 24s)
[11:44:35] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[11:45:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:46:04] <Lucas_WMDE>	 jouncebot: next
[11:46:04] <jouncebot>	 In 1 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300)
[11:46:05] <jouncebot>	 In 1 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300)
[11:46:24] <Lucas_WMDE>	 then I’ll go ahead and do the wmf.6 backport too
[11:46:26] <Lucas_WMDE>	 (should be another noop)
[11:46:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[11:49:05] <wikibugs>	 (03PS1) 10Ayounsi: set rq==1.13.0 to workaround bug [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591
[11:50:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:50:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P47520 and previous config saved to /var/cache/conftool/dbconfig/20230504-115035-ladsgroup.json
[11:56:08] <wikibugs>	 (03PS1) 10Slyngshede: signup: allow blocking of username with regex [software/bitu] - 10https://gerrit.wikimedia.org/r/915592 (https://phabricator.wikimedia.org/T320806)
[11:56:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make krb2002 available to Kerberos client [puppet] - 10https://gerrit.wikimedia.org/r/915569 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[11:56:36] <wikibugs>	 (03PS3) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[11:56:38] <wikibugs>	 (03PS3) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[11:56:40] <wikibugs>	 (03PS1) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[11:57:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[11:57:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[11:57:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P47521 and previous config saved to /var/cache/conftool/dbconfig/20230504-115717-ladsgroup.json
[11:57:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm)
[12:02:44] <wikibugs>	 (03Merged) 10jenkins-bot: Fix output path of list=wbsubscribers API [extensions/Wikibase] (wmf/1.41.0-wmf.6) - 10https://gerrit.wikimedia.org/r/915384 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[12:03:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]]
[12:03:18] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[12:03:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2023-05-03-044244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/914468 (https://phabricator.wikimedia.org/T333835) (owner: 10KartikMistry)
[12:04:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[12:05:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P47522 and previous config saved to /var/cache/conftool/dbconfig/20230504-120542-ladsgroup.json
[12:06:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add MinT support to cxserver (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (https://phabricator.wikimedia.org/T335782) (owner: 10KartikMistry)
[12:08:28] <moritzm>	 !log installing libdatetime-timezone-perl updates
[12:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM minus a small piece of ruby style, but feel free to ignore" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey)
[12:10:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:915384|Fix output path of list=wbsubscribers API (T300458)]] (duration: 07m 43s)
[12:11:01] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[12:11:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:12:01] * Lucas_WMDE done
[12:12:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T335838)', diff saved to https://phabricator.wikimedia.org/P47523 and previous config saved to /var/cache/conftool/dbconfig/20230504-121224-ladsgroup.json
[12:12:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[12:12:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[12:12:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47524 and previous config saved to /var/cache/conftool/dbconfig/20230504-121247-ladsgroup.json
[12:16:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:20:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T335838)', diff saved to https://phabricator.wikimedia.org/P47525 and previous config saved to /var/cache/conftool/dbconfig/20230504-122048-ladsgroup.json
[12:20:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:21:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47526 and previous config saved to /var/cache/conftool/dbconfig/20230504-122114-ladsgroup.json
[12:22:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47527 and previous config saved to /var/cache/conftool/dbconfig/20230504-122237-ladsgroup.json
[12:22:47] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943)
[12:25:20] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943)
[12:27:05] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez)
[12:27:11] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] ":fingers_crossed:" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez)
[12:27:24] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m"" [puppet] - 10https://gerrit.wikimedia.org/r/915385 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez)
[12:30:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591 (owner: 10Ayounsi)
[12:31:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47528 and previous config saved to /var/cache/conftool/dbconfig/20230504-123103-ladsgroup.json
[12:31:13] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:33:38] <wikibugs>	 (03PS9) 10KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579
[12:34:05] <wikibugs>	 (03CR) 10KartikMistry: WIP: Add MinT support to cxserver (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry)
[12:34:27] <wikibugs>	 (03PS10) 10KartikMistry: WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579
[12:34:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add bast2003 to Bastion hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287)
[12:36:14] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] set rq==1.13.0 to workaround bug [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/915591 (owner: 10Ayounsi)
[12:38:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - ayounsi@cumin1001
[12:39:36] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9-wmf2 to netbox-next - ayounsi@cumin1001
[12:41:35] <wikibugs>	 (03PS1) 10Muehlenhoff: wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287)
[12:41:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640
[12:46:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47529 and previous config saved to /var/cache/conftool/dbconfig/20230504-124609-ladsgroup.json
[12:48:47] <moritzm>	 !log installing ruby-rack security updates
[12:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org
[12:50:07] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:52:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:52:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47530 and previous config saved to /var/cache/conftool/dbconfig/20230504-125250-ladsgroup.json
[12:53:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47531 and previous config saved to /var/cache/conftool/dbconfig/20230504-125309-ladsgroup.json
[12:54:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[12:54:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org
[12:54:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[12:55:07] <wikibugs>	 (03PS1) 10Btullis: Fail back hive services to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/915657
[12:56:19] <wikibugs>	 (03PS2) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[12:56:20] <wikibugs>	 (03PS4) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[12:56:22] <wikibugs>	 (03PS4) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[12:56:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[12:56:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[12:56:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm)
[12:57:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[12:57:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[12:59:40] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666)
[13:00:07] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300)
[13:00:07] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300). Please do the needful.
[13:00:07] <jouncebot>	 jan_drewniak and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:24] <Lucas_WMDE>	 o/
[13:00:46] <Lucas_WMDE>	 jan_drewniak: do you want to self-service or should I deploy?
[13:01:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47532 and previous config saved to /var/cache/conftool/dbconfig/20230504-130115-ladsgroup.json
[13:01:42] <jan_drewniak>	 Lucas_WMDE: yes, I can self-service :) this first patch might take 10min or so. We want to deploy Vector 2022 to eswiki, but make sure all is good with traffic after we do that
[13:01:49] <Lucas_WMDE>	 ok!
[13:02:03] <Lucas_WMDE>	 only one of my three changes is left btw, I did the backports earlier already
[13:02:04] <Lucas_WMDE>	 go ahead :)
[13:03:48] <jan_drewniak>	 Lucas_WMDE: I got a strange message, "Aborting: This scap command is disabled on this host". Which host do you use for deployments?
[13:03:57] <Lucas_WMDE>	 deployment.eqiad.wmnet
[13:04:02] <Lucas_WMDE>	 which I think is currently an alias for deploy1002
[13:04:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[13:04:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[13:04:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47533 and previous config saved to /var/cache/conftool/dbconfig/20230504-130432-ladsgroup.json
[13:05:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:05:04] <Lucas_WMDE>	 yay
[13:05:07] <jan_drewniak>	 Lucas_WMDE:  k, thanks :)
[13:05:27] <wikibugs>	 (03PS2) 10Jdrewniak: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686)
[13:05:38] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:06:12] <jan_drewniak>	 Amir1: Just FYI, we're starting the eswiki deployment
[13:06:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47534 and previous config saved to /var/cache/conftool/dbconfig/20230504-130616-ladsgroup.json
[13:06:25] <Amir1>	 thanks
[13:06:27] <wikibugs>	 (03Merged) 10jenkins-bot: [10%] Enable Vector 2022 as the default skin for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915040 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:06:33] <Amir1>	 I'm around for a bit
[13:06:54] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]]
[13:06:57] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:07:19] <jan_drewniak>	 Amir1:  thanks, we're doing 10% then 100%, 10% is going out now.
[13:07:23] <wikibugs>	 (03PS3) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[13:07:25] <Amir1>	 awesome
[13:07:25] <wikibugs>	 (03PS5) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[13:07:27] <wikibugs>	 (03PS5) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[13:07:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[13:07:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[13:07:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm)
[13:08:27] <Amir1>	 Gosh I can't wait for the zebra to be freed https://en.wikipedia.org/wiki/Polar_bear?VectorZebraDesign=1
[13:08:51] <jan_drewniak>	 Amir1:  thanks! It's almost there! 
[13:09:02] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:10:13] <wikibugs>	 (03CR) 10Herron: [C: 03+1] sre.hosts.reimage: improve failed first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/910461 (https://phabricator.wikimedia.org/T334880) (owner: 10Volans)
[13:10:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47535 and previous config saved to /var/cache/conftool/dbconfig/20230504-131054-ladsgroup.json
[13:11:25] <Amir1>	 <3
[13:13:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47536 and previous config saved to /var/cache/conftool/dbconfig/20230504-131302-ladsgroup.json
[13:15:09] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915040|[10%] Enable Vector 2022 as the default skin for eswiki (T335686)]] (duration: 08m 15s)
[13:15:13] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:15:24] <jan_drewniak>	 Amir1: ok we're at 10% on eswiki
[13:15:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:15:43] <Amir1>	 checking
[13:15:51] <Amir1>	 it's in s2 I think
[13:16:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47537 and previous config saved to /var/cache/conftool/dbconfig/20230504-131621-ladsgroup.json
[13:17:07] <Amir1>	 already got one
[13:17:09] <Amir1>	 nice
[13:17:30] <Amir1>	 traffic in mysql is going back to normal
[13:17:38] <Amir1>	 I suggest you can continue to 100%
[13:17:48] <jan_drewniak>	 Amir1: awesome! 
[13:18:16] <wikibugs>	 (03PS2) 10Jdrewniak: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686)
[13:18:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:20:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Display meta.wikimedia.org username, if authenticated, before linking - https://phabricator.wikimedia.org/T335955 (10SLyngshede-WMF)
[13:20:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:20:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Display meta.wikimedia.org username, if authenticated, before linking - https://phabricator.wikimedia.org/T335955 (10SLyngshede-WMF) p:05Triage→03Low
[13:20:52] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:21:19] <wikibugs>	 (03CR) 10Marostegui: "Did you get security to approve this?" [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:21:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47538 and previous config saved to /var/cache/conftool/dbconfig/20230504-132122-ladsgroup.json
[13:21:26] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[13:21:50] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector 2022 as the default skin on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915041 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:22:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2003.codfw.wmnet
[13:22:17] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]]
[13:22:21] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:22:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:22:24] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:23:02] <wikibugs>	 (03CR) 10ArielGlenn: dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond)
[13:23:34] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:23:43] <wikibugs>	 (03CR) 10Marostegui: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:23:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:23:48] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:24:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47539 and previous config saved to /var/cache/conftool/dbconfig/20230504-132439-ladsgroup.json
[13:26:00] <wikibugs>	 (03CR) 10Ladsgroup: mariadb: Allow new externallinks fields to be queried in wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:26:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P47540 and previous config saved to /var/cache/conftool/dbconfig/20230504-132600-ladsgroup.json
[13:26:05] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] mariadb: Allow new externallinks fields to be queried in wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/915662 (https://phabricator.wikimedia.org/T312666) (owner: 10Ladsgroup)
[13:26:07] <wikibugs>	 (03PS2) 10Elukey: Rakefile: fix git branch check [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540
[13:26:10] <wikibugs>	 (03CR) 10Elukey: Rakefile: fix git branch check (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey)
[13:26:14] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2003.codfw.wmnet
[13:26:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Rakefile: fix git branch check [deployment-charts] - 10https://gerrit.wikimedia.org/r/915540 (owner: 10Elukey)
[13:26:51] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335775 (10Jclark-ctr) 05Open→03Resolved Reseated power supply
[13:27:15] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335776 (10Jclark-ctr) 05Open→03Resolved Reseated power supply
[13:27:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2002.codfw.wmnet
[13:28:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P47541 and previous config saved to /var/cache/conftool/dbconfig/20230504-132809-ladsgroup.json
[13:30:17] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2002.codfw.wmnet
[13:30:19] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915041|Enable Vector 2022 as the default skin on eswiki (T335686)]] (duration: 08m 01s)
[13:30:22] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:31:01] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2001.codfw.wmnet
[13:31:21] <jan_drewniak>	 Alright that's eswiki!
[13:31:47] <wikibugs>	 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10herron) >>! In T334733#8823968, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (...
[13:32:07] <wikibugs>	 (03PS2) 10Jdrewniak: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686)
[13:32:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:33:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2001.codfw.wmnet
[13:33:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector 2022 as the default skin on frwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915042 (https://phabricator.wikimedia.org/T335686) (owner: 10Jdrewniak)
[13:33:29] <jan_drewniak>	 Lucas_WMDE: almost done, on the last patch...
[13:33:33] <Lucas_WMDE>	 ok!
[13:33:42] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Looks good, with the separate bird change as Arzhel suggested!" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[13:33:57] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]]
[13:34:00] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1006.eqiad.wmnet
[13:35:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond)
[13:35:36] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[13:35:39] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:36:05] <jan_drewniak>	 Amir1: thanks for checking traffic for us! We're live on eswiki 🎉
[13:36:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P47542 and previous config saved to /var/cache/conftool/dbconfig/20230504-133628-ladsgroup.json
[13:36:35] <wikibugs>	 (03PS1) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324)
[13:37:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:37:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[13:37:43] <wikibugs>	 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10elukey) ` elukey@kafka-logging1001:~$ kafka acls --list kafka-acls --authorizer-properties...
[13:37:55] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1006.eqiad.wmnet
[13:37:58] <elukey>	 !log revert "Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in kafka logging clusters - T334733"
[13:37:58] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: drop support for Gen13 idrac [cookbooks] - 10https://gerrit.wikimedia.org/r/914866 (owner: 10Jbond)
[13:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:01] <stashbot>	 T334733: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733
[13:38:13] <Amir1>	 wohoo
[13:38:20] <wikibugs>	 (03PS1) 10Jelto: miscweb: add annualreport release to miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171)
[13:38:43] <wikibugs>	 10ops-eqiad, 10decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (10Ladsgroup)
[13:38:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:39:05] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1005.eqiad.wmnet
[13:39:44] <wikibugs>	 (03PS6) 10Jbond: dumps::distribution::ferm: update to resolve hosts in puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324)
[13:39:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P47543 and previous config saved to /var/cache/conftool/dbconfig/20230504-133945-ladsgroup.json
[13:40:27] <wikibugs>	 (03CR) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[13:41:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41053/console" [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond)
[13:41:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P47544 and previous config saved to /var/cache/conftool/dbconfig/20230504-134106-ladsgroup.json
[13:41:44] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:915042|Enable Vector 2022 as the default skin on frwikinews (T335686)]] (duration: 07m 47s)
[13:41:47] <stashbot>	 T335686: Deploy Vector 2022 as the default desktop skin on Spanish Wikipedia and French Wikinews - https://phabricator.wikimedia.org/T335686
[13:41:57] <jan_drewniak>	 Lucas_WMDE: ok, finally done :)
[13:42:04] <Lucas_WMDE>	 \o/
[13:42:58] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1005.eqiad.wmnet
[13:42:58] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458)
[13:43:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P47545 and previous config saved to /var/cache/conftool/dbconfig/20230504-134315-ladsgroup.json
[13:43:29] <Lucas_WMDE>	 grmbl, scap backport gets confused by the Depends-On, as expected
[13:43:36] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagetcd1004.eqiad.wmnet
[13:43:41] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458)
[13:43:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] dumps::distribution::ferm: update to resolve hosts in puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/911338 (https://phabricator.wikimedia.org/T323324) (owner: 10Jbond)
[13:43:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:44:38] <wikibugs>	 (03Merged) 10jenkins-bot: Make wbsubscribers API output sensible on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914752 (https://phabricator.wikimedia.org/T300458) (owner: 10Lucas Werkmeister (WMDE))
[13:45:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]]
[13:45:07] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[13:45:14] <wikibugs>	 (03PS1) 10Eevans: aqs: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383)
[13:46:46] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans)
[13:47:05] <wikibugs>	 (03CR) 10Jelto: "Hi 👋 As discussed some time ago in T300171#8259774 this change adds a second static miscweb release. Does this make sense to you? Is it ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto)
[13:47:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jgreen)
[13:47:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:47:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:47:30] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd1004.eqiad.wmnet
[13:48:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (10Jgreen) a:05Jclark-ctr→03None
[13:48:19] <Lucas_WMDE>	 https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11 and https://test.wikidata.org/w/api.php?action=query&list=wbsubscribers&wblsentities=Q11&format=xmlfm look good on mwdebug, syncing
[13:48:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2006.codfw.wmnet
[13:48:26] <herron>	 !log switching to bullseye kafka monitoring hosts T335424
[13:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:29] <stashbot>	 T335424: kafkamon: upgrade to bullseye - https://phabricator.wikimedia.org/T335424
[13:48:43] <wikibugs>	 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frbast2002, frauth2002 - https://phabricator.wikimedia.org/T334505 (10Jgreen) a:05Jgreen→03None
[13:49:03] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kafkamon: cut over to bullseye exporters [puppet] - 10https://gerrit.wikimedia.org/r/914876 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[13:49:25] <wikibugs>	 10SRE, 10fundraising-tech-ops: Q3:rack/setup/install frmon2002 - https://phabricator.wikimedia.org/T334501 (10Jgreen) a:05Jgreen→03Dwisehaupt
[13:50:29] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] WIP: Add MinT support to cxserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/905579 (owner: 10KartikMistry)
[13:51:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47546 and previous config saved to /var/cache/conftool/dbconfig/20230504-135135-ladsgroup.json
[13:52:19] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2006.codfw.wmnet
[13:52:20] <sukhe>	 jouncebot: nowandnext
[13:52:20] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300)
[13:52:20] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1300)
[13:52:20] <jouncebot>	 In 0 hour(s) and 7 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400)
[13:53:28] <Lucas_WMDE>	 sukhe: I’m close to done, php-fpm-restart at 542%
[13:53:30] <Lucas_WMDE>	 *52%
[13:53:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2005.codfw.wmnet
[13:53:45] <sukhe>	 Lucas_WMDE: all good, not in a rush today! (no dc-ops on site waiting for me :)
[13:53:52] <sukhe>	 I will start once you are done
[13:53:52] <Lucas_WMDE>	 ok ^^
[13:54:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P47547 and previous config saved to /var/cache/conftool/dbconfig/20230504-135452-ladsgroup.json
[13:54:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:914752|Make wbsubscribers API output sensible on Test Wikidata (T300458)]] (duration: 09m 52s)
[13:54:59] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:54:59] <stashbot>	 T300458: [API] Inconsistencies in response of `list=wbsubscribers` API Query module - https://phabricator.wikimedia.org/T300458
[13:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47548 and previous config saved to /var/cache/conftool/dbconfig/20230504-135551-ladsgroup.json
[13:55:55] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Jhancock.wm) @Marostegui Thanks for that. Draining the flea power worked. the mgmt port is now active and I can login to the idrac remotely. You can bring it back online now.
[13:56:05] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2005.codfw.wmnet
[13:56:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T335845)', diff saved to https://phabricator.wikimedia.org/P47549 and previous config saved to /var/cache/conftool/dbconfig/20230504-135612-ladsgroup.json
[13:56:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[13:56:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[13:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47550 and previous config saved to /var/cache/conftool/dbconfig/20230504-135637-ladsgroup.json
[13:56:52] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) Thanks! Can you power on the host for me?
[13:57:39] <Lucas_WMDE>	 sukhe: you’re good to go as far as I’m concerned
[13:57:47] <sukhe>	 thanks Lucas_WMDE!
[13:57:51] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd2004.codfw.wmnet
[13:58:20] <logmsgbot>	 !log sukhe@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767
[13:58:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T335838)', diff saved to https://phabricator.wikimedia.org/P47551 and previous config saved to /var/cache/conftool/dbconfig/20230504-135821-ladsgroup.json
[13:58:23] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[13:58:25] <icinga-wm>	 PROBLEM - MediaWiki EtcdConfig up-to-date on mw1361 is CRITICAL: etcd last index (1915005) is outdated compared to the master one (1915008) https://wikitech.wikimedia.org/wiki/Etcd
[13:58:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[13:58:27] <icinga-wm>	 PROBLEM - MediaWiki EtcdConfig up-to-date on mw1379 is CRITICAL: etcd last index (1915005) is outdated compared to the master one (1915008) https://wikitech.wikimedia.org/wiki/Etcd
[13:58:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:58:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[13:58:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47552 and previous config saved to /var/cache/conftool/dbconfig/20230504-135845-ladsgroup.json
[13:59:59] <icinga-wm>	 RECOVERY - MediaWiki EtcdConfig up-to-date on mw1361 is OK: etcd last index (1915011) matches the master one (1915011) https://wikitech.wikimedia.org/wiki/Etcd
[13:59:59] <icinga-wm>	 RECOVERY - MediaWiki EtcdConfig up-to-date on mw1379 is OK: etcd last index (1915011) matches the master one (1915011) https://wikitech.wikimedia.org/wiki/Etcd
[14:00:05] <jouncebot>	 sukhe: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for LVS maintenance deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400).
[14:00:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47553 and previous config saved to /var/cache/conftool/dbconfig/20230504-140012-ladsgroup.json
[14:01:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd2004.codfw.wmnet
[14:03:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47554 and previous config saved to /var/cache/conftool/dbconfig/20230504-140308-ladsgroup.json
[14:03:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd1006.eqiad.wmnet
[14:03:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: (2) Device asw-a-codfw.mgmt.codfw.wmnet recovered from Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[14:04:46] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet
[14:06:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47555 and previous config saved to /var/cache/conftool/dbconfig/20230504-140634-ladsgroup.json
[14:07:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Fail back hive services to the primary coordinator [dns] - 10https://gerrit.wikimedia.org/r/915657 (owner: 10Btullis)
[14:07:17] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1006.eqiad.wmnet
[14:08:28] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd1005.eqiad.wmnet
[14:08:50] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] aqs: upgrade cluster to Cassandra 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915675 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans)
[14:09:39] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet
[14:09:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47556 and previous config saved to /var/cache/conftool/dbconfig/20230504-140958-ladsgroup.json
[14:10:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[14:10:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[14:10:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47557 and previous config saved to /var/cache/conftool/dbconfig/20230504-141024-ladsgroup.json
[14:10:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P47558 and previous config saved to /var/cache/conftool/dbconfig/20230504-141057-ladsgroup.json
[14:11:38] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet
[14:12:20] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[14:12:23] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[14:12:31] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1005.eqiad.wmnet
[14:13:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubetcd1004.eqiad.wmnet
[14:15:32] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1037.eqiad.wmnet
[14:17:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubetcd1004.eqiad.wmnet
[14:17:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47559 and previous config saved to /var/cache/conftool/dbconfig/20230504-141749-ladsgroup.json
[14:18:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P47560 and previous config saved to /var/cache/conftool/dbconfig/20230504-141814-ladsgroup.json
[14:18:44] <wikibugs>	 10SRE, 10vm-requests, 10SRE Observability (FY2022/2023-Q4): 2 VMs for kafkamon - https://phabricator.wikimedia.org/T335426 (10herron) 05Open→03Resolved
[14:20:02] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10phaultfinder)
[14:20:32] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus)
[14:21:22] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs2011: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/914856 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh)
[14:21:33] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM but let's let the new recording rule settle and give this query one last test before deploying" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/914946 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus)
[14:21:41] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet
[14:21:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P47561 and previous config saved to /var/cache/conftool/dbconfig/20230504-142140-ladsgroup.json
[14:21:43] <sukhe>	 urandom: ok to merge your change?
[14:21:46] <sukhe>	 Eevans: aqs: upgrade cluster to Cassandra 3.11.14 (e3566e48aa)
[14:24:46] <wikibugs>	 (03PS1) 10Elukey: custom_deploy.d: fix istio config for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/915685 (https://phabricator.wikimedia.org/T335756)
[14:25:49] <wikibugs>	 10SRE-Access-Requests: Update access permissions - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF)
[14:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P47562 and previous config saved to /var/cache/conftool/dbconfig/20230504-142604-ladsgroup.json
[14:27:56] <wikibugs>	 (03PS1) 10David Caro: k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888)
[14:28:06] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 12): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) @Eevans How can we help move this along?
[14:28:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] custom_deploy.d: fix istio config for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/915685 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey)
[14:28:26] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet
[14:28:54] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet
[14:29:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:31:30] <wikibugs>	 (03CR) 10David Caro: "Tested with a venv on tools:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro)
[14:32:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P47563 and previous config saved to /var/cache/conftool/dbconfig/20230504-143255-ladsgroup.json
[14:33:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P47564 and previous config saved to /var/cache/conftool/dbconfig/20230504-143320-ladsgroup.json
[14:34:27] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.cassandra.roll-restart (exit_code=97) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[14:34:30] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[14:34:46] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet
[14:34:58] <wikibugs>	 10SRE, 10Data-Engineering, 10SRE Observability, 10Event-Platform Value Stream (Sprint 12): Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10Ottomata) Yikes, thank you, yes let's delete ACLs for kafka logging.  I'm guessing that by...
[14:35:19] <wikibugs>	 (03PS1) 10SBassett: Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940)
[14:35:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[14:35:51] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[14:36:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro)
[14:36:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P47565 and previous config saved to /var/cache/conftool/dbconfig/20230504-143647-ladsgroup.json
[14:36:47] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[14:38:21] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: ManagementSSHDown - https://phabricator.wikimedia.org/T334722 (10Marostegui) 05Open→03Resolved Host back up and the idrac is indeed up too! Thanks @Jhancock.wm!
[14:38:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet
[14:39:34] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett)
[14:40:12] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2011.codfw.wmnet with OS bullseye
[14:40:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w...
[14:40:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[14:41:06] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[14:41:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47566 and previous config saved to /var/cache/conftool/dbconfig/20230504-144110-ladsgroup.json
[14:41:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:41:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:45:48] <wikibugs>	 (03PS1) 10Herron: kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424)
[14:45:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:46:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:46:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:46:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47567 and previous config saved to /var/cache/conftool/dbconfig/20230504-144625-ladsgroup.json
[14:46:52] <sukhe>	 jouncebot: next
[14:46:52] <jouncebot>	 In 1 hour(s) and 13 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600)
[14:47:10] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[14:47:19] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w...
[14:47:35] <wikibugs>	 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10fgiunchedi)
[14:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P47568 and previous config saved to /var/cache/conftool/dbconfig/20230504-144801-ladsgroup.json
[14:48:08] <icinga-wm>	 PROBLEM - puppet last run on prometheus6002 is CRITICAL: CRITICAL: Puppet has been disabled for 604863 seconds, message: Prometheus instances in drmrs dont have a replica label set, causing Thanos to ingest duplicate data - T335406 - denisse, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:48:09] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet
[14:48:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T335845)', diff saved to https://phabricator.wikimedia.org/P47569 and previous config saved to /var/cache/conftool/dbconfig/20230504-144827-ladsgroup.json
[14:48:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[14:48:33] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder)
[14:48:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[14:48:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47570 and previous config saved to /var/cache/conftool/dbconfig/20230504-144852-ladsgroup.json
[14:49:07] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett)
[14:49:21] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro)
[14:50:09] <wikibugs>	 (03Merged) 10jenkins-bot: k8s: don't set command in the container if empty [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/915689 (https://phabricator.wikimedia.org/T335888) (owner: 10David Caro)
[14:50:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] ssh::publish_fingerprints: Allow entries with no ip address [puppet] - 10https://gerrit.wikimedia.org/r/915570 (https://phabricator.wikimedia.org/T334722) (owner: 10Jbond)
[14:51:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47571 and previous config saved to /var/cache/conftool/dbconfig/20230504-145153-ladsgroup.json
[14:52:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47572 and previous config saved to /var/cache/conftool/dbconfig/20230504-145251-ladsgroup.json
[14:53:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47573 and previous config saved to /var/cache/conftool/dbconfig/20230504-145307-ladsgroup.json
[14:53:34] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T330930 (10phaultfinder)
[14:54:02] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet
[14:56:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47574 and previous config saved to /var/cache/conftool/dbconfig/20230504-145627-ladsgroup.json
[14:59:06] <icinga-wm>	 PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100%
[15:00:57] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695
[15:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47575 and previous config saved to /var/cache/conftool/dbconfig/20230504-150307-ladsgroup.json
[15:03:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[15:03:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet
[15:03:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[15:03:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[15:03:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[15:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47576 and previous config saved to /var/cache/conftool/dbconfig/20230504-150336-ladsgroup.json
[15:03:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host lvs2011.codfw.wmnet
[15:03:49] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet
[15:03:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[15:04:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640 (owner: 10Muehlenhoff)
[15:07:24] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet
[15:07:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P47578 and previous config saved to /var/cache/conftool/dbconfig/20230504-150758-ladsgroup.json
[15:08:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47579 and previous config saved to /var/cache/conftool/dbconfig/20230504-150813-ladsgroup.json
[15:08:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:10:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47580 and previous config saved to /var/cache/conftool/dbconfig/20230504-151000-ladsgroup.json
[15:11:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P47581 and previous config saved to /var/cache/conftool/dbconfig/20230504-151133-ladsgroup.json
[15:13:51] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet
[15:16:12] <wikibugs>	 (03PS1) 10Dzahn: add project language 'gpe', Ghanaian Pidgin [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969)
[15:18:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "approved by langcom" [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969) (owner: 10Dzahn)
[15:18:15] <wikibugs>	 (03PS2) 10Dzahn: add project language 'gpe', Ghanaian Pidgin [dns] - 10https://gerrit.wikimedia.org/r/915696 (https://phabricator.wikimedia.org/T335969)
[15:19:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: install dhcp client [puppet] - 10https://gerrit.wikimedia.org/r/915697
[15:19:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: mark git repo dirs as safe. [puppet] - 10https://gerrit.wikimedia.org/r/915698
[15:21:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: mark git repo dirs as safe. [puppet] - 10https://gerrit.wikimedia.org/r/915698 (owner: 10Filippo Giunchedi)
[15:21:13] <mutante>	 !log adding new project language 'gpe' - https://en.wikipedia.org/wiki/Ghanaian_Pidgin_English
[15:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: install dhcp client [puppet] - 10https://gerrit.wikimedia.org/r/915697 (owner: 10Filippo Giunchedi)
[15:22:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I checked and both proxies point to the same hosts: db1164/db1217" [dns] - 10https://gerrit.wikimedia.org/r/915695 (owner: 10Marostegui)
[15:23:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add bast2003 to Bastion hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/915631 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[15:23:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P47582 and previous config saved to /var/cache/conftool/dbconfig/20230504-152304-ladsgroup.json
[15:23:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695 (owner: 10Marostegui)
[15:23:16] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/915695
[15:23:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P47583 and previous config saved to /var/cache/conftool/dbconfig/20230504-152319-ladsgroup.json
[15:24:40] <marostegui>	 !log Failover m1-master from dbproxy1012 to dbproxy1014
[15:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:06] <wikibugs>	 (03CR) 10Dzahn: "good catch, I wonder how I managed to do this since I did a search/replace globally in my editor :p" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn)
[15:25:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P47584 and previous config saved to /var/cache/conftool/dbconfig/20230504-152506-ladsgroup.json
[15:26:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P47585 and previous config saved to /var/cache/conftool/dbconfig/20230504-152640-ladsgroup.json
[15:26:53] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] wmf-laptop-sre: Add bast2003 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915639 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[15:27:13] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet
[15:29:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:29:20] <wikibugs>	 (03CR) 10Muehlenhoff: "One note inline, rest looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[15:29:28] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove decommed bastions from ssh config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/915640 (owner: 10Muehlenhoff)
[15:29:42] <sukhe>	 jouncebot: next
[15:29:42] <jouncebot>	 In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600)
[15:30:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[15:31:38] <icinga-wm>	 RECOVERY - Host mc2040 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms
[15:32:24] <wikibugs>	 (03PS4) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[15:32:34] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on moscovium.eqiad.wmnet with reason: reboot
[15:32:44] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[15:32:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on moscovium.eqiad.wmnet with reason: reboot
[15:33:08] <mutante>	 !log moscovium (https://rt.wikimedia.org) - rebooting
[15:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet
[15:34:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:35:40] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet
[15:38:02] <mutante>	 !log doc2002 - rebooting
[15:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T335845)', diff saved to https://phabricator.wikimedia.org/P47586 and previous config saved to /var/cache/conftool/dbconfig/20230504-153810-ladsgroup.json
[15:38:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[15:38:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T335838)', diff saved to https://phabricator.wikimedia.org/P47587 and previous config saved to /var/cache/conftool/dbconfig/20230504-153825-ladsgroup.json
[15:38:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[15:38:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[15:38:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47588 and previous config saved to /var/cache/conftool/dbconfig/20230504-153834-ladsgroup.json
[15:38:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[15:38:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47589 and previous config saved to /var/cache/conftool/dbconfig/20230504-153850-ladsgroup.json
[15:40:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P47590 and previous config saved to /var/cache/conftool/dbconfig/20230504-154012-ladsgroup.json
[15:40:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47591 and previous config saved to /var/cache/conftool/dbconfig/20230504-154021-ladsgroup.json
[15:41:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T335845)', diff saved to https://phabricator.wikimedia.org/P47592 and previous config saved to /var/cache/conftool/dbconfig/20230504-154146-ladsgroup.json
[15:41:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[15:42:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[15:42:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47593 and previous config saved to /var/cache/conftool/dbconfig/20230504-154211-ladsgroup.json
[15:43:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47594 and previous config saved to /var/cache/conftool/dbconfig/20230504-154344-ladsgroup.json
[15:43:54] <wikibugs>	 (03PS1) 10Chad: WIP: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/915701 (https://phabricator.wikimedia.org/T320390)
[15:43:55] <sbassett>	 Hey all - I’d like to deploy a quick config backport if I can: https://gerrit.wikimedia.org/r/914823.  Please let me know if I should wait.
[15:45:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet
[15:46:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47595 and previous config saved to /var/cache/conftool/dbconfig/20230504-154630-ladsgroup.json
[15:47:04] <sukhe>	 sbassett: yes please, deploys are currently locked
[15:47:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet
[15:47:25] <sukhe>	 we were trying to do an LVS reimage and it is stalled. so let me revert the patch and I will let you know when it's done
[15:47:42] <mutante>	 jouncebot: nowandnext
[15:47:42] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): LVS maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1400)
[15:47:42] <jouncebot>	 In 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600)
[15:47:57] <sukhe>	 sbassett: is it urgent?
[15:48:17] <wikibugs>	 (03PS2) 10Herron: kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424)
[15:48:20] <mutante>	 the puppet request window is also empty, so it can be maintenance for that hour too
[15:48:57] <sukhe>	 mutante: yeah that's true though I am not sure how much progress we will make + the additional time for reimaging
[15:49:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org
[15:49:16] <mutante>	 sukhe: the calendar is wide open, no worries
[15:49:18] <sukhe>	 so if sbassett's patch is urgent, I will remove the lock
[15:49:41] <wikibugs>	 (03PS1) 10Elukey: ml-services: add env variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414)
[15:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:50:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[15:50:33] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add env variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey)
[15:50:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add env variable to ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/915727 (https://phabricator.wikimedia.org/T330414) (owner: 10Elukey)
[15:50:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47596 and previous config saved to /var/cache/conftool/dbconfig/20230504-155041-ladsgroup.json
[15:50:48] <wikibugs>	 (03CR) 10Herron: kafkamon: cleanup buster classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[15:50:55] <sbassett>	 sukhe: not urgent no
[15:51:04] <sbassett>	 But would like to deploy before train
[15:51:58] <sukhe>	 sbassett: thanks! and yes, I will make sure that I lift it before that
[15:52:01] <sukhe>	 I will ping you
[15:52:02] <sukhe>	 thanks for checking
[15:52:28] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet
[15:53:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet
[15:53:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org
[15:54:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:54:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[15:54:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org
[15:55:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:55:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T335838)', diff saved to https://phabricator.wikimedia.org/P47597 and previous config saved to /var/cache/conftool/dbconfig/20230504-155518-ladsgroup.json
[15:55:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:55:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:55:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47598 and previous config saved to /var/cache/conftool/dbconfig/20230504-155544-ladsgroup.json
[15:57:19] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kafkamon: cleanup buster classes [puppet] - 10https://gerrit.wikimedia.org/r/915694 (https://phabricator.wikimedia.org/T335424) (owner: 10Herron)
[15:57:32] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[02-12].codfw.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:57:35] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[15:58:14] <wikibugs>	 (03CR) 10Dzahn: "maybe it should be a separate change, with reviewer Ryan Kemper, because we are touching config of production WDQS and WCWS with that part" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn)
[15:58:29] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[15:58:48] <wikibugs>	 (03CR) 10Dzahn: "this change could be merged first or second, doesn't matter until we want to remove the old name" [puppet] - 10https://gerrit.wikimedia.org/r/914881 (owner: 10Dzahn)
[15:58:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47599 and previous config saved to /var/cache/conftool/dbconfig/20230504-155850-ladsgroup.json
[15:59:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org
[16:00:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:00:05] <jouncebot>	 jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet
[16:01:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P47600 and previous config saved to /var/cache/conftool/dbconfig/20230504-160136-ladsgroup.json
[16:01:53] <wikibugs>	 (03PS1) 10Dzahn: wdqs/wcqs: change discovery name of backends for GUIs [puppet] - 10https://gerrit.wikimedia.org/r/915737
[16:01:58] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 3516 MB (3% inode=84%): /tmp 3516 MB (3% inode=84%): /var/tmp 3516 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops
[16:02:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) I think this is superseded by https://phabricator.wikimedia.org/T335941 but will look now into changing my email association.   I was thinking I'd keep lorenjohnson@gmail.com for WikiTech and...
[16:02:31] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet
[16:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47601 and previous config saved to /var/cache/conftool/dbconfig/20230504-160307-ladsgroup.json
[16:03:13] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[16:04:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet
[16:05:31] <wikibugs>	 (03CR) 10Dzahn: "Jelto, I am not even sure if I like my own change, haha. It kind of makes a switch-over for WDQS/WCQS more complex than before, and for th" [puppet] - 10https://gerrit.wikimedia.org/r/915737 (owner: 10Dzahn)
[16:05:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P47602 and previous config saved to /var/cache/conftool/dbconfig/20230504-160547-ladsgroup.json
[16:06:51] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet
[16:07:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet
[16:09:16] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet
[16:10:25] <mutante>	 !log doc1002 (https://doc.wikimedia.org) - reboot, <1 min downtime
[16:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:33] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on doc1002.eqiad.wmnet with reason: reboot
[16:10:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1002.eqiad.wmnet with reason: reboot
[16:10:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but just FYI a re-parse can be achieved with a GET request as well." [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus)
[16:11:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet
[16:12:35] <mutante>	 !log doc1003 - rebooting
[16:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet
[16:13:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P47603 and previous config saved to /var/cache/conftool/dbconfig/20230504-161356-ladsgroup.json
[16:14:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - 10https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus)
[16:14:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2002.codfw.wmnet
[16:15:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover urldownloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/915741
[16:16:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P47604 and previous config saved to /var/cache/conftool/dbconfig/20230504-161643-ladsgroup.json
[16:17:43] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10lojo) Ok, I've now updated my Phabricator address here to loren.johnson@wikimedia.de and aligned my WMDE mediawiki.org account to the same address.
[16:18:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P47605 and previous config saved to /var/cache/conftool/dbconfig/20230504-161813-ladsgroup.json
[16:19:19] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet
[16:20:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P47606 and previous config saved to /var/cache/conftool/dbconfig/20230504-162055-ladsgroup.json
[16:23:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2002.codfw.wmnet
[16:24:17] <wikibugs>	 (03PS1) 10Ssingh: Revert "lvs2011: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/915708
[16:26:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet
[16:26:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[16:27:37] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on gerrit1003.wikimedia.org with reason: reboot
[16:27:37] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "lvs2011: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/915708 (owner: 10Ssingh)
[16:27:55] <mutante>	 !log gerrit1003 (gerrit-new.wikimedia.org) - rebooting
[16:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:01] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gerrit1003.wikimedia.org with reason: reboot
[16:29:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T335845)', diff saved to https://phabricator.wikimedia.org/P47607 and previous config saved to /var/cache/conftool/dbconfig/20230504-162902-ladsgroup.json
[16:29:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[16:29:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[16:29:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47608 and previous config saved to /var/cache/conftool/dbconfig/20230504-162926-ladsgroup.json
[16:30:44] <logmsgbot>	 !log sukhe@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS reimaging in codfw, blocking deploys T326767 (duration: 152m 23s)
[16:30:48] <stashbot>	 T326767: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767
[16:30:59] <sukhe>	 sbassett: please feel free to deploy!
[16:31:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 211.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[16:31:13] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:43] <sukhe>	 ==> deploys are now unblocked
[16:31:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T335838)', diff saved to https://phabricator.wikimedia.org/P47609 and previous config saved to /var/cache/conftool/dbconfig/20230504-163149-ladsgroup.json
[16:32:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[16:33:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:33:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:33:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P47610 and previous config saved to /var/cache/conftool/dbconfig/20230504-163319-ladsgroup.json
[16:33:23] <wikibugs>	 (03PS5) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[16:33:25] <wikibugs>	 (03PS6) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324)
[16:33:27] <wikibugs>	 (03PS6) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231)
[16:33:29] <wikibugs>	 (03PS2) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324)
[16:33:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on etherpad1003.eqiad.wmnet with reason: reboot
[16:34:00] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on etherpad1003.eqiad.wmnet with reason: reboot
[16:34:04] <mutante>	 !log etherpad1003 (https://etherpad.wikimedia.org) rebooting, 1 min downtime
[16:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:13] <herron>	 hnowlan: is that expected?  ^^
[16:34:41] <cwhite>	 got the page
[16:35:19] <hnowlan>	 herron: nope, looking 
[16:36:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47611 and previous config saved to /var/cache/conftool/dbconfig/20230504-163601-ladsgroup.json
[16:36:02] <mutante>	 it feels like the alert is not new but the part that it p.ages is?
[16:36:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[16:36:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet
[16:36:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[16:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[16:36:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T335845)', diff saved to https://phabricator.wikimedia.org/P47612 and previous config saved to /var/cache/conftool/dbconfig/20230504-163626-ladsgroup.json
[16:36:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47613 and previous config saved to /var/cache/conftool/dbconfig/20230504-163646-ladsgroup.json
[16:37:02] <wikibugs>	 (03PS2) 10JMeybohm: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus)
[16:37:03] <wikibugs>	 (03PS4) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[16:37:05] <wikibugs>	 (03PS1) 10JMeybohm: Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660)
[16:38:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:38:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:38:24] <wikibugs>	 (03CR) 10JMeybohm: envoyproxy: Fix most validation errors in the `good` build_envoy_config tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus)
[16:39:23] <jynus>	 !log extending logical volume of backup1003, backup2003 for backup storage
[16:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47614 and previous config saved to /var/cache/conftool/dbconfig/20230504-164004-ladsgroup.json
[16:41:23] <hnowlan>	 cwhite, herron: sorry for the noise - looks like a spike in traffic. I'll add back in some capacity to stop it happening for the rest of the day 
[16:41:35] <herron>	 hnowlan: sounds good thanks!
[16:42:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T335845)', diff saved to https://phabricator.wikimedia.org/P47615 and previous config saved to /var/cache/conftool/dbconfig/20230504-164247-ladsgroup.json
[16:42:51] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet
[16:43:02] <sbassett>	 sukhe: thanks!
[16:43:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sbassett@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett)
[16:44:25] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:44:30] <wikibugs>	 (03Merged) 10jenkins-bot: Re-enable the Graph extension on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/914823 (https://phabricator.wikimedia.org/T334940) (owner: 10SBassett)
[16:44:59] <logmsgbot>	 !log sbassett@deploy1002 Started scap: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]]
[16:45:03] <stashbot>	 T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940
[16:45:06] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:46:00] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet
[16:46:00] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10serviceops-collab: let Eoghan see security tickets in Phabricator - https://phabricator.wikimedia.org/T335981 (10Dzahn)
[16:46:08] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:18] <wikibugs>	 10SRE-Access-Requests, 10Phabricator, 10serviceops-collab: let Eoghan see security tickets in Phabricator - https://phabricator.wikimedia.org/T335981 (10Dzahn)
[16:46:26] <logmsgbot>	 !log sbassett@deploy1002 sbassett: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[16:48:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T335838)', diff saved to https://phabricator.wikimedia.org/P47616 and previous config saved to /var/cache/conftool/dbconfig/20230504-164826-ladsgroup.json
[16:48:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:48:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47617 and previous config saved to /var/cache/conftool/dbconfig/20230504-164850-ladsgroup.json
[16:51:36] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet
[16:51:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47618 and previous config saved to /var/cache/conftool/dbconfig/20230504-165152-ladsgroup.json
[16:52:03] <logmsgbot>	 !log sbassett@deploy1002 Finished scap: Backport for [[gerrit:914823|Re-enable the Graph extension on test2wiki (T334940)]] (duration: 07m 04s)
[16:52:06] <stashbot>	 T334940: All Graphs broken on Wikimedia wikis (due to security issue T334895) - https://phabricator.wikimedia.org/T334940
[16:52:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet
[16:53:58] <wikibugs>	 (03Abandoned) 10Raymond Ndibe: profile:toolforge:harbor: setup blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/910798 (https://phabricator.wikimedia.org/T325165) (owner: 10Raymond Ndibe)
[16:55:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P47619 and previous config saved to /var/cache/conftool/dbconfig/20230504-165511-ladsgroup.json
[16:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47620 and previous config saved to /var/cache/conftool/dbconfig/20230504-165521-ladsgroup.json
[16:57:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P47621 and previous config saved to /var/cache/conftool/dbconfig/20230504-165753-ladsgroup.json
[16:58:48] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet
[16:58:54] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:06] <jouncebot>	 brennen and mutante: OwO what's this, a deployment window?? Phabricator update window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700). nyaa~
[17:00:06] <jouncebot>	 bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700).
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700)
[17:00:17] <brennen>	 o/
[17:00:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host lvs2011.codfw.wmnet
[17:00:50] <mutante>	 jouncebot: now
[17:00:50] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Phabricator update window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700)
[17:00:51] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700)
[17:00:51] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1700)
[17:01:07] <mutante>	 !log Phabricator upgrade - maintenance incoming
[17:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:17] <dancy>	 Good luck!
[17:01:36] <brennen>	 a minute or two while i juggle deployment repo state and then i'm ready to run scap
[17:01:41] <mutante>	 ty:) 
[17:01:55] <sukhe>	 $deityspeed mutante!
[17:02:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: maintenance upgrade
[17:02:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: maintenance upgrade
[17:03:03] <wikibugs>	 (03PS2) 10JMeybohm: Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660)
[17:03:05] <wikibugs>	 (03PS5) 10JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - 10https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: 10RLazarus)
[17:03:10] <mutante>	 thanks, brennen does the actual work:)
[17:03:29] <mutante>	 downtimed
[17:03:40] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: maintenance upgrade
[17:03:49] <mutante>	 brennen: phab2002 will go first?
[17:03:51] <brennen>	 mutante: cool, will do phab2002 then the prod box.
[17:03:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: maintenance upgrade
[17:03:54] <mutante>	 :) 
[17:04:15] <bd808>	 Nothing for me to deploy in the Technical Engagement slot this week.
[17:04:31] <mutante>	 we should also reboot aphlict if we have time for it
[17:04:33] <mutante>	 in the window
[17:04:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) User has been contacted for verification
[17:04:59] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet
[17:05:00] <brennen>	 thx bd808
[17:05:11] <brennen>	 (sorry to step on window)
[17:05:15] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab2002
[17:05:23] <bd808>	 brennen: it's all good :)
[17:05:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki)
[17:05:40] <wikibugs>	 (03CR) 10JMeybohm: "After this we can merge https://gerrit.wikimedia.org/r/c/integration/config/+/914785" [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) (owner: 10JMeybohm)
[17:05:47] <mutante>	 there is still that ONE Icinga check left, that is for phab.wmfusercontent.org
[17:05:53] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab2002 (duration: 00m 37s)
[17:05:58] <mutante>	 and cookbook won't find that host 
[17:06:04] <mutante>	 because it's a service name
[17:06:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P47622 and previous config saved to /var/cache/conftool/dbconfig/20230504-170658-ladsgroup.json
[17:07:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) uid: sg912 uidNumber: 41194
[17:07:56] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@0529926]: deploy latest state to phab1004
[17:08:31] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@0529926]: deploy latest state to phab1004 (duration: 00m 34s)
[17:08:51] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet
[17:09:42] <brennen>	 !log phab1004 deployed and restarted, phab up, MR widget still seems to work
[17:09:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:59] <mutante>	 :)
[17:10:04] <mutante>	 wfm
[17:10:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P47623 and previous config saved to /var/cache/conftool/dbconfig/20230504-171017-ladsgroup.json
[17:10:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P47624 and previous config saved to /var/cache/conftool/dbconfig/20230504-171028-ladsgroup.json
[17:11:26] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet
[17:12:52] <brennen>	 mutante: i'm good, no objections here to an aphlict restart if needed
[17:13:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P47625 and previous config saved to /var/cache/conftool/dbconfig/20230504-171300-ladsgroup.json
[17:13:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:13:32] <mutante>	 brennen: great! thank you, one moment
[17:15:23] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet
[17:16:20] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:43] <mutante>	 !log aphlict2001 - not active, rebooting
[17:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:19:43] <wikibugs>	 (03PS1) 10Raymond Ndibe: toolforge: add tekton metrics to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/915771 (https://phabricator.wikimedia.org/T325163)
[17:22:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T335845)', diff saved to https://phabricator.wikimedia.org/P47626 and previous config saved to /var/cache/conftool/dbconfig/20230504-172204-ladsgroup.json
[17:22:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[17:22:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[17:22:25] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11-21].eqiad.wmnet: Upgrade Cassandra — T335383 - eevans@cumin1001
[17:22:28] <stashbot>	 T335383: Upgrade Cassandra to latest 3.11.x (3.11.14) - https://phabricator.wikimedia.org/T335383
[17:22:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47627 and previous config saved to /var/cache/conftool/dbconfig/20230504-172228-ladsgroup.json
[17:24:48] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet
[17:24:49] <wikibugs>	 (03PS3) 10Ebernhardson: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199)
[17:25:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T335838)', diff saved to https://phabricator.wikimedia.org/P47628 and previous config saved to /var/cache/conftool/dbconfig/20230504-172523-ladsgroup.json
[17:25:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet
[17:25:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[17:25:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P47629 and previous config saved to /var/cache/conftool/dbconfig/20230504-172534-ladsgroup.json
[17:25:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[17:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47630 and previous config saved to /var/cache/conftool/dbconfig/20230504-172546-ladsgroup.json
[17:25:49] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[17:26:38] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict2001.codfw.wmnet
[17:27:13] <wikibugs>	 (03PS1) 10Effie Mouzeli: data.yaml: Add Hasan Akgün (WMDE) [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101)
[17:28:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T335845)', diff saved to https://phabricator.wikimedia.org/P47631 and previous config saved to /var/cache/conftool/dbconfig/20230504-172806-ladsgroup.json
[17:28:08] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] thanos: Migrate from 100-scale to unit-scale SLO recording rules [puppet] - 10https://gerrit.wikimedia.org/r/914945 (https://phabricator.wikimedia.org/T289615) (owner: 10RLazarus)
[17:28:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[17:28:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[17:28:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:28:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:28:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47632 and previous config saved to /var/cache/conftool/dbconfig/20230504-172835-ladsgroup.json
[17:28:38] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:04] <wikibugs>	 (03PS2) 10Effie Mouzeli: data.yaml: Add Hasan Akgün (WMDE) to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101)
[17:29:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47633 and previous config saved to /var/cache/conftool/dbconfig/20230504-172932-ladsgroup.json
[17:30:46] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict2001.codfw.wmnet
[17:30:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on people1003.eqiad.wmnet with reason: maintenance upgrade
[17:31:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on people1003.eqiad.wmnet with reason: maintenance upgrade
[17:31:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet
[17:31:27] <mutante>	 !log people1003 - rebooting
[17:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:43] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host aphlict1002.eqiad.wmnet
[17:32:09] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[17:32:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet
[17:32:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[17:32:32] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[17:32:47] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[17:33:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47634 and previous config saved to /var/cache/conftool/dbconfig/20230504-173309-ladsgroup.json
[17:35:38] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict1002.eqiad.wmnet
[17:35:46] <wikibugs>	 (03PS1) 10Effie Mouzeli: data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438)
[17:35:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47635 and previous config saved to /var/cache/conftool/dbconfig/20230504-173555-ladsgroup.json
[17:37:10] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[17:37:19] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[17:38:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10jijiki)
[17:40:06] <wikibugs>	 (03CR) 10Krinkle: "@Marostegui We're ready for it. The puppet patches are ready from my side. We might make minor tweaks still but this and prod equiv can go" [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle)
[17:40:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed LDAP and has approval from Tyler" [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) (owner: 10Effie Mouzeli)
[17:40:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T335838)', diff saved to https://phabricator.wikimedia.org/P47637 and previous config saved to /var/cache/conftool/dbconfig/20230504-174040-ladsgroup.json
[17:41:15] <wikibugs>	 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Dzahn)
[17:42:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet
[17:42:57] <wikibugs>	 (03PS2) 10Effie Mouzeli: data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438)
[17:42:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: data.yaml: Add Julia Kieserman to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529)
[17:44:38] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet
[17:44:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47638 and previous config saved to /var/cache/conftool/dbconfig/20230504-174438-ladsgroup.json
[17:45:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, confirmed LDAP and has group approval" [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) (owner: 10Effie Mouzeli)
[17:47:06] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:47:16] <icinga-wm>	 PROBLEM - puppet last run on prometheus5002 is CRITICAL: CRITICAL: Puppet has been disabled for 604811 seconds, message: Disabling Puppet and Thanos sidecar as part of the migration of Prometheus hosts to Bullseye - T309979 - denisse, last run 6 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:48:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P47639 and previous config saved to /var/cache/conftool/dbconfig/20230504-174815-ladsgroup.json
[17:48:34] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[17:48:45] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[17:48:58] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet
[17:51:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P47640 and previous config saved to /var/cache/conftool/dbconfig/20230504-175102-ladsgroup.json
[17:51:16] <wikibugs>	 (03PS1) 10Dwisehaupt: Direct frbast.wm.o at the new frbast1002 host [dns] - 10https://gerrit.wikimedia.org/r/915811 (https://phabricator.wikimedia.org/T319460)
[17:51:20] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet
[17:53:14] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[17:53:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[17:54:41] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[17:54:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Dzahn) for the record: It's perfectly fine to have 2 accounts, one with work email and one with volunteer/personal email, if you really prefer that. Some people do this, others don't.
[17:54:51] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[17:58:52] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Direct frbast.wm.o at the new frbast1002 host [dns] - 10https://gerrit.wikimedia.org/r/915811 (https://phabricator.wikimedia.org/T319460) (owner: 10Dwisehaupt)
[17:59:01] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet
[17:59:32] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P47641 and previous config saved to /var/cache/conftool/dbconfig/20230504-175945-ladsgroup.json
[18:00:05] <jouncebot>	 brennen and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T1800).
[18:01:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki) @odimitrijevic or @Ottomata can you please approve this request for the group analytics-privatedata-users ?
[18:02:01] <brennen>	 o/
[18:03:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P47642 and previous config saved to /var/cache/conftool/dbconfig/20230504-180322-ladsgroup.json
[18:03:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[18:03:44] <brennen>	 checking out a couple of log messages before rolling train.
[18:04:42] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet
[18:04:48] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[18:04:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet
[18:04:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[18:05:54] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[18:06:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[18:06:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P47643 and previous config saved to /var/cache/conftool/dbconfig/20230504-180608-ladsgroup.json
[18:08:11] <brennen>	 !log train 1.41.0-wmf.7 (T330213): logs fairly quiet and no current blockers, rolling to group2
[18:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:15] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[18:08:45] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213)
[18:08:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[18:09:53] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915812 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[18:11:06] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet
[18:11:16] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[18:11:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[18:12:11] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[18:12:30] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification provided)
[18:12:59] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 28s)
[18:13:05] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011']
[18:14:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt)
[18:14:43] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs2011']
[18:14:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T335845)', diff saved to https://phabricator.wikimedia.org/P47644 and previous config saved to /var/cache/conftool/dbconfig/20230504-181451-ladsgroup.json
[18:14:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:14:59] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet
[18:15:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10Ottomata) Approved
[18:15:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[18:15:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47645 and previous config saved to /var/cache/conftool/dbconfig/20230504-181516-ladsgroup.json
[18:15:39] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs2011']
[18:16:35] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.7  refs T330213
[18:16:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47646 and previous config saved to /var/cache/conftool/dbconfig/20230504-181636-ladsgroup.json
[18:16:40] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[18:16:56] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:05] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[18:17:16] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[18:17:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:18:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T335838)', diff saved to https://phabricator.wikimedia.org/P47647 and previous config saved to /var/cache/conftool/dbconfig/20230504-181828-ladsgroup.json
[18:18:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[18:18:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[18:18:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T335838)', diff saved to https://phabricator.wikimedia.org/P47648 and previous config saved to /var/cache/conftool/dbconfig/20230504-181851-ladsgroup.json
[18:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T335845)', diff saved to https://phabricator.wikimedia.org/P47649 and previous config saved to /var/cache/conftool/dbconfig/20230504-182114-ladsgroup.json
[18:21:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[18:21:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[18:21:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47650 and previous config saved to /var/cache/conftool/dbconfig/20230504-182139-ladsgroup.json
[18:21:44] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet
[18:22:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:22:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47651 and previous config saved to /var/cache/conftool/dbconfig/20230504-182238-ladsgroup.json
[18:23:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47652 and previous config saved to /var/cache/conftool/dbconfig/20230504-182301-ladsgroup.json
[18:24:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T335838)', diff saved to https://phabricator.wikimedia.org/P47653 and previous config saved to /var/cache/conftool/dbconfig/20230504-182418-ladsgroup.json
[18:24:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet
[18:28:02] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:45] <logmsgbot>	 !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[18:28:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[18:29:31] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - bump to v1.15.0-dev2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915817 (https://phabricator.wikimedia.org/T332948)
[18:29:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[18:30:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47654 and previous config saved to /var/cache/conftool/dbconfig/20230504-183010-ladsgroup.json
[18:31:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet
[18:31:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet
[18:37:05] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification provided)
[18:37:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - bump to v1.15.0-dev2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915817 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata)
[18:37:15] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 09s)
[18:37:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P47655 and previous config saved to /var/cache/conftool/dbconfig/20230504-183744-ladsgroup.json
[18:38:32] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet
[18:39:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P47656 and previous config saved to /var/cache/conftool/dbconfig/20230504-183925-ladsgroup.json
[18:44:30] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet
[18:45:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P47657 and previous config saved to /var/cache/conftool/dbconfig/20230504-184516-ladsgroup.json
[18:46:40] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet
[18:50:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet
[18:51:05] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10Aklapper) [general comment] For staff and contractors of legal entities I strongly recommend using an account for paid work that's clearly identifiable as a work account for the sake of transparency...
[18:52:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P47658 and previous config saved to /var/cache/conftool/dbconfig/20230504-185250-ladsgroup.json
[18:52:54] <wikibugs>	 (03PS1) 10Jdlrobson: Fix file page integration [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997)
[18:54:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet
[18:54:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P47659 and previous config saved to /var/cache/conftool/dbconfig/20230504-185431-ladsgroup.json
[18:55:08] <jynus>	 what's ipmiseld.service ?
[18:55:42] <jynus>	 it got resolved, never mind
[18:57:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper) @FJoseph-WMF: Hi, please follow the docs at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Turnilo#Access and see the corresponding Phabricator form which has a...
[18:59:06] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:54] <icinga-wm>	 RECOVERY - Check systemd state on elastic1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P47660 and previous config saved to /var/cache/conftool/dbconfig/20230504-190022-ladsgroup.json
[19:01:13] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9400.service Failed on elastic1091:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:02:32] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@88ebdf7]: (no justification provided)
[19:02:36] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@88ebdf7]: (no justification provided) (duration: 00m 03s)
[19:04:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet
[19:04:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet
[19:07:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T335845)', diff saved to https://phabricator.wikimedia.org/P47661 and previous config saved to /var/cache/conftool/dbconfig/20230504-190757-ladsgroup.json
[19:08:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T335838)', diff saved to https://phabricator.wikimedia.org/P47662 and previous config saved to /var/cache/conftool/dbconfig/20230504-190937-ladsgroup.json
[19:09:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[19:09:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[19:10:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47663 and previous config saved to /var/cache/conftool/dbconfig/20230504-191001-ladsgroup.json
[19:10:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet
[19:11:17] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet
[19:15:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47664 and previous config saved to /var/cache/conftool/dbconfig/20230504-191528-ladsgroup.json
[19:16:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47665 and previous config saved to /var/cache/conftool/dbconfig/20230504-191612-ladsgroup.json
[19:16:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47666 and previous config saved to /var/cache/conftool/dbconfig/20230504-191623-ladsgroup.json
[19:21:20] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet
[19:23:34] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet
[19:27:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47667 and previous config saved to /var/cache/conftool/dbconfig/20230504-192747-ladsgroup.json
[19:28:05] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet
[19:29:09] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1053.eqiad.wmnet
[19:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P47668 and previous config saved to /var/cache/conftool/dbconfig/20230504-193118-ladsgroup.json
[19:31:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P47669 and previous config saved to /var/cache/conftool/dbconfig/20230504-193129-ladsgroup.json
[19:32:19] <wikibugs>	 (03PS1) 10Dzahn: gerrit: disable replication from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730)
[19:38:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet
[19:38:16] <wikibugs>	 (03CR) 10Dzahn: "So, if $replication = lookup('profile::gerrit::replication'),is not set then in modules/gerrit/templates/replication.config.erb the "remot" [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: 10Dzahn)
[19:38:41] <wikibugs>	 (03PS2) 10Dzahn: gerrit: disable replication from gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730)
[19:42:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet
[19:42:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P47670 and previous config saved to /var/cache/conftool/dbconfig/20230504-194254-ladsgroup.json
[19:43:52] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "actual compiler diff: https://puppet-compiler.wmflabs.org/output/915830/41055/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: 10Dzahn)
[19:44:33] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T335684 (10phaultfinder)
[19:45:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet
[19:46:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P47671 and previous config saved to /var/cache/conftool/dbconfig/20230504-194624-ladsgroup.json
[19:46:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P47672 and previous config saved to /var/cache/conftool/dbconfig/20230504-194635-ladsgroup.json
[19:47:08] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:49:14] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet
[19:52:38] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10Jclark-ctr) a:03Jclark-ctr
[19:56:37] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - bump t0 v1.15.0-dev3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915832
[19:57:52] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P47673 and previous config saved to /var/cache/conftool/dbconfig/20230504-195800-ladsgroup.json
[19:58:04] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - bump t0 v1.15.0-dev3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915832 (owner: 10Ottomata)
[20:00:05] <jouncebot>	 brennen and TheresNoTime: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230504T2000).
[20:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:50] <mutante>	 !log people2002 (people.wikimedia.org) reboot, <1 min downtime
[20:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on people2002.codfw.wmnet with reason: maintenance upgrade
[20:01:18] <Jdlrobson>	 present
[20:01:30] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on people2002.codfw.wmnet with reason: maintenance upgrade
[20:01:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T335845)', diff saved to https://phabricator.wikimedia.org/P47674 and previous config saved to /var/cache/conftool/dbconfig/20230504-200131-ladsgroup.json
[20:01:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T335838)', diff saved to https://phabricator.wikimedia.org/P47675 and previous config saved to /var/cache/conftool/dbconfig/20230504-200141-ladsgroup.json
[20:01:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[20:01:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[20:03:22] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on miscweb1003.eqiad.wmnet with reason: reboot
[20:03:24] <Jdlrobson>	 brennen: around? I think Sammy is out today.
[20:03:30] <brennen>	 Jdlrobson: yeah, i can sling that out
[20:03:35] <Jdlrobson>	 thanks :)
[20:03:35] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on miscweb1003.eqiad.wmnet with reason: reboot
[20:03:53] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - remove python.fn-execution.bundle.size setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/915834 (https://phabricator.wikimedia.org/T332948)
[20:03:58] <mutante>	 !log miscweb1003 - rebooting
[20:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997) (owner: 10Jdlrobson)
[20:05:27] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - remove python.fn-execution.bundle.size setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/915834 (https://phabricator.wikimedia.org/T332948) (owner: 10Ottomata)
[20:06:15] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:06:18] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:06:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[20:06:30] <wikibugs>	 (03Merged) 10jenkins-bot: Fix file page integration [extensions/MultimediaViewer] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915709 (https://phabricator.wikimedia.org/T335997) (owner: 10Jdlrobson)
[20:06:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[20:06:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47676 and previous config saved to /var/cache/conftool/dbconfig/20230504-200644-ladsgroup.json
[20:06:48] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:915709|Fix file page integration (T335997)]]
[20:06:51] <stashbot>	 T335997: MMV broken on file page (TypeError: Cannot read properties of undefined (reading 'then') / TypeError: undefined is not an object (evaluating 'bs.openImage(this,title).then')) - https://phabricator.wikimedia.org/T335997
[20:08:14] <logmsgbot>	 !log brennen@deploy1002 brennen and jdlrobson: Backport for [[gerrit:915709|Fix file page integration (T335997)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[20:08:30] <brennen>	 Jdlrobson: lemme know when to proceed
[20:09:01] <Jdlrobson>	 brennen: looking now
[20:09:50] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T335966 (10Jclark-ctr) 05Open→03Resolved reseated power supply
[20:10:23] <Jdlrobson>	 LGTM brennen please sync
[20:11:50] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] mediawiki-cache-warmup: Rename `Request` to `Task` [puppet] - 10https://gerrit.wikimedia.org/r/892569 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus)
[20:12:07] <brennen>	 goin'
[20:12:08] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: remove redundant usage [cookbooks] - 10https://gerrit.wikimedia.org/r/915092 (owner: 10Ryan Kemper)
[20:12:34] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: extend downtime for operations [cookbooks] - 10https://gerrit.wikimedia.org/r/915836
[20:13:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T335845)', diff saved to https://phabricator.wikimedia.org/P47677 and previous config saved to /var/cache/conftool/dbconfig/20230504-201306-ladsgroup.json
[20:13:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[20:13:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[20:13:29] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: extend downtime for operations [cookbooks] - 10https://gerrit.wikimedia.org/r/915836 (owner: 10Ryan Kemper)
[20:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47678 and previous config saved to /var/cache/conftool/dbconfig/20230504-201332-ladsgroup.json
[20:14:45] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - v1.15.0-dev4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915837
[20:15:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47679 and previous config saved to /var/cache/conftool/dbconfig/20230504-201514-ladsgroup.json
[20:16:26] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:38] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:915709|Fix file page integration (T335997)]] (duration: 10m 50s)
[20:17:42] <stashbot>	 T335997: MMV broken on file page (TypeError: Cannot read properties of undefined (reading 'then') / TypeError: undefined is not an object (evaluating 'bs.openImage(this,title).then')) - https://phabricator.wikimedia.org/T335997
[20:18:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - v1.15.0-dev4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915837 (owner: 10Ottomata)
[20:18:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:19:10] <wikibugs>	 (03PS5) 10RLazarus: mediawiki-cache-warmup: Add POSTs [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989)
[20:19:39] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:19:42] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:19:48] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] mediawiki-cache-warmup: Add POSTs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/892570 (https://phabricator.wikimedia.org/T290989) (owner: 10RLazarus)
[20:19:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47680 and previous config saved to /var/cache/conftool/dbconfig/20230504-201955-ladsgroup.json
[20:23:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:27:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF) @Aklapper I'm new to the foundation. I reached out to ITS they said to open a phab ticket and tag SRE. There was an autocomplete option for SRE access request - it seemed t...
[20:28:44] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:06] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops
[20:29:51] <wikibugs>	 (03CR) 10Xcollazo: "PPC is happy with the changes https://puppet-compiler.wmflabs.org/output/914928/41056/" [puppet] - 10https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: 10Xcollazo)
[20:30:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P47681 and previous config saved to /var/cache/conftool/dbconfig/20230504-203021-ladsgroup.json
[20:30:32] <brennen>	 gonna do a train rollback here.
[20:31:10] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213)
[20:31:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[20:32:01] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915841 (https://phabricator.wikimedia.org/T330213) (owner: 10TrainBranchBot)
[20:33:37] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10RLazarus) 892570 is merged now, and I think we'll be in better shape for the next one. @Clement_Goubert I'm tempted to resolve this, and reopen if we...
[20:35:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P47682 and previous config saved to /var/cache/conftool/dbconfig/20230504-203501-ladsgroup.json
[20:35:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10FJoseph-WMF) This ticket can be closed out.
[20:35:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to fjoseph for Fjoseph - https://phabricator.wikimedia.org/T336009 (10FJoseph-WMF)
[20:40:16] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:41:14] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:45:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P47683 and previous config saved to /var/cache/conftool/dbconfig/20230504-204527-ladsgroup.json
[20:46:55] <wikibugs>	 (03CR) 10Xcollazo: "This is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/914928 (https://phabricator.wikimedia.org/T335721) (owner: 10Xcollazo)
[20:47:28] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P47684 and previous config saved to /var/cache/conftool/dbconfig/20230504-205007-ladsgroup.json
[20:51:13] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.7  refs T330213
[20:51:16] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[20:52:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper) >>! In T335967#8828179, @FJoseph-WMF wrote: > they said to open a phab ticket and tag SRE  Ah, thanks a lot, that's useful to know. I asked as I'm always curious how to improv...
[20:52:15] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to fjoseph for Fjoseph - https://phabricator.wikimedia.org/T336009 (10Aklapper)
[20:52:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:52:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf and Turnilo for Fjoseph - https://phabricator.wikimedia.org/T336009 (10Aklapper)
[20:52:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update access permissions to turnilo for FJoseph - https://phabricator.wikimedia.org/T335967 (10Aklapper)
[20:57:16] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.7  refs T330213 (duration: 06m 02s)
[20:57:20] <stashbot>	 T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213
[20:57:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:58:16] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T335838)', diff saved to https://phabricator.wikimedia.org/P47685 and previous config saved to /var/cache/conftool/dbconfig/20230504-210033-ladsgroup.json
[21:00:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[21:00:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[21:00:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47686 and previous config saved to /var/cache/conftool/dbconfig/20230504-210057-ladsgroup.json
[21:02:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:05:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47687 and previous config saved to /var/cache/conftool/dbconfig/20230504-210513-ladsgroup.json
[21:05:57] <wikibugs>	 (03CR) 10Btullis: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[21:09:05] <wikibugs>	 (03CR) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[21:09:16] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:09:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47688 and previous config saved to /var/cache/conftool/dbconfig/20230504-210928-ladsgroup.json
[21:14:00] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:16:58] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:30] <wikibugs>	 (03PS56) 10Btullis: ceph: Add puppet management of OSDs on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[21:18:33] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41058/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[21:21:18] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[21:22:58] <wikibugs>	 (03PS19) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[21:24:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P47689 and previous config saved to /var/cache/conftool/dbconfig/20230504-212434-ladsgroup.json
[21:26:05] <wikibugs>	 (03PS1) 10Eevans: deployment-prep: upgrade Cassandra (restbase) to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915846 (https://phabricator.wikimedia.org/T335383)
[21:28:04] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:52] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] deployment-prep: upgrade Cassandra (restbase) to 3.11.14 [puppet] - 10https://gerrit.wikimedia.org/r/915846 (https://phabricator.wikimedia.org/T335383) (owner: 10Eevans)
[21:36:26] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P47690 and previous config saved to /var/cache/conftool/dbconfig/20230504-213941-ladsgroup.json
[21:42:32] <wikibugs>	 (03PS1) 10Brennen Bearnes: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008)
[21:44:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:44:45] <wikibugs>	 (03PS1) 10Barakat Ajadi: CentralNoticeTiming:  Remove enablement of the topic for legacy eventlogging and refinery [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550)
[21:45:30] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: 10Brennen Bearnes)
[21:45:50] <Amir1>	 brennen: I'm deploying the fix. Wanna roll forward afterwards?
[21:46:52] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CentralNoticeTiming:  Remove enablement of the topic for legacy eventlogging and refinery [puppet] - 10https://gerrit.wikimedia.org/r/915850 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi)
[21:47:38] <brennen>	 Amir1: from discussion at https://phabricator.wikimedia.org/T336008#8828350 i'm not sure this will reduce the error rate?
[21:50:42] <Amir1>	 ah, yeah, it's mostly for logging
[21:51:08] <brennen>	 let's see if it surfaces any useful logging at group1?
[21:51:45] <brennen>	 i'm kind of fried here, i think it might be more sensible of me to pause train for the day.
[21:52:15] <brennen>	 i have already tempted the deployment gods enough for one day
[21:54:12] <wikibugs>	 (CR) Ladsgroup: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: Brennen Bearnes)
[21:54:25] <Amir1>	 okay, I removed my +2 from the backport
[21:54:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T335838)', diff saved to https://phabricator.wikimedia.org/P47691 and previous config saved to /var/cache/conftool/dbconfig/20230504-215447-ladsgroup.json
[21:54:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[21:55:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[21:55:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47692 and previous config saved to /var/cache/conftool/dbconfig/20230504-215511-ladsgroup.json
[21:58:09] <brennen>	 Amir1: i'll go ahead with the backport, but leave wmf.7 at group1.  hopefully that gets something useful on the bug.
[21:58:25] <wikibugs>	 (CR) Ladsgroup: [C: +2] api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: Brennen Bearnes)
[21:58:32] <Amir1>	 okay :D
[21:58:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: Brennen Bearnes)
[21:58:51] <brennen>	 in the morning perhaps i will be smarter. :D
[21:59:10] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:47] <Amir1>	 well, technically it's morning here now
[22:01:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47693 and previous config saved to /var/cache/conftool/dbconfig/20230504-220127-ladsgroup.json
[22:02:35] <wikibugs>	 (Merged) jenkins-bot: api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915710 (https://phabricator.wikimedia.org/T336008) (owner: Brennen Bearnes)
[22:03:03] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]]
[22:03:07] <stashbot>	 T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008
[22:03:31] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[22:04:32] <logmsgbot>	 !log brennen@deploy1002 brennen: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[22:12:11] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:915710|api: Use Status::isGood in ApiQueryRevisionsBase::getRevisionRecords (T336008)]] (duration: 09m 07s)
[22:12:18] <stashbot>	 T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008
[22:12:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:16:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P47694 and previous config saved to /var/cache/conftool/dbconfig/20230504-221633-ladsgroup.json
[22:17:34] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:22:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host lvs2011.codfw.wmnet
[22:28:22] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[22:31:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P47695 and previous config saved to /var/cache/conftool/dbconfig/20230504-223139-ladsgroup.json
[22:46:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T335838)', diff saved to https://phabricator.wikimedia.org/P47696 and previous config saved to /var/cache/conftool/dbconfig/20230504-224646-ladsgroup.json
[22:46:48] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:46:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:47:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:49:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:49:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:49:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[22:50:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[22:50:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47697 and previous config saved to /var/cache/conftool/dbconfig/20230504-225013-ladsgroup.json
[22:53:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[22:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[22:53:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47698 and previous config saved to /var/cache/conftool/dbconfig/20230504-225336-ladsgroup.json
[22:57:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47699 and previous config saved to /var/cache/conftool/dbconfig/20230504-225747-ladsgroup.json
[22:59:12] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:00:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47700 and previous config saved to /var/cache/conftool/dbconfig/20230504-230001-ladsgroup.json
[23:08:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:12:32] <wikibugs>	 SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (RLazarus) Open→Resolved Boldly resolving -- last I heard from Haroon, everyone's satisfied with this explanation. Feel free to reopen if there are...
[23:12:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P47701 and previous config saved to /var/cache/conftool/dbconfig/20230504-231254-ladsgroup.json
[23:15:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47702 and previous config saved to /var/cache/conftool/dbconfig/20230504-231507-ladsgroup.json
[23:17:58] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P47703 and previous config saved to /var/cache/conftool/dbconfig/20230504-232800-ladsgroup.json
[23:29:06] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47704 and previous config saved to /var/cache/conftool/dbconfig/20230504-233013-ladsgroup.json
[23:43:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T335845)', diff saved to https://phabricator.wikimedia.org/P47705 and previous config saved to /var/cache/conftool/dbconfig/20230504-234306-ladsgroup.json
[23:43:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:43:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:43:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T335845)', diff saved to https://phabricator.wikimedia.org/P47706 and previous config saved to /var/cache/conftool/dbconfig/20230504-234330-ladsgroup.json
[23:44:44] <wikibugs>	 SRE: How quickly is a vandalism revision propogated through the system and available through the Action APIs - https://phabricator.wikimedia.org/T334752 (HShaikh) Sorry I thought I had already responded but it seems I forgot to hit submit on the ticket. Reuven is correct we are fine with the current explanat...
[23:45:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T335845)', diff saved to https://phabricator.wikimedia.org/P47707 and previous config saved to /var/cache/conftool/dbconfig/20230504-234520-ladsgroup.json
[23:45:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[23:45:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[23:45:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T335845)', diff saved to https://phabricator.wikimedia.org/P47708 and previous config saved to /var/cache/conftool/dbconfig/20230504-234544-ladsgroup.json
[23:46:32] <icinga-wm>	 PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T335845)', diff saved to https://phabricator.wikimedia.org/P47709 and previous config saved to /var/cache/conftool/dbconfig/20230504-234840-ladsgroup.json
[23:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T335845)', diff saved to https://phabricator.wikimedia.org/P47710 and previous config saved to /var/cache/conftool/dbconfig/20230504-235326-ladsgroup.json
[23:59:18] <icinga-wm>	 RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state