[00:00:01] (Abandoned) Cwhite: profile: add loki output support to the logstash pipeline [puppet] - https://gerrit.wikimedia.org/r/602490 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[00:01:47] (Abandoned) Cwhite: pontoon: add pontoon logging environment [puppet] - https://gerrit.wikimedia.org/r/710056 (owner: Cwhite)
[00:01:57] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:02:32] !log jhuneidi@deploy1002 Installing scap version "4.37.0" for 564 hosts
[00:02:56] !log jhuneidi@deploy1002 Installation of scap version "4.37.0" completed for 564 hosts
[00:03:16] (Abandoned) Cwhite: service::docker: enhance volume support [puppet] - https://gerrit.wikimedia.org/r/605343 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[00:04:03] PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:09] (Abandoned) Cwhite: apifeatureusage: clean up legacy apifeatureusage config [puppet] - https://gerrit.wikimedia.org/r/747636 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[00:08:34] (Abandoned) Cwhite: smart: remove unused function get_raid_drivers [puppet] - https://gerrit.wikimedia.org/r/832570 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite)
[00:08:37] (CR) BCornwall: "PCC output looks good (https://puppet-compiler.wmflabs.org/output/889892/39697/).
It seems there's an unrelated problem with vrts-1001.dev" [puppet] - https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: BCornwall)
[00:09:11] SRE, Privacy Engineering, Traffic, Patch-For-Review: Remove obsolete "Permissions-Policy: interest-cohort" header - https://phabricator.wikimedia.org/T312823 (BCornwall) a:BCornwall
[00:15:05] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS1299/IPv6: Idle - Telia, AS1299/IPv4: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:30:04] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:08] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:24] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase[2012,2015-2018,2020,2022,2023,2025-2027].codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[00:55:26] SRE, ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (Papaul)
[01:07:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:12:03] PROBLEM - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase[2012,2015-2018,2020,2022,2023,2025-2027].codfw.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001
[04:52:47] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[04:53:44] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[04:57:49] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[06:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230217T0700)
[07:06:25] (CR) Elukey: fix(presto): create pkcs12 server file with intermediate certificate (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230217T0800)
[08:02:55] (CR) Elukey: fix(presto): create pkcs12 server file with intermediate certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:05:36] (PS37) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:07:44] (CR) CI reject: [V: -1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[08:09:47] (PS4) Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361)
[08:11:16] (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39698/console" [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:11:43] SRE, Traffic, serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (Vgutierrez) AFAIK this was only impacting envoy as the TLS terminator of the CDN and we went with HAProxy so this can be closed
[08:13:38] (PS38) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:15:41] (CR) CI reject: [V: -1] Update analytics_test conf compatibility with
airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[08:16:24] (CR) Nicolas Fraison: [V: +1] fix(presto): create pkcs12 server file with intermediate certificate (2 comments) [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[08:25:11] (PS12) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[08:27:16] (CR) CI reject: [V: -1] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:30:51] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.makevm for new host idm2001.wikimedia.org
[08:30:52] !log slyngshede@cumin1001 START - Cookbook sre.dns.netbox
[08:33:22] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm2001.wikimedia.org - slyngshede@cumin1001"
[08:35:08] (PS13) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[08:35:44] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idm2001.wikimedia.org - slyngshede@cumin1001"
[08:35:44] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:35:44] !log slyngshede@cumin1001 START - Cookbook sre.dns.wipe-cache idm2001.wikimedia.org on all recursors
[08:35:47] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idm2001.wikimedia.org on all recursors
[08:40:40] (PS14) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[08:41:25] (PS39) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] -
https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:41:46] (CR) CI reject: [V: -1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[08:41:56] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39700/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:42:58] (PS40) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:45:06] (CR) CI reject: [V: -1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[08:45:26] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host idm2001.wikimedia.org
[08:49:34] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/889881 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[08:50:23] (CR) Filippo Giunchedi: [C: +1] "Neat" [alerts] - https://gerrit.wikimedia.org/r/889887 (https://phabricator.wikimedia.org/T187708) (owner: BCornwall)
[08:53:45] (PS41) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[08:54:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[08:58:03] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:00:26] (PS15) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[09:05:50] (PS16) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[09:07:03] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39703/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[09:16:42] (PS5) Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539
[09:16:58] (CR) Clément Goubert: sre.discovery.datacenter: Add 'all' to status (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889539 (owner: Clément Goubert)
[09:22:57] (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133 (owner: Clément Goubert)
[09:24:46] (Merged) jenkins-bot: sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133 (owner: Clément Goubert)
[09:29:20] (PS10) Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - https://gerrit.wikimedia.org/r/888213
(https://phabricator.wikimedia.org/T329193)
[09:29:37] (PS6) Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539
[09:32:06] (CR) Jaime Nuche: jenkins: remove hardcoded values from sudo rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[09:33:12] (CR) Jelto: [V: +1 C: +2] gitlab: merge gitlab-restore scripts [puppet] - https://gerrit.wikimedia.org/r/889768 (https://phabricator.wikimedia.org/T326315) (owner: Jelto)
[09:34:53] (CR) Muehlenhoff: Access Requests, allow users to request more permissions (9 comments) [software/bitu] - https://gerrit.wikimedia.org/r/870747 (owner: Slyngshede)
[09:42:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:54:40] (CR) David Caro: replica_cnf_api_test: check if user with id USER_ID exists (5 comments) [puppet] - https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[09:57:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:10] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reimage for host idm2001.wikimedia.org with OS bullseye
[10:00:15] SRE, Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bullseye
[10:01:20] (CR) Muehlenhoff: SUL account linking (4 comments) [software/bitu] - https://gerrit.wikimedia.org/r/888003 (https://phabricator.wikimedia.org/T320807) (owner: Slyngshede)
[10:05:36] (CR) Arturo Borrero Gonzalez: bullseye-sssd/: add mysql client command line utility (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/889769 (https://phabricator.wikimedia.org/T320178) (owner: Arturo Borrero Gonzalez)
[10:05:46] (Abandoned) Arturo Borrero Gonzalez: bullseye-sssd/: add mysql client command line utility [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/889769 (https://phabricator.wikimedia.org/T320178) (owner: Arturo Borrero Gonzalez)
[10:08:37] (CR) Arturo Borrero Gonzalez: [C: +2] bullseye: add bzip2 and zstd compression programs (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) (owner: BryanDavis)
[10:09:02] (CR) Arturo Borrero Gonzalez: [C: +2] "Thanks!
Also using this for T320178" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: BryanDavis)
[10:09:14] (Merged) jenkins-bot: bullseye: add bzip2 and zstd compression programs [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) (owner: BryanDavis)
[10:09:36] (Merged) jenkins-bot: mariadb: new image for mariadb/mysql backups [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: BryanDavis)
[10:11:55] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm2001.wikimedia.org with reason: host reimage
[10:13:56] PROBLEM - Check systemd state on an-airflow1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@search.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:12] (CR) David Caro: replica_cnf_api_test: check if user with id USER_ID exists (3 comments) [puppet] - https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[10:14:56] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm2001.wikimedia.org with reason: host reimage
[10:16:11] (PS17) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[10:19:44] (CR) Jbond: fix(presto): create pkcs12 server file with intermediate certificate (3 comments) [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[10:23:01] (PS18) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[10:23:22] (CR) CI reject: [V: -1] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[10:24:55]
(PS19) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[10:28:07] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39705/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[10:28:10] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host idm2001.wikimedia.org with OS bullseye
[10:28:15] SRE, Infrastructure-Foundations: Initial production deployment of the IDM - https://phabricator.wikimedia.org/T320797 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bullseye completed: - idm2001 (**PASS**) - Removed from...
[10:30:34] (PS2) Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799)
[10:31:57] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:35:21] (PS1) Nicolas Fraison: resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362)
[10:35:48] (CR) CI reject: [V: -1] resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362) (owner: Nicolas Fraison)
[10:36:10] (PS5) Nicolas Fraison: fix(presto): create pkcs12 server file with intermediate certificate [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361)
[10:36:48] (PS3) Vgutierrez: varnish: Limit ESI processing to text/html pages
[puppet] - https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799)
[10:36:57] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:37:12] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[10:38:23] (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39707/console" [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[10:39:32] (CR) Nicolas Fraison: [V: +1] fix(presto): create pkcs12 server file with intermediate certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[10:40:15] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (Clement_Goubert) Open→Stalled Dry-runs and live-test stalled until resolution of {T329533}
[10:40:20] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[10:42:26] (CR) Ilias Sarantopoulos: "LGTM!"
[deployment-charts] - https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[10:43:36] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (Clement_Goubert)
[10:44:41] SRE, Data-Persistence, cloud-services-team, serviceops, and 2 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (Clement_Goubert) Open→Resolved Resolving as wikitech will not be moving, but adding {T237773} to {T328907}
[10:44:49] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (Clement_Goubert)
[10:44:59] SRE, Data-Persistence, serviceops, Datacenter-Switchover, Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert)
[10:48:19] (PS4) Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799)
[10:50:46] (PS1) Jelto: gitlab: add restore script not all hosts [puppet] - https://gerrit.wikimedia.org/r/889955 (https://phabricator.wikimedia.org/T326315)
[10:51:19] (PS1) Elukey: Add sre.k8s.wipe-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943)
[10:53:39] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[10:54:01] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39710/console" [puppet] - https://gerrit.wikimedia.org/r/889955 (https://phabricator.wikimedia.org/T326315) (owner: Jelto)
[10:54:04] (PS1) Ayounsi: Add validator classes for some objects
[software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[10:55:31] (CR) CI reject: [V: -1] Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[10:55:59] (PS20) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[10:57:46] (PS2) Elukey: Add sre.k8s.wipe-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943)
[10:58:08] (CR) Muehlenhoff: "Nice! A few comments inline" [puppet] - https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: JHathaway)
[10:58:14] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: JVM upgrades - elukey@cumin1001
[11:00:36] (PS1) Ayounsi: Netbox: activate validators [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590)
[11:01:57] (PS2) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[11:02:02] (PS21) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[11:02:05] (PS3) Elukey: Add sre.k8s.wipe-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943)
[11:02:59] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39712/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[11:03:32] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39713/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[11:03:54] (CR) Slyngshede:
[V: +2] P:IDM secrets are mapped wrong. [labs/private] - https://gerrit.wikimedia.org/r/889798 (owner: Slyngshede)
[11:03:59] (CR) Slyngshede: [V: +2 C: +2] P:IDM secrets are mapped wrong. [labs/private] - https://gerrit.wikimedia.org/r/889798 (owner: Slyngshede)
[11:04:07] (PS1) Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967)
[11:04:34] (CR) CI reject: [V: -1] service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan)
[11:09:27] (PS2) Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967)
[11:15:21] (CR) Btullis: "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362) (owner: Nicolas Fraison)
[11:15:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: JVM upgrades - elukey@cumin1001
[11:17:40] SRE, MW-on-K8s, SRE Observability, serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (Clement_Goubert) In progress→Resolved
[11:18:31] (CR) JMeybohm: "just nits really" [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943) (owner: Elukey)
[11:18:59] (PS1) Elukey: sre.k8s.upgrade-cluster: simplify code and extend downtimes [cookbooks] - https://gerrit.wikimedia.org/r/889962 (https://phabricator.wikimedia.org/T327767)
[11:19:22] (CR) Btullis: [C: +1] fix(presto): create pkcs12 server file with intermediate certificate [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner:
Nicolas Fraison)
[11:19:24] (PS1) Ayounsi: Validators: add symlink to netbox-extra [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/889963 (https://phabricator.wikimedia.org/T310590)
[11:22:37] (PS1) Majavah: P:toolforge::prometheus: add monitoring for cert-manager [puppet] - https://gerrit.wikimedia.org/r/889965
[11:23:56] (CR) Btullis: "There are two small changes on the pcc run that I don't quite understand." [puppet] - https://gerrit.wikimedia.org/r/889583 (owner: Nicolas Fraison)
[11:28:18] (PS22) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[11:29:02] (PS4) Elukey: Add sre.k8s.wipe-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943)
[11:29:04] (PS2) Elukey: sre.k8s.upgrade-cluster: simplify code and extend downtimes [cookbooks] - https://gerrit.wikimedia.org/r/889962 (https://phabricator.wikimedia.org/T327767)
[11:29:06] (CR) Elukey: Add sre.k8s.wipe-cluster.py (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943) (owner: Elukey)
[11:29:43] (CR) Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[11:29:47] (PS1) Hnowlan: admin: update platform engineering approvers [puppet] - https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244)
[11:30:24] (CR) Btullis: [C: +2] hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:33:43] (PS3) Btullis: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:35:33] (PS23) Slyngshede: P:idm configure production IDM [puppet] -
https://gerrit.wikimedia.org/r/889753
[11:35:54] (CR) CI reject: [V: -1] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[11:36:06] (Abandoned) Btullis: hadoop: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850477 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:38:07] (CR) EoghanGaffney: [C: +1] gitlab: add restore script not all hosts [puppet] - https://gerrit.wikimedia.org/r/889955 (https://phabricator.wikimedia.org/T326315) (owner: Jelto)
[11:44:16] (PS1) Slyngshede: P:IDM Move fake secrets [labs/private] - https://gerrit.wikimedia.org/r/889968
[11:50:55] (CR) JMeybohm: [V: +2 C: +2] Add sre.k8s.wipe-cluster.py (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943) (owner: Elukey)
[11:52:59] (Merged) jenkins-bot: Add sre.k8s.wipe-cluster.py [cookbooks] - https://gerrit.wikimedia.org/r/889956 (https://phabricator.wikimedia.org/T307943) (owner: Elukey)
[11:54:57] (CR) Jbond: "lgtm some minor nits,comments" [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[11:56:43] (CR) Slyngshede: [V: +2 C: +1] P:IDM Move fake secrets [labs/private] - https://gerrit.wikimedia.org/r/889968 (owner: Slyngshede)
[11:56:51] (CR) Slyngshede: [V: +2 C: +2] P:IDM Move fake secrets [labs/private] - https://gerrit.wikimedia.org/r/889968 (owner: Slyngshede)
[11:58:12] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39717/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[12:00:26] (CR) JMeybohm: [V: +1 C: +2] kubernetes: Continue to use the cergen cert for service-account signing [puppet] - https://gerrit.wikimedia.org/r/889808 (https://phabricator.wikimedia.org/T329826)
(owner: JMeybohm)
[12:01:27] (PS1) Muehlenhoff: Tweak scalability of KDC requests [puppet] - https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831)
[12:01:48] (CR) CI reject: [V: -1] Tweak scalability of KDC requests [puppet] - https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: Muehlenhoff)
[12:02:34] (CR) Nicolas Fraison: [V: +1 C: +2] fix(presto): create pkcs12 server file with intermediate certificate [puppet] - https://gerrit.wikimedia.org/r/889822 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[12:03:37] (PS24) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[12:03:39] PROBLEM - puppet last run on aux-k8s-ctrl1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:04:43] (PS2) Muehlenhoff: Tweak scalability of KDC requests [puppet] - https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831)
[12:05:35] PROBLEM - puppet last run on aux-k8s-ctrl1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:05:47] SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (Pigsonthewing) Resolved→Open The question I highlighted in...
[12:08:53] RECOVERY - puppet last run on aux-k8s-ctrl1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:08:58] (03CR) 10AikoChou: [C: 03+1] ml-services: update docker images for outlink and revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [12:09:21] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] feat(presto): export splits and thread metrics [puppet] - 10https://gerrit.wikimedia.org/r/889756 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [12:10:41] (03PS7) 10Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - 10https://gerrit.wikimedia.org/r/889539 [12:10:49] (03PS11) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [12:10:55] RECOVERY - puppet last run on aux-k8s-ctrl1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:14:07] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39718/console" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:17:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [12:17:52] (03CR) 10Slyngshede: [V: 03+1] "I'm starting to feel this is moving in hodgepodge style direction. 
Feel free to suggest a better/pretty structure" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:22:32] (03PS1) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 [12:29:43] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add restore script not all hosts [puppet] - 10https://gerrit.wikimedia.org/r/889955 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [12:30:42] (03PS1) 10Slyngshede: P:idp add IDM OIDC profile [puppet] - 10https://gerrit.wikimedia.org/r/889974 [12:30:59] (03PS1) 10JMeybohm: kubernetes: Pass the cergen service-account key to controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/889975 (https://phabricator.wikimedia.org/T329826) [12:32:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39719/console" [puppet] - 10https://gerrit.wikimedia.org/r/889975 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:32:54] (03PS1) 10Muehlenhoff: gitlab: Remove net.core.somaxconn sysctl [puppet] - 10https://gerrit.wikimedia.org/r/889976 [12:33:16] (03CR) 10CI reject: [V: 04-1] gitlab: Remove net.core.somaxconn sysctl [puppet] - 10https://gerrit.wikimedia.org/r/889976 (owner: 10Muehlenhoff) [12:34:49] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes: Pass the cergen service-account key to controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/889975 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:38:39] !log jayme@cumin1001 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster aux-eqiad: T329826 [12:38:42] (03PS1) 10Slyngshede: P:IDP Add fake secret for IDM OIDC [labs/private] - 10https://gerrit.wikimedia.org/r/889978 [12:38:44] T329826: Kubernetes v1.23 multi master setup is broken - https://phabricator.wikimedia.org/T329826 [12:39:31] (03PS3) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 
10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) [12:42:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster aux-eqiad: T329826 [12:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:46:35] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:46:37] (03PS2) 10Muehlenhoff: gitlab: Remove net.core.somaxconn sysctl [puppet] - 10https://gerrit.wikimedia.org/r/889976 [12:46:37] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:46:51] (03PS3) 10Muehlenhoff: gitlab: Remove net.core.somaxconn sysctl [puppet] - 10https://gerrit.wikimedia.org/r/889976 [12:46:56] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:46:58] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[12:46:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:47:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:50:19] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:50:21] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:50:40] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:50:45] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:50:58] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:50:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:51:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244) (owner: 10Hnowlan) [12:51:33] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:51:47] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:51:53] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[12:51:58] (KubernetesCalicoDown) firing: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:52:06] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:52:08] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:52:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:01] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:54:07] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:54:11] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:54:14] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:54:26] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:54:42] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:55:00] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:55:06] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:55:13] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:55:22] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:55:33] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:55:37] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:55:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:58] (KubernetesCalicoDown) resolved: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:56:58] (KubernetesCalicoDown) resolved: (2) aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:58:46] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [13:01:20] (03CR) 10Slyngshede: [V: 03+2 C: 03+1] P:IDP Add fake secret for IDM OIDC [labs/private] - 10https://gerrit.wikimedia.org/r/889978 (owner: 10Slyngshede) [13:01:37] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] P:IDP Add fake secret for IDM OIDC [labs/private] - 10https://gerrit.wikimedia.org/r/889978 (owner: 10Slyngshede) [13:01:49] (03PS2) 10Nicolas Fraison: resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - 10https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362) [13:02:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39721/console" [puppet] - 10https://gerrit.wikimedia.org/r/889974 (owner: 10Slyngshede) [13:02:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [13:05:56] (03PS12) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [13:07:45] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover Excluded 
services - https://phabricator.wikimedia.org/T329193 (10Clement_Goubert) [13:09:26] (03CR) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [13:11:28] (03CR) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [13:16:16] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Aklapper) 05Open→03Resolved That question does not change the... [13:19:18] (03CR) 10Muehlenhoff: "Doesn't look hodge podge at all, a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [13:21:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889974 (owner: 10Slyngshede) [13:24:57] (03PS3) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [13:25:19] (03PS4) 10Jcrespo: Preparing for release 0.1.6 [software/mediabackups] - 10https://gerrit.wikimedia.org/r/889646 (https://phabricator.wikimedia.org/T327157) [13:25:48] (03CR) 10CI reject: [V: 04-1] Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:26:06] (03CR) 10Ayounsi: "Thanks. 
I addressed all the comments and added validators/dcim/interface.py" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:26:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: 10BCornwall) [13:27:13] (03PS4) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [13:28:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889963 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:29:01] (03PS2) 10Ayounsi: Netbox: activate validators [puppet] - 10https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) [13:30:31] jouncebot: nowandnext [13:30:31] For the next 18 hour(s) and 29 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230217T0800) [13:30:31] In 18 hour(s) and 29 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230218T0800) [13:31:01] !log docker system prune on alert1001 - root fs almost full [13:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:16] (03CR) 10Jbond: "lgtm, couple of minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [13:33:19] !log pfischer@deploy1002 Started deploy [wikimedia/discovery/analytics@3a94765]: T327381: rdf-spark-tools update [13:33:23] T327381: Migrate RDF Tooling to Spark 3 - https://phabricator.wikimedia.org/T327381 [13:35:58] !log pfischer@deploy1002 Finished deploy [wikimedia/discovery/analytics@3a94765]: T327381: rdf-spark-tools update (duration: 02m 39s) [13:40:37] !log docker system prune on alert2001 - root fs almost full [13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:59] (03CR) 10Jbond: "-1 because i don't spot anything installing redis?" 
[puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [13:45:04] (03CR) 10Jbond: [C: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [13:49:01] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:52:34] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10JMeybohm) [13:52:56] (03CR) 10Muehlenhoff: Tweak scalability of KDC requests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [13:53:17] (03PS3) 10Muehlenhoff: Tweak scalability of KDC requests [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) [13:54:22] (03CR) 10David Caro: [C: 03+2] replica_cnf_api_test: check if user with id USER_ID exists (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889851 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [14:03:52] (03PS1) 10Nicolas Fraison: module_rake_tasks: load puppet to avoid uninitialized constant Puppet [puppet] - 10https://gerrit.wikimedia.org/r/889990 [14:05:19] (03CR) 10Nicolas Fraison: "It finally looks better to require puppet on our side than blocking the release of puppet-syntax" [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [14:06:20] (03PS3) 10Ayounsi: Netbox: activate validators [puppet] - 10https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) [14:12:46] (03PS25) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:13:07] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [14:13:45] (03CR) 10Nicolas Fraison: [C: 03+1] Tweak scalability of KDC 
requests [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [14:15:35] (03PS26) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:15:56] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [14:20:02] (03CR) 10Cathal Mooney: [C: 03+1] "Wow great work! Really good stuff, and gives me a really good template on how to build such rules thanks. One or two comments in line, b" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [14:21:06] (03PS1) 10Papaul: Update model for ps1-a8-codfw [puppet] - 10https://gerrit.wikimedia.org/r/889992 (https://phabricator.wikimedia.org/T327404) [14:22:36] (03CR) 10Papaul: [C: 03+2] Update model for ps1-a8-codfw [puppet] - 10https://gerrit.wikimedia.org/r/889992 (https://phabricator.wikimedia.org/T327404) (owner: 10Papaul) [14:25:33] (03PS5) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [14:25:41] (03PS27) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:26:02] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [14:26:15] (03CR) 10Elukey: k8s.wipe-cluster: Don't disable puppet or downtime etcd (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [14:26:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader2003.wikimedia.org [14:26:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:27:22] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-Y on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-Y 129 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:27:30] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-X 91 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:27:39] (03PS28) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:28:01] (03CR) 10CI reject: [V: 04-1] P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [14:28:45] (03PS2) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 [14:28:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2003.wikimedia.org - jmm@cumin2002" [14:28:55] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) [14:29:30] (03PS29) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:29:32] (03CR) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [14:30:09] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) 05Open→03Resolved Complete [14:30:26] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, I agree with explicitly requiring `puppet` rather than freezing the version." 
[puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [14:31:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2003.wikimedia.org - jmm@cumin2002" [14:31:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:31:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader2003.wikimedia.org on all recursors [14:31:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader2003.wikimedia.org on all recursors [14:31:47] (03CR) 10Elukey: [C: 03+1] k8s.wipe-cluster: Don't disable puppet or downtime etcd (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [14:34:16] (03PS3) 10JMeybohm: k8s.wipe-cluster: Don't disable puppet or downtime etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 [14:35:09] (03CR) 10Elukey: [C: 03+2] k8s.wipe-cluster: Don't disable puppet or downtime etcd (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [14:35:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:35:19] (03PS6) 10Nicolas Fraison: perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) [14:35:21] (03PS6) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [14:35:23] (03PS7) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [14:35:35] (03PS30) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [14:36:04] (03PS2) 10Nicolas Fraison: fix(presto): fix typo from node.enviroment to node.environment [puppet] - 10https://gerrit.wikimedia.org/r/889807 
[14:36:53] (03Merged) 10jenkins-bot: k8s.wipe-cluster: Don't disable puppet or downtime etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889972 (owner: 10JMeybohm) [14:37:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for db218[567] - pt1979@cumin2002" [14:37:40] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Y 119 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:38:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for db218[567] - pt1979@cumin2002" [14:38:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:16] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Z 683 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:40:34] (03CR) 10Ayounsi: Add validator classes for some objects (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [14:40:36] (03PS1) 10Nicolas Fraison: presto: add 5 nodes to the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889994 (https://phabricator.wikimedia.org/T329525) [14:40:40] (03PS1) 10Nicolas Fraison: presto: add last 5 nodes to prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889995 (https://phabricator.wikimedia.org/T329525) [14:40:44] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-X on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-X 98 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:40:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host 
urldownloader2003.wikimedia.org [14:42:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2185.mgmt.codfw.wmnet with reboot policy FORCED [14:42:16] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-Z on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-Z 200 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:42:41] (03PS3) 10Elukey: sre.k8s.upgrade-cluster: simplify code and extend downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/889962 (https://phabricator.wikimedia.org/T327767) [14:42:48] (03CR) 10Cathal Mooney: [C: 03+1] "Thanks for the explainers +1 from me." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [14:44:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [14:45:58] !log elukey@cumin1001 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster ml-staging-codfw: T327767 [14:46:02] T327767: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 [14:46:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host urldownloader2004.wikimedia.org [14:46:49] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:49:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10Jclark-ctr) [14:50:31] (03PS1) 10Elukey: sre.k8s.wipe-cluster: add extra ask_confirmation for etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889997 [14:51:23] (03PS1) 10Vivian Rook: Update dns for paws prometheus [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) [14:51:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:51:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:52:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:52:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:52:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:52:33] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:52:40] (03PS2) 10Vivian Rook: Update dns for paws prometheus [puppet] - 10https://gerrit.wikimedia.org/r/889998 (https://phabricator.wikimedia.org/T329212) [14:53:58] (KubernetesCalicoDown) firing: (2) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:54:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:54:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:54:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:55:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:55:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [14:56:05] (03CR) 10Stevemunene: [C: 03+1] "Looks good to me, but someone else must approve." [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [14:56:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [14:56:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:57:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:00] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2186.mgmt.codfw.wmnet with reboot policy FORCED [14:58:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2187.mgmt.codfw.wmnet with reboot policy FORCED [14:58:58] (KubernetesCalicoDown) resolved: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:59:31] (03PS2) 10Jbond: module_rake_tasks: load puppet to avoid uninitialized constant Puppet [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [15:00:41] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [15:01:43] (03CR) 10Nicolas Fraison: [C: 03+2] module_rake_tasks: load puppet to avoid uninitialized constant Puppet [puppet] - 10https://gerrit.wikimedia.org/r/889990 (owner: 10Nicolas Fraison) [15:02:25] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2004.wikimedia.org - jmm@cumin2002" [15:03:15] (03CR) 10Herron: [C: 04-1] "Please see comments inline and also will need a rebase." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [15:03:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM urldownloader2004.wikimedia.org - jmm@cumin2002" [15:03:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:28] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache urldownloader2004.wikimedia.org on all recursors [15:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) urldownloader2004.wikimedia.org on all recursors [15:04:30] (03CR) 10Eevans: [C: 03+1] "The unfamiliar (to me) parts of this notwithstanding (e.g. envoy config), LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:04:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2185.mgmt.codfw.wmnet with reboot policy FORCED [15:05:46] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) [15:07:00] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:07:25] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2185'] [15:09:07] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) [15:10:51] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) (owner: 10Muehlenhoff) [15:12:54] (03PS2) 10Muehlenhoff: sre.ganeti.makevm: Stop printing the dhcp config snippet [cookbooks] - 10https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) [15:13:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host urldownloader2004.wikimedia.org [15:13:26] (03CR) 10Jbond: [C: 03+1] Tweak scalability of KDC requests [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [15:16:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2187.mgmt.codfw.wmnet with reboot policy FORCED [15:16:29] (03PS5) 10Vgutierrez: varnish: Limit ESI processing to text/html pages [puppet] - 10https://gerrit.wikimedia.org/r/889530 (https://phabricator.wikimedia.org/T308799) [15:16:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2187'] [15:23:20] 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) [15:23:58] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10matmarex) >>! In T238285#8549627, @Pigsonthewing wrote: > T261624... 
[15:24:53] (03PS1) 10Muehlenhoff: Add SPDX headers to additional DE profiles [puppet] - 10https://gerrit.wikimedia.org/r/890000 (https://phabricator.wikimedia.org/T308013) [15:24:55] (03PS1) 10Muehlenhoff: openstack::base Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) [15:25:21] (03CR) 10CI reject: [V: 04-1] openstack::base Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:29:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2187'] [15:32:45] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) Case Number 2023-0217-642078 Case Type Tech Priority P2 - High Status Dispatch [15:35:25] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:35:47] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10MoritzMuehlenhoff) [15:36:01] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [15:36:23] (03PS1) 10Muehlenhoff: Add new VMs [puppet] - 10https://gerrit.wikimedia.org/r/890003 (https://phabricator.wikimedia.org/T329945) [15:37:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2185'] [15:40:15] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2187'] [15:40:45] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2187'] [15:41:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2187'] [15:42:12] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10*.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [15:42:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2187'] [15:45:28] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:47:16] (03CR) 10Muehlenhoff: [C: 03+2] Add new VMs [puppet] - 10https://gerrit.wikimedia.org/r/890003 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [15:48:02] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) [15:50:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster ml-staging-codfw: T327767 [15:50:07] T327767: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 [15:52:45] (03PS2) 10Elukey: Add istio and kserve settings for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) [15:52:47] (03PS2) 10Elukey: ml-services: update docker images for outlink and revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) [15:53:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2187.mgmt.codfw.wmnet with reboot policy FORCED [15:55:00] (03CR) 10JMeybohm: [C: 04-1] Add istio and kserve settings for ml-staging-codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:55:34] (03CR) 10JMeybohm: [C: 04-1] Add istio and kserve settings for ml-staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:00:33] (03CR) 10Elukey: Add istio and kserve settings for ml-staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:00:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2187.mgmt.codfw.wmnet with reboot policy FORCED [16:01:32] (03PS3) 10Elukey: Add istio and kserve settings for ml-staging-codfw 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) [16:01:34] (03PS3) 10Elukey: ml-services: update docker images for outlink and revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) [16:01:47] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2187'] [16:02:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2187'] [16:09:39] (03CR) 10BCornwall: [C: 03+2] varnish: Runbook and dashboard for down exporter [alerts] - 10https://gerrit.wikimedia.org/r/889887 (https://phabricator.wikimedia.org/T187708) (owner: 10BCornwall) [16:11:05] 10SRE, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10BCornwall) 05In progress→03Resolved Fixed in https://gerrit.wikimedia.org/r/c/operations/alerts/+/889887 [16:20:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2187'] [16:20:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2187'] [16:25:54] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10BCornwall) [16:26:06] 10SRE, 10Traffic, 10serviceops: Upgrade envoyproxy to 1.16.2 - https://phabricator.wikimedia.org/T271407 (10BCornwall) 05Open→03Resolved a:03BCornwall Great, thanks! 
[16:26:16] (03PS1) 10Papaul: ADd db218[567] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/890004 (https://phabricator.wikimedia.org/T326342) [16:27:05] (03PS6) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [16:27:59] (03CR) 10Papaul: [C: 03+2] ADd db218[567] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/890004 (https://phabricator.wikimedia.org/T326342) (owner: 10Papaul) [16:28:08] (03PS2) 10Papaul: ADd db218[567] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/890004 (https://phabricator.wikimedia.org/T326342) [16:28:12] (03CR) 10Papaul: [V: 03+2] ADd db218[567] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/890004 (https://phabricator.wikimedia.org/T326342) (owner: 10Papaul) [16:28:21] (03CR) 10BCornwall: [C: 03+2] trafficserver: Remove restart count icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/889881 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:31:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bullseye [16:32:21] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye [16:32:49] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889999 (https://phabricator.wikimedia.org/T306661) (owner: 10Muehlenhoff) [16:37:25] (03CR) 10BCornwall: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39723/console" [puppet] - 10https://gerrit.wikimedia.org/r/889881 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:39:02] 10SRE, 10Observability-Alerting, 10Traffic: Move (or delete?) 
trafficserver restart count alert from icinga to alerts.git - https://phabricator.wikimedia.org/T327791 (10BCornwall) 05Open→03Resolved Fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/889881 [16:40:54] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2185.codfw.wmnet with OS bullseye [16:41:00] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye executed with errors: - db2185 (**FAIL... [16:42:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2185.codfw.wmnet with OS bullseye [16:42:06] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye [16:43:31] win 3 [16:53:31] (03PS2) 10Krinkle: Added extended confirmed on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [16:53:42] (03CR) 10Krinkle: "recheck - giving CI permission to run the tests" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [16:57:24] (03CR) 10Dzahn: [C: 03+1] "I confirmed with cumin that most hosts already use 1024 with a few exceptions that have higher values. nothing has lower values." 
[puppet] - 10https://gerrit.wikimedia.org/r/889976 (owner: 10Muehlenhoff) [16:57:33] (03PS7) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [16:57:36] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review: Remove obsolete "Permissions-Policy: interest-cohort" header - https://phabricator.wikimedia.org/T312823 (10BCornwall) 05Open→03In progress [16:59:01] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [16:59:30] (03CR) 10Dzahn: [C: 03+1] Add 'rup' as alias for 'roa-rup' [puppet] - 10https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix) [16:59:50] (03CR) 10JHathaway: Purge unused kernels on boot (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [17:00:06] (03CR) 10Dzahn: [C: 03+1] Add 'nrf' as alias for 'nrm' [puppet] - 10https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix) [17:00:34] (03CR) 10Dzahn: [C: 03+1] Add 'cbk' as alias for 'cbk-zam' [puppet] - 10https://gerrit.wikimedia.org/r/527912 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [17:01:09] (03CR) 10Dzahn: [C: 03+1] Add 'bho' as alias for 'bh' [puppet] - 10https://gerrit.wikimedia.org/r/528782 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix) [17:01:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2185.codfw.wmnet with reason: host reimage [17:01:52] (03CR) 10Dzahn: [C: 03+1] Add 'egl' as alias for 'eml' [puppet] - 10https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix) [17:02:18] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:03:02] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [17:04:20] (03CR) 10Dzahn: [C: 03+1] Add redirects from 'sgs' to 'bat-smg' [puppet] - 10https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830) (owner: 10Fomafix) [17:04:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2185.codfw.wmnet with reason: host reimage [17:06:13] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957 [17:06:16] T329957: Restart Elastic services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [17:06:31] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957 [17:06:58] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957 [17:10:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2187.codfw.wmnet with OS bullseye [17:10:23] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2187.codfw.wmnet with OS bullseye [17:10:38] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T329957 [17:11:23] 10SRE, 10Wikimedia-Mailing-lists: Puppet failing on 
mailman03.mailman.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T329647 (10Dzahn) I think the admins of that project should already be getting emails about failed puppet on instances in their project. names at: https://openstack-browser.toolfor... [17:19:16] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:23:18] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10Papaul) Dear Juniper Networks Customer, Your replacement part associated with RMA R200442866 Item # 100 has been successfully shipped. [17:23:52] (03CR) 10Krinkle: [C: 04-1] "The change to the autoconfirm count is not explained or mentioned in the Dutch discussion afaik. It also appears that the discussed was ba" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [17:25:54] (03PS1) 10JMeybohm: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/890008 (https://phabricator.wikimedia.org/T329827) [17:27:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:27:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2185.codfw.wmnet with OS bullseye [17:27:39] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2185.codfw.wmnet with OS bullseye completed: - db2185 (**PASS**) - Rem... 
[17:28:12] (03CR) 10JMeybohm: [C: 03+1] sre.k8s.upgrade-cluster: simplify code and extend downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/889962 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:30:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [17:33:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [17:34:30] (03PS4) 10JHathaway: Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 [17:34:32] (03PS3) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 [17:35:15] (03CR) 10CDanis: [C: 03+1] Add jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/888276 (owner: 10JHathaway) [17:38:05] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) @Volans fyi the 3 db nodes above are R650xs just receives those. We worked already on 1 R650 in the pass. On the 650xs provision cookbook is not setting the se... 
[17:39:10] (03CR) 10Bas dehaan: Added extended confirmed on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [17:40:25] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (owner: 10JHathaway) [17:42:32] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) [17:43:01] (03CR) 10BryanDavis: [C: 03+1] "dynamicproxy changes look fine" [puppet] - 10https://gerrit.wikimedia.org/r/889892 (https://phabricator.wikimedia.org/T312823) (owner: 10BCornwall) [17:48:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:49:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:49:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2187.codfw.wmnet with OS bullseye [17:49:45] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2187.codfw.wmnet with OS bullseye completed: - db2187 (**PASS**) - Rem... 
[18:01:45] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Papaul) [18:05:32] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) p:05Triage→03High [18:06:38] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Bad power supply on cr1-codfw(PEM 0) - https://phabricator.wikimedia.org/T329943 (10Papaul) ***** RMA DETAILS ***** RMA Number: R200447858 Defective Line Item Number: 100 Defective Serial Number: Defective Product ID: MX480-PWR2520-AC-S Def... [18:18:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) @MatthewVernon Are there any racks that need to avoid. Do to the weight of these servers and space availability i would be easyier to rack 2 row E, x2 row F. I... [18:19:24] (03PS1) 10Hnowlan: api-gateway: add rest gateway configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) [18:21:06] (03CR) 10Hnowlan: "Change is a noop for the api-gateway production config except for some cleanup of spacing." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890012 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [18:23:15] 10SRE, 10Language-Team: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (10santhosh) [18:33:17] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10MW-1.40-notes (1.40.0-wmf.24; 2023-02-20): Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) 05In progress→03Resolved Thanks to rzl for deploying this. The domain... 
[18:34:12] (03PS1) 10Dzahn: site: differentiate between both serviceops teams for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/890014 [18:34:35] (03CR) 10CI reject: [V: 04-1] site: differentiate between both serviceops teams for insetup roles [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [18:35:51] (03CR) 10Dzahn: [V: 04-1] "doesnt like class names containing a dash, but I wanted to avoid "serviceops::collab" and adding another level to this." [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [18:41:26] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10BCornwall) 05In progress→03Resolved a:03BCornwall `varnish` replaces `libvarnishapi1` so it's been obsoleted in a packaging sense. Since we've re-imaged all the cp servers `libvarnishapi1` has been removed. [18:46:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10*.eqiad.wmnet: Restarting Cassandra to apply JVM 1.8.0_362 - eevans@cumin1001 [18:49:37] 10SRE, 10DNS, 10Traffic, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10BCornwall) p:05Medium→03Low [18:51:51] 10SRE, 10Language-Team: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) [18:55:08] 10SRE, 10Language-Team: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) [18:57:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:00:56] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: OpenSent - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, 
AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:02:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1016.eqiad.wmnet [19:10:04] 10SRE, 10DNS, 10Traffic, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10BCornwall) a:03BCornwall Looks like @RKemper has also contributed since this was opened. I'm going to assume this is the best place to collect agreement to the license chang... [19:10:10] 10SRE, 10DNS, 10Traffic, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10BCornwall) 05Open→03In progress [19:14:55] 10SRE, 10Traffic, 10Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (10JArguello-WMF) 05Open→03Resolved [19:18:08] (03PS1) 10BCornwall: utils: Add SPDX Apache-2.0 license to utils [dns] - 10https://gerrit.wikimedia.org/r/890016 (https://phabricator.wikimedia.org/T291323) [19:22:56] PROBLEM - Host restbase1016 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:12] RECOVERY - Host restbase1016 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:27:49] 10SRE, 10Traffic, 10serviceops: Feedback for new service IP flowchart - https://phabricator.wikimedia.org/T279296 (10BCornwall) 05Open→03Resolved a:03BCornwall Seeing as that flowchart is in use at https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines and there's not been any activity for a... 
[19:28:18] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:29:30] 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (10BCornwall) 05Open→03Resolved a:03BCornwall Confirmed via cumin that all hosts are running 9.1.4. Closing. [19:30:16] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.32 port 9042 https://phabricator.wikimedia.org/T93886 [19:30:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1016.eqiad.wmnet [19:30:54] 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (10BCornwall) [19:32:49] restbase1016 is OK; not sure why those alerts came through, they should have been disabled by the reboot cookbook [19:32:52] 10SRE, 10Traffic, 10Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10BCornwall) Looks like this still isn't rolled out based on my check on a random cp node. Still intend to roll this out? [21:03:46] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [21:07:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [21:10:01] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10colewhite) [21:54:20] 10SRE, 10Infrastructure-Foundations, 10Traffic: Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot - https://phabricator.wikimedia.org/T323717 (10BCornwall) 05Open→03Resolved a:03BCornwall Looks like this has been fixed, so I'll close. 
[22:05:32] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957 [22:05:36] T329957: Restart Elastic services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [22:06:22] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957 [22:09:36] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957 [22:10:07] Assuming ^ is related but did just hit a search error for it being "busy" or something along those lines. Resolved on a reload. [22:42:05] perryprog thanks for the heads-up, LMK if you still see any issues [22:42:29] Yup. Been good so far. [22:45:04] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T329957 [22:45:08] T329957: Restart Elastic services to pick up JRE updates - https://phabricator.wikimedia.org/T329957 [22:45:58] Cool! The cookbook completed so we should be good [22:55:30] 10SRE, 10Diff-blog, 10Technical Blog, 10Traffic, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) @BCornwall the max-age has been increased to 106384710. Could you confirm all looks good? 
[22:57:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:02:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add jaeger-{builder,query,collector} [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)