[00:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009395 [00:38:48] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009395 (owner: 10TrainBranchBot) [00:47:04] PROBLEM - Disk space on restbase1026 is CRITICAL: DISK CRITICAL - free space: /srv/sda4 67618 MB (3% inode=99%): /srv/sdc4 106521 MB (6% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops [01:01:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1009395 (owner: 10TrainBranchBot) [01:21:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 883.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:36:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: codfw mw-parsoid (k8s) 961ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:57:27] (JobUnavailable) firing: 
(2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8330063280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [02:19:26] org/wiki/Search%23Administration [02:38:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:26] RECOVERY - OpenSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 737, active_shards: 1530, relocating_shards: 4, initializing_shards: 2, unassigned_shards: 196, delayed_unassigned_s [02:44:26] , number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 680, active_shards_percent_as_number: 88.54166666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:53:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 866.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:08:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: codfw mw-parsoid (k8s) 812ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:34:00] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 06serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#9614060 (10tstarling) p:05High→03Medium [03:34:34] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 06serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#9614053 (10tstarling) 05Stalled→03Open Recalling that my sole objection to having Shellbox download files from Swift is the need to provide secrets to Shellbox which a... 
[03:47:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd232a95280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [03:47:26] org/wiki/Search%23Administration [03:51:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:13:28] RECOVERY - OpenSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 736, active_shards: 1489, relocating_shards: 2, initializing_shards: 2, unassigned_shards: 237, delayed_unassigned_shar [04:13:28] umber_of_pending_tasks: 3, number_of_in_flight_fetch: 468, task_max_waiting_in_queue_millis: 1088, active_shards_percent_as_number: 86.16898148148148 https://wikitech.wikimedia.org/wiki/Search%23Administration [04:19:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc0eb8f6280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [04:19:26] org/wiki/Search%23Administration [04:25:26] RECOVERY - OpenSearch health check 
for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 737, active_shards: 1490, relocating_shards: 3, initializing_shards: 1, unassigned_shards: 236, delayed_unassigned_s [04:25:26] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.27678054429647 https://wikitech.wikimedia.org/wiki/Search%23Administration [04:35:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f94eb7eb280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [04:35:26] org/wiki/Search%23Administration [04:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:43:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[05:02:06] 06SRE, 10SRE-swift-storage, 06Traffic-Icebox, 07Wikimedia-Performance-recommendation, 07affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217#9614183 (10tstarling) >>! In T256217#9194798, @MatthewVernon wrote: > I can confirm that we're running a ne... [05:07:04] RECOVERY - Disk space on restbase1026 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1026&var-datasource=eqiad+prometheus/ops [05:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:57:28] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:09:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 923.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:19:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: codfw mw-parsoid (k8s) 954.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:22:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 
826.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:27:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: codfw mw-parsoid (k8s) 802.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:27:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2151', diff saved to https://phabricator.wikimedia.org/P58663 and previous config saved to /var/cache/conftool/dbconfig/20240308-062741-root.json [06:28:37] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2151.codfw.wmnet onto db2124.codfw.wmnet [06:30:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 910.3ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:31:36] (03PS1) 10Marostegui: db2124: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1009638 (https://phabricator.wikimedia.org/T359597) [06:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:33:21] (03CR) 10Marostegui: [C: 03+2] db2124: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1009638 (https://phabricator.wikimedia.org/T359597) (owner: 10Marostegui) [06:35:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: codfw mw-parsoid (k8s) 899.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240308T0700) [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:06:32] (03PS1) 10Marostegui: Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1009343 [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:16:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2151.codfw.wmnet onto db2124.codfw.wmnet [07:18:51] (03CR) 10Marostegui: [C: 03+2] Revert "db2124: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1009343 (owner: 10Marostegui) [07:19:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P58664 and previous config saved to 
/var/cache/conftool/dbconfig/20240308-071913-root.json [07:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P58665 and previous config saved to /var/cache/conftool/dbconfig/20240308-071924-root.json [07:34:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P58666 and previous config saved to /var/cache/conftool/dbconfig/20240308-073419-root.json [07:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P58667 and previous config saved to /var/cache/conftool/dbconfig/20240308-073429-root.json [07:49:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P58668 and previous config saved to /var/cache/conftool/dbconfig/20240308-074924-root.json [07:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P58669 and previous config saved to /var/cache/conftool/dbconfig/20240308-074934-root.json [07:50:14] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9614295 (10tstarling) The tempauth expiry time is 7 days. MW considers the token to be expired after 7.5 minutes of caching, but Swift just gives it... [07:52:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240308T0800) [08:03:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1614/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009558 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [08:04:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P58670 and previous config saved to /var/cache/conftool/dbconfig/20240308-080429-root.json [08:04:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P58671 and previous config saved to /var/cache/conftool/dbconfig/20240308-080439-root.json [08:04:52] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Remove Tomcat spec tests [puppet] - 10https://gerrit.wikimedia.org/r/1009558 (https://phabricator.wikimedia.org/T359333) (owner: 10Muehlenhoff) [08:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P58672 and previous config saved to /var/cache/conftool/dbconfig/20240308-081934-root.json [08:19:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P58673 and previous config saved to /var/cache/conftool/dbconfig/20240308-081944-root.json [08:21:29] (03PS1) 10Slyngshede: P:idp Use Tomcat9 build for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) [08:21:55] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[08:27:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1216.eqiad.wmnet with reason: Silence for upgrade [08:27:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1216.eqiad.wmnet with reason: Silence for upgrade [08:29:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1615/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [08:31:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1616/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [08:32:31] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1617/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [08:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P58674 and previous config saved to /var/cache/conftool/dbconfig/20240308-083439-root.json [08:34:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P58675 and previous config saved to /var/cache/conftool/dbconfig/20240308-083449-root.json [08:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:31] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 25%: Temporary repool for the weekend', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240308-084026-arnaudb.json [08:41:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58676 and previous config saved to /var/cache/conftool/dbconfig/20240308-084105-arnaudb.json [08:41:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 25%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58677 and previous config saved to /var/cache/conftool/dbconfig/20240308-084149-arnaudb.json [08:43:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:55] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P58678 and previous config saved to /var/cache/conftool/dbconfig/20240308-084944-root.json [08:49:56] (03PS1) 10KartikMistry: Update cxserver to 2024-03-08-084626-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009712 (https://phabricator.wikimedia.org/T359525) [08:50:29] Doing emergency deployment (staging only) for cxserver.. 
[08:51:21] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2024-03-08-084626-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009712 (https://phabricator.wikimedia.org/T359525) (owner: 10KartikMistry) [08:52:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:52:30] (03Merged) 10jenkins-bot: Update cxserver to 2024-03-08-084626-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009712 (https://phabricator.wikimedia.org/T359525) (owner: 10KartikMistry) [08:53:18] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:53:43] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:55:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 50%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58679 and previous config saved to /var/cache/conftool/dbconfig/20240308-085536-arnaudb.json [08:56:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58680 and previous config saved to /var/cache/conftool/dbconfig/20240308-085610-arnaudb.json [08:56:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 50%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58681 and previous config saved to /var/cache/conftool/dbconfig/20240308-085654-arnaudb.json [09:05:33] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1618/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) 
[09:07:44] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:08:10] OK. Things are stable in staging, going ahead. [09:08:38] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:09:10] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:09:47] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idp Use Tomcat9 build for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009709 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:09:49] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:10:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58682 and previous config saved to /var/cache/conftool/dbconfig/20240308-091041-arnaudb.json [09:10:47] !log Updated cxserver to 2024-03-08-084626-production (T359525) [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] T359525: MinT: Translation with MinT/Apertium are failing: fetch failed - https://phabricator.wikimedia.org/T359525 [09:11:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58683 and previous config saved to /var/cache/conftool/dbconfig/20240308-091115-arnaudb.json [09:12:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58684 and previous config saved to /var/cache/conftool/dbconfig/20240308-091159-arnaudb.json [09:13:03] (03CR) 10DCausse: [C: 03+1] cirrus: Check backfill status prior to reindexing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008895 (owner: 10Ebernhardson) [09:16:29] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: nova: convert cloudvirt2001-dev to OVS agent [puppet] - 
10https://gerrit.wikimedia.org/r/1009511 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [09:17:43] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt2001-dev.codfw.wmnet with OS bookworm [09:25:02] (03PS1) 10Majavah: hieradata: update striker to 2024-03-08-085857-production [puppet] - 10https://gerrit.wikimedia.org/r/1009713 [09:25:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58685 and previous config saved to /var/cache/conftool/dbconfig/20240308-092546-arnaudb.json [09:26:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58686 and previous config saved to /var/cache/conftool/dbconfig/20240308-092621-arnaudb.json [09:26:58] (03CR) 10Majavah: [C: 03+2] hieradata: update striker to 2024-03-08-085857-production [puppet] - 10https://gerrit.wikimedia.org/r/1009713 (owner: 10Majavah) [09:27:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: Temporary repool for the weekend', diff saved to https://phabricator.wikimedia.org/P58687 and previous config saved to /var/cache/conftool/dbconfig/20240308-092705-arnaudb.json [09:27:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:38:26] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [09:38:47] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@9bf7445] (releasing): (no justification provided) [09:39:27] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@9bf7445] (releasing): (no justification provided) 
(duration: 00m 40s) [09:40:58] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2001-dev.codfw.wmnet with reason: host reimage [09:42:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:49:19] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [09:49:26] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [10:08:27] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2001-dev.codfw.wmnet with OS bookworm [10:17:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:29] (03PS1) 10Phuedx: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) [10:27:13] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:29:21] !log jnuche@deploy2002 Installing scap version "4.70.1" for 374 hosts [10:30:02] !log fabfur@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp4037.ulsfo.wmnet with reason: T358109 [10:30:08] !log jnuche@deploy2002 Installation of scap version "4.70.1" completed for 374 hosts [10:30:09] T358109: Install new Benthos instance on cp hosts - 
https://phabricator.wikimedia.org/T358109 [10:30:17] !log fabfur@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp4037.ulsfo.wmnet with reason: T358109 [10:30:55] (03PS1) 10Ilias Sarantopoulos: ml-services: update ores-legacy image (fix boolean/str fields) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009720 (https://phabricator.wikimedia.org/T358953) [10:43:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:15] (03PS2) 10Phuedx: ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) [10:52:20] RECOVERY - Check nf_conntrack usage in neutron netns on cloudnet2007-dev is OK: OK: everything is apparently fine https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:01:22] (03CR) 10JMeybohm: "Yes." [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [11:03:56] (03CR) 10JMeybohm: "And "profile::dragonfly::dfdaemon::ensure: present" ofc... 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [11:16:05] (03CR) 10MVernon: "If you're happy with this now, could I get a +1 please? 
I'm not going to merge this today (it being a Friday and just before the SRE Summi" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1009494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [11:19:39] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:21:44] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet2008-dev is CRITICAL: CRITICAL: no netns defined? https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:29:39] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:31:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:38:16] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1009298 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [11:38:54] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [11:39:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:41:21] (03CR) 10Kevin Bazira: 
[C: 03+1] ml-services: update ores-legacy image (fix boolean/str fields) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009720 (https://phabricator.wikimedia.org/T358953) (owner: 10Ilias Sarantopoulos) [11:42:00] (03PS1) 10Fabfur: benthos: fixe metadata field [puppet] - 10https://gerrit.wikimedia.org/r/1009722 (https://phabricator.wikimedia.org/T358109) [11:44:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:48:27] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update ores-legacy image (fix boolean/str fields) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009720 (https://phabricator.wikimedia.org/T358953) (owner: 10Ilias Sarantopoulos) [11:49:35] (03Merged) 10jenkins-bot: ml-services: update ores-legacy image (fix boolean/str fields) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009720 (https://phabricator.wikimedia.org/T358953) (owner: 10Ilias Sarantopoulos) [11:54:08] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1619/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009722 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [12:31:26] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:36:26] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in eqiad - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:24] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - No response from remote host 185.15.58.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:44:16] (03PS1) 10Fabfur: benthos/haproxy: fix parsing for possible missing headers [puppet] - 10https://gerrit.wikimedia.org/r/1009724 (https://phabricator.wikimedia.org/T358109) [13:44:52] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) [13:45:09] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) [13:46:13] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009724 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:46:41] (03PS2) 10Fabfur: benthos/haproxy: fix parsing for possible missing headers [puppet] - 10https://gerrit.wikimedia.org/r/1009724 (https://phabricator.wikimedia.org/T358109) 
[13:47:57] (03CR) 10Majavah: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [13:48:13] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) [13:49:05] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009724 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:50:02] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9614966 (10cmooney) @ayounsi finally got back to this for a closer look. Really great work, I tried to make a device-centric dashboard... [13:50:34] 06SRE, 10Release Pipeline, 06serviceops, 10Release-Engineering-Team (Radar), and 2 others: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604#9614971 (10akosiaris) [13:51:06] 06SRE, 06serviceops, 07Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762#9614973 (10akosiaris) @JMeybohm is there anything left to dohere? I think we can resolve. [13:53:59] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9615011 (10jnuche) >>! In T358237#9611992, @Dzahn wrote: > zuul has now succesfully been deployed to this machine by @jn... [13:54:11] (03CR) 10Cathal Mooney: "LGTM! Really nice stuff, great examples of how to use nftables well!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [13:55:07] (03PS4) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) [13:55:40] (03PS5) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) [13:55:56] (03CR) 10Arturo Borrero Gonzalez: aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [13:56:32] 06SRE, 10ops-codfw: install (2) 1.92TB SSDs from decom into prometheus200[56] - https://phabricator.wikimedia.org/T359631 (10RobH) 03NEW p:05Triage→03Medium [13:57:28] 06SRE, 10ops-eqiad, 10procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632 (10RobH) 03NEW p:05Triage→03Medium [13:57:36] 06SRE, 10ops-eqiad, 10procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632#9615088 (10RobH) [13:58:29] (03CR) 10Majavah: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [13:58:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add new bookworm thirdparty/kubeadm-k8s-1-24 component [puppet] - 10https://gerrit.wikimedia.org/r/1009725 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [13:59:09] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s-1-24: don't use UDebComponents field [puppet] - 10https://gerrit.wikimedia.org/r/1009754 (https://phabricator.wikimedia.org/T359619) [13:59:29] 06SRE, 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 
10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9615106 (10dcaro) Notes from today's (more productive) meeting: About the number in specific, they got from the har... [13:59:53] (03CR) 10Majavah: [C: 03+1] aptrepo: thirdparty/kubeadm-k8s-1-24: don't use UDebComponents field [puppet] - 10https://gerrit.wikimedia.org/r/1009754 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [14:00:01] (03CR) 10CI reject: [V: 04-1] aptrepo: thirdparty/kubeadm-k8s-1-24: don't use UDebComponents field [puppet] - 10https://gerrit.wikimedia.org/r/1009754 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [14:00:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:17] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: thirdparty/kubeadm-k8s-1-24: don't use UDebComponents field [puppet] - 10https://gerrit.wikimedia.org/r/1009754 (https://phabricator.wikimedia.org/T359619) [14:01:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:25] !log update deb packages on bookworm thirdparty/kubeadm-k8s-1-24 for T359619 (apt1002) [14:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] T359619: toolforge: prepare deb packages for k8s 1.24 - https://phabricator.wikimedia.org/T359619 [14:01:42] (03PS2) 10Elukey: role::ml_k8s::staging::worker: add Dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) [14:02:22] (03PS1) 10Slyngshede: Inform users that their email address needs to be unique. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1009757 [14:03:08] (03PS1) 10Elukey: Add Docker secret for Dragonfly cache to ML K8s staging [labs/private] - 10https://gerrit.wikimedia.org/r/1009758 (https://phabricator.wikimedia.org/T359416) [14:03:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: thirdparty/kubeadm-k8s-1-24: don't use UDebComponents field [puppet] - 10https://gerrit.wikimedia.org/r/1009754 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [14:03:36] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add Docker secret for Dragonfly cache to ML K8s staging [labs/private] - 10https://gerrit.wikimedia.org/r/1009758 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:04:33] (03CR) 10Elukey: "Thanks! I am not 100% sure about the regex but I added the info that you suggested :)" [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:04:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1623/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:06:59] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: thirdpary/kubeadm-k8s.io-1-24: fix key and filter [puppet] - 10https://gerrit.wikimedia.org/r/1009760 (https://phabricator.wikimedia.org/T359619) [14:07:09] (03CR) 10Majavah: [C: 03+1] aptrepo: thirdpary/kubeadm-k8s.io-1-24: fix key and filter [puppet] - 10https://gerrit.wikimedia.org/r/1009760 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [14:07:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: thirdpary/kubeadm-k8s.io-1-24: fix key and filter [puppet] - 10https://gerrit.wikimedia.org/r/1009760 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [14:08:20] (03PS6) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) 
[puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [14:08:28] (03PS6) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [14:08:44] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9615146 (10cmooney) @fgiunchedi wondering if you'd any thoughts on the above suggestion to allow more series through from the gnmic pipe... [14:09:24] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [14:11:42] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1041.eqiad.wmnet with reason: Bootstrapping — T354560 [14:11:48] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [14:11:56] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1041.eqiad.wmnet with reason: Bootstrapping — T354560 [14:14:59] (03PS1) 10Eevans: New restbase hosts [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1009761 (https://phabricator.wikimedia.org/T354560) [14:16:01] (03CR) 10Eevans: [V: 03+2 C: 03+2] New restbase hosts [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1009761 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [14:16:50] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@c200e79]: Deploying to updated target list — T354560 [14:16:55] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [14:17:26] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@c200e79]: Deploying to updated 
target list — T354560 (duration: 00m 36s) [14:20:03] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@910b77d]: Deploying to updated target list — T354560 [14:20:38] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@910b77d]: Deploying to updated target list — T354560 (duration: 00m 35s) [14:21:17] 06SRE, 06serviceops, 07Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762#9615231 (10JMeybohm) > * Requests for the catalog are not cached > ** `curl -I -XGET 'https://docker-registry.wikimedia.org/v2/_catalog` catalog is now cached. > * Requests for tag... [14:22:03] 06SRE, 06serviceops, 07Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762#9615232 (10JMeybohm) p:05Medium→03Low [14:26:29] (03CR) 10JMeybohm: [C: 03+1] role::ml_k8s::staging::worker: add Dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:27:28] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:58] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::staging::worker: add Dragonfly [puppet] - 10https://gerrit.wikimedia.org/r/1009548 (https://phabricator.wikimedia.org/T359416) (owner: 10Elukey) [14:33:20] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[14:37:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:42] (03CR) 10Alexandros Kosiaris: "Do we need to remove that chart too? The commit message implies this might be used in the future, not that it is retired." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [14:41:21] (03CR) 10Jforrester: [C: 04-1] "Re-creating the chart in six months' time or a year is probably less work than have SRE keep it updated without any deployment to test it?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [14:44:12] (03PS15) 10Arnaudb: mysqld-exporter-config: simplify manual runs [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) [14:44:14] (03CR) 10Arnaudb: "I was going in the wrong direction ! 
I think this is a bit closer to your original idea 😊" [puppet] - 10https://gerrit.wikimedia.org/r/984232 (https://phabricator.wikimedia.org/T327384) (owner: 10Arnaudb) [14:44:39] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:47:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:39] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:07:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 just to signal SRE have some prep work to do before this is merged, but otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [15:09:13] 06SRE, 10ops-eqiad, 10procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632#9615326 (10Jclark-ctr) @RobH i have plenty of 1.92tb ssd i have pulled 10x ssd and will put a few away as spares 
if needed later [15:16:40] 06SRE, 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9615336 (10Jclark-ctr) @dcaro thanks for the notes much more productive meeting. although nothing popped out for... [15:17:58] (03CR) 10Jforrester: [C: 04-1] "Yeah, DNS entries etc.? I captured some things in https://phabricator.wikimedia.org/T345274 but I probably missed aspects." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009538 (https://phabricator.wikimedia.org/T345274) (owner: 10Jforrester) [15:19:16] (03PS1) 10Fabfur: hiera: temporary disable haproxy logging to benthos for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1009769 (https://phabricator.wikimedia.org/T358109) [15:20:43] (03PS2) 10Fabfur: hiera: temporary disable haproxy logging to benthos for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1009769 (https://phabricator.wikimedia.org/T358109) [15:24:59] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009769 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:26:43] (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: temporary disable haproxy logging to benthos for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1009769 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:27:12] (03PS10) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [15:27:14] (03PS10) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [15:27:19] (03PS1) 10Andrew Bogott: puppetserver: link ca dir to /srv if ssldir_on_srv [puppet] - 10https://gerrit.wikimedia.org/r/1009770 
(https://phabricator.wikimedia.org/T276327) [15:28:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [15:28:46] !log fabfur@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp4037.ulsfo.wmnet [15:28:47] !log fabfur@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp4037.ulsfo.wmnet [15:32:09] !log repooling cp4037 for this weekend, all log-format changes are reverted (T351117) [15:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:18] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 [15:32:32] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [15:37:02] (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: link ca dir to /srv if ssldir_on_srv [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [15:37:13] (03PS2) 10Andrew Bogott: puppetserver: link ca dir to /srv if ssldir_on_srv [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) [15:42:08] (03PS11) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [15:42:10] (03PS3) 10Andrew Bogott: puppetserver: link ca dir to /srv if ssldir_on_srv [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) [15:44:27] (03PS55) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [15:53:10] (03PS4) 10MdsShakil: Add `suppressredirect` right to pagemover and filemover user groups in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009729 (https://phabricator.wikimedia.org/T359614) [15:57:07] (03CR) 10Andrew Bogott: 
"recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [15:59:44] (03PS4) 10Andrew Bogott: puppetserver: link ca dir to /srv if ssldir_on_srv [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) [15:59:52] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1009770 (https://phabricator.wikimedia.org/T276327) (owner: 10Andrew Bogott) [16:08:29] (03CR) 10Filippo Giunchedi: "From my POV it can be either a warning or discarded and re-instated when/if the need arise, HTH" [alerts] - 10https://gerrit.wikimedia.org/r/1008590 (owner: 10Tim Starling) [16:10:03] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops Remove new LDAP hosts from Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1009501 (https://phabricator.wikimedia.org/T359524) (owner: 10Slyngshede) [16:14:37] (03PS1) 10Dzahn: apache_exporter: fix argument syntax in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) [16:14:57] (03PS2) 10Dzahn: prometheus/apache_exporter: fix argument syntax in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) [16:15:40] (03CR) 10CI reject: [V: 04-1] prometheus/apache_exporter: fix argument syntax in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [16:16:20] (03CR) 10Filippo Giunchedi: "LGTM though I'm curious as to why we don't see the same behavior in other usages of ensure_packages() across the code base (or do we?)" [puppet] - 10https://gerrit.wikimedia.org/r/1009716 (owner: 10Majavah) [16:17:15] (03PS3) 10Dzahn: prometheus/apache_exporter: fix argument syntax in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) [16:17:39] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance 
cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:18:42] (03CR) 10Dzahn: [C: 04-1] "wait, no, also needs the = to be inserted.. hmm.." [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [16:18:44] (03CR) 10Majavah: "We do and I've been fixing these as I spot them during reimages, see eg https://gerrit.wikimedia.org/r/c/operations/puppet/+/971160 from l" [puppet] - 10https://gerrit.wikimedia.org/r/1009716 (owner: 10Majavah) [16:21:08] (03CR) 10Dzahn: "nevermind, it works either way:" [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) (owner: 10Dzahn) [16:23:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ok! thank you for clarifying" [puppet] - 10https://gerrit.wikimedia.org/r/1009716 (owner: 10Majavah) [16:24:33] 06SRE, 10SRE-Access-Requests: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - https://phabricator.wikimedia.org/T359645 (10bdgreenlee) 03NEW [16:26:50] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9615636 (10fgiunchedi) Yeah having some ballpark numbers will be a great help @cmooney, unless we're talking hundreds of thousands more... 
[16:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:25] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:55] 06SRE, 10SRE-Access-Requests: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - https://phabricator.wikimedia.org/T359645#9615668 (10odimitrijevic) Approved! 
[16:52:55] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:59] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:01:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:25:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:28:37] 06SRE, 10ops-eqiad, 10procurement: install (2) 1.92TB SSDs from decom into prometheus100[56] - https://phabricator.wikimedia.org/T359632#9615815 (10RobH) Awesome, post-offsite please coordinate with @fgiunchedi on when to install these. As the systems are hot swap, it shouldn't cause an issue, but I'd clea... 
[17:37:39] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:43:55] 06SRE, 10ops-codfw: install (2) 1.92TB SSDs from decom into prometheus200[56] - https://phabricator.wikimedia.org/T359631#9615863 (10Jhancock.wm) I have at least sixteen of these drives. 10 of them still have carriers attached. I can install these at any time. [17:54:37] !Log Running `foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep 1 --verbose 2>&1 | tee ~/scan-files-in-scan-table-group2-sleep-1-no-render-now.txt` on a tmux session [17:55:00] (03PS1) 10Ahmon Dancy: mw-xml.sh: Update maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T99268) [17:55:08] !log Running `foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --use-jobqueue --sleep 1 --verbose 2>&1 | tee ~/scan-files-in-scan-table-group2-sleep-1-no-render-now.txt` on a tmux session [17:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:31] (03CR) 10Majavah: [C: 03+2] prometheus: ethtool_exporter: use package{} directly [puppet] - 10https://gerrit.wikimedia.org/r/1009716 (owner: 10Majavah) [17:58:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [18:01:49] PROBLEM - Host durum1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:59] PROBLEM - Host eventlog1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:07] PROBLEM - 
Host an-airflow1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:07] PROBLEM - Host ml-serve-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:15] PROBLEM - Host cumin1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:33] PROBLEM - Host people1004 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:03] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:03:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:03:05] PROBLEM - ganeti-noded running on ganeti1033 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [18:03:24] (ProbeDown) firing: Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:07] (ProbeDown) firing: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:14] uhm [18:04:25] !log Stopped scan on group 2 wiki (test complete) [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:53] did we just lose a switch or something [18:06:14] the BFD status alerts are for durum1001, which does anycast, so that's why [18:06:16] no, I think just ganeti1033 but I'm looking [18:06:47] hmm I can ssh there [18:06:50] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:07:13] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:08:12] I see DRBD hangs in the console [18:08:24] (ProbeDown) firing: (2) Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:54] Mar 08 18:01:10 ganeti1033 kernel: block drbd2: We did not send a P_BARRIER for 42520ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked? [18:08:54] Mar 08 18:01:11 ganeti1033 kernel: drbd resource1: meta connection shut down by peer. [18:09:01] Mar 08 18:04:25 ganeti1033 kernel: INFO: task md2_raid5:562 blocked for more than 120 seconds. 
[18:09:33] cdanis: my best guess would be a disk issue on ganeti1033 [18:11:55] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:37] I'm going to try to migrate VMs away from ganeti1033 [18:12:40] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:13:20] !log ✔ cdanis@ganeti1027.eqiad.wmnet ~ 🕐☕ sudo gnt-node migrate -f ganeti1033.eqiad.wmnet [18:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:42] seems reasonable [18:14:02] if that gets stuck, I'll probably just forcibly reboot ganeti1033 [18:16:54] heh [18:17:00] `drbdsetup status` hangs too [18:17:59] !log forcibly rebooting ganeti1033 [18:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:08] !log ❌cdanis@ganeti1027.eqiad.wmnet ~ 🕜☕ sudo gnt-node failover -f ganeti1033.eqiad.wmnet [18:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:09] PROBLEM - SSH on ganeti1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:22:00] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:22:40] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1002:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:23:23] PROBLEM - Host ganeti1033 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:45] RECOVERY - Host ganeti1033 is UP: PING WARNING - Packet loss = 80%, RTA = 0.20 ms [18:34:05] RECOVERY - SSH on ganeti1033 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:34:16] I had to login to the mgmt interface and `serveraction hardreset` to get it to come back up [18:34:17] RECOVERY - ganeti-noded running on ganeti1033 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [18:34:18] nothing in the SEL [18:34:34] mdstat is showing md2 as read-only now [18:35:29] RECOVERY - Host people1004 is UP: PING OK - Packet loss = 0%, RTA = 5.68 ms [18:35:33] RECOVERY - Host an-airflow1004 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [18:35:41] RECOVERY - Host cumin1002 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [18:35:41] RECOVERY - Host eventlog1003 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [18:35:43] RECOVERY - Host ml-serve-ctrl1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [18:36:19] RECOVERY - Host durum1001 is UP: PING OK - Packet loss = 0%, RTA = 7.14 ms [18:37:13] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:38:11] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:38:13] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:38:24] (ProbeDown) resolved: (2) Service ml-serve-ctrl1002:6443 has failed probes (http_ml_serve_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:28] (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [18:38:33] !log ✔ cdanis@ganeti1027.eqiad.wmnet ~ 🕜☕ sudo gnt-node migrate -f ganeti1033.eqiad.wmnet [18:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:07] taavi: I think it became r/w? [18:39:07] (ProbeDown) resolved: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:23] looks like yes [18:39:32] initially, it said 'md2 : active (auto-read-only) raid5 sdb3[2] sdd3[3] sdc3[1] sda3[0]' [18:40:50] ah [18:41:03] I think that just means the device was assembled but hadn't gotten its first write yet [18:41:50] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:43:03] aha, might be [18:43:11] are you still draining the host? 
[18:43:31] that finished, it just moved the 'primary' away from that host for all the VMs [18:43:54] the DRBD secondaries for all those nodes are still on ganeti1033 [18:44:05] I think that's a fine state for the weekend [18:44:58] this could be either a once-in-a-million kernel bug, or a hardware issue with that host (or both I guess) [18:48:57] taavi: herron: think we should do anything else? [18:50:01] (03PS2) 10Superpes15: [itwiki] Set 'wgBlockAllowsUTEdit' to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 [18:50:54] (03CR) 10Zoranzoki21: [itwiki] Set 'wgBlockAllowsUTEdit' to true (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [18:51:19] !log arm cumin_master keyholder key on cumin1002 after ganeti1033 froze and rebooted [18:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:41] cdanis: it might be a good idea to send a heads-up to people who might have had long-running processes on cumin1002 [18:51:49] yeah good call [18:51:55] I'll send an email with some details [18:52:11] thanks for the help taavi :) [18:52:57] the homer key on cumin1002 also needs arming, but only netops has access to that key in pwstore it seems [18:53:19] nothing to add here, although also not 100% clear on the cause [18:55:23] (03PS3) 10Superpes15: [itwiki] Set 'wgBlockAllowsUTEdit' to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 [18:55:46] (03CR) 10Superpes15: [itwiki] Set 'wgBlockAllowsUTEdit' to true (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [18:58:31] heh ok now I think I'm clear https://phabricator.wikimedia.org/P58694 [19:04:15] heh [19:04:44] (03CR) 10Zoranzoki21: [C: 03+1] [itwiki] Set 'wgBlockAllowsUTEdit' to true (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [19:10:38] (03CR) 10Urbanecm: [C: 03+1] [itwiki] Set 'wgBlockAllowsUTEdit' to 
true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009731 (owner: 10Superpes15) [19:12:46] (03PS1) 10FNegri: [wmcs-backup] WIP: Add dummy test [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) [19:16:48] (03CR) 10CI reject: [V: 04-1] [wmcs-backup] WIP: Add dummy test [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) (owner: 10FNegri) [19:19:52] (03PS2) 10FNegri: [wmcs-backup] WIP: Add dummy test [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) [19:21:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:53] (03CR) 10CI reject: [V: 04-1] [wmcs-backup] WIP: Add dummy test [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) (owner: 10FNegri) [19:37:02] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9616215 (10cmooney) >>! In T326322#9615636, @fgiunchedi wrote: > Yeah having some ballpark numbers will be a great help @cmooney, unless... 
[19:40:51] RECOVERY - Host ripe-atlas-ulsfo is UP: PING WARNING - Packet loss = 77%, RTA = 30.48 ms [19:42:20] (03PS1) 10Cathal Mooney: Fix error when removing an interface's bridge membership [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009789 (https://phabricator.wikimedia.org/T359629) [19:44:34] (03PS2) 10Cathal Mooney: Fix error when removing an interface's bridge membership [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009789 (https://phabricator.wikimedia.org/T359629) [19:47:13] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [19:47:13] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [19:47:15] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [19:47:15] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [19:47:19] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Ncredir [19:47:54] taavi, cdanis: homer's key re-armed [19:48:13] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 346006 seconds left:Certificate wikipedia.com valid until 2024-04-05 02:10:51 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:48:13] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 392506 seconds left:Certificate wikimedia.is 
valid until 2024-04-11 10:06:15 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:48:26] * volans heading back off [19:48:39] thank you volans [19:49:15] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 329505 seconds left:Certificate wikipedia.fi valid until 2024-05-03 08:30:14 +0000 (expires in 55 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:49:15] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 214424 seconds left:Certificate *.wikispecies.net valid until 2024-05-25 08:20:38 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:49:19] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 391600 seconds left:Certificate *.wikipedia.bg valid until 2024-04-13 06:06:54 +0000 (expires in 35 days) https://wikitech.wikimedia.org/wiki/Ncredir [19:51:47] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:52:27] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [19:52:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:53:28] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [19:54:27] (03PS1) 10Jdlrobson: Exclude non-functional pages from night mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) [19:56:23] (03PS3) 10FNegri: [wmcs-backup] WIP: Add dummy test [puppet] - 10https://gerrit.wikimedia.org/r/1009787 (https://phabricator.wikimedia.org/T359192) 
[19:57:02] (03PS2) 10Jdlrobson: Exclude non-functional pages from night mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) [20:01:24] (03PS4) 10Dzahn: prometheus/apache_exporter: fix argument syntax in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1009775 (https://phabricator.wikimedia.org/T359556) [20:08:01] (03PS1) 10Dzahn: Revert "planet: add prometheus apache exporter to role" [puppet] - 10https://gerrit.wikimedia.org/r/1009732 [20:09:46] (03CR) 10Dzahn: [C: 03+2] Revert "planet: add prometheus apache exporter to role" [puppet] - 10https://gerrit.wikimedia.org/r/1009732 (owner: 10Dzahn) [20:28:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:28:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:36:29] (03PS11) 10Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455) [20:36:31] (03PS12) 10Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455) [20:36:33] (03PS1) 10Andrew Bogott: git-sync-upstream: on puppet7, deploy code after update [puppet] - 10https://gerrit.wikimedia.org/r/1009798 (https://phabricator.wikimedia.org/T351450) [20:36:39] (03PS1) 10Andrew Bogott: git-sync-upstream.py: run through black [puppet] - 10https://gerrit.wikimedia.org/r/1009799 [20:37:10] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:38:38] (03CR) 10Kimberly Sarabia: [C: 03+1] "LGTM but hoping someone else from data eng can also review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009718 (https://phabricator.wikimedia.org/T352342) (owner: 10Phuedx) [20:46:47] !log planet1003/2003: apt-get remove prometheus-apache-exporter - T359596 [20:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:51] T359596: SystemdUnitFailed (planet and gitlab) - https://phabricator.wikimedia.org/T359596 [20:53:10] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:57:15] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:57:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:01:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:01:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:02:59] (03CR) 10Krinkle: mw-xml.sh: Update maintenance script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T99268) (owner: 10Ahmon Dancy) [21:03:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:03:36] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:05:33] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:05:40] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:08:11] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:08:18] 
!log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:16:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:16:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:21:16] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:21:23] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:23:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:23:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:25:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:25:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:28:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:28:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:29:39] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fix" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009789 (https://phabricator.wikimedia.org/T359629) (owner: 10Cathal Mooney) [21:30:29] (03PS2) 10Ahmon Dancy: mw-xml.sh: Update maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T99268) [21:31:12] (03CR) 10Ahmon Dancy: mw-xml.sh: Update maintenance script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009784 (https://phabricator.wikimedia.org/T99268) (owner: 10Ahmon Dancy) [21:31:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:31:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:10:25] (03PS2) 10Majavah: P:puppetserver: git: use creates for initial deploy-code [puppet] - 
10https://gerrit.wikimedia.org/r/1007396 [22:14:01] (03PS3) 10Majavah: P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 [22:14:03] (03PS1) 10Majavah: P:puppetserver: git: mark /srv/git as safe [puppet] - 10https://gerrit.wikimedia.org/r/1009805 [22:15:20] (03CR) 10CI reject: [V: 04-1] P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [22:15:23] (03CR) 10CI reject: [V: 04-1] P:puppetserver: git: mark /srv/git as safe [puppet] - 10https://gerrit.wikimedia.org/r/1009805 (owner: 10Majavah) [22:16:53] (03PS2) 10Majavah: P:puppetserver: git: mark /srv/git as safe [puppet] - 10https://gerrit.wikimedia.org/r/1009805 [22:16:55] (03PS4) 10Majavah: P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 [22:18:49] (03CR) 10Majavah: P:puppetserver: git: use creates for initial deploy-code (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [22:21:27] (03CR) 10CI reject: [V: 04-1] P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [22:21:34] (03PS5) 10Majavah: P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 [22:22:01] (03CR) 10Majavah: P:puppetserver: git: use creates for initial deploy-code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [22:22:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:23:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [22:37:28] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:52:55] (SystemdUnitFailed) resolved: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:56:25] (SystemdUnitFailed) firing: rsync-aptrepo-apt2001.wikimedia.org.service on apt1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:41] 06SRE, 10SRE-Access-Requests: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - https://phabricator.wikimedia.org/T359645#9616731 (10BTullis) a:03BTullis [23:17:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:17:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:26:42] (03PS1) 10Btullis: Add SSH and kerberos access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) [23:28:51] (03CR) 10Btullis: [C: 04-1] "Awaiting double-check of SSH key via additional channel. Setting -1 until then." 
[puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) (owner: 10Btullis) [23:32:30] 06SRE, 10SRE-Access-Requests: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - https://phabricator.wikimedia.org/T359645#9616793 (10BTullis) I have created the patch to add an SSH key and record the kerberos status. Set the patch to -1 until I have verified... [23:33:09] (03PS2) 10Btullis: Add SSH and kerberos access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) [23:33:46] (03CR) 10Btullis: Add SSH and kerberos access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) (owner: 10Btullis) [23:34:37] (03CR) 10Btullis: [C: 04-1] "Verified via: https://office.wikimedia.org/wiki/User:BGreenlee-WMF" [puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) (owner: 10Btullis) [23:35:05] (03CR) 10Btullis: [C: 03+2] Add SSH and kerberos access for bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009810 (https://phabricator.wikimedia.org/T359417) (owner: 10Btullis) [23:39:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - https://phabricator.wikimedia.org/T359645#9616799 (10BTullis) 05Open→03Resolved [23:43:45] (03PS3) 10Jdlrobson: Exclude non-functional pages from night mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) [23:43:56] (03PS4) 10Jdlrobson: Exclude non-functional pages from night mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009790 (https://phabricator.wikimedia.org/T359183) [23:44:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting ssh & kerberos access to analytics-privatedata-users (with ssh & kerberos) for bdgreenlee - 
https://phabricator.wikimedia.org/T359645#9616817 (10BTullis) I accidentally used the wrong email address for the kerberos principal creation. Deleted it a...