[00:05:40] (PS1) Dzahn: phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889)
[00:06:03] (CR) CI reject: [V:-1] phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[00:09:57] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842
[00:09:57] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842 (owner: TrainBranchBot)
[00:10:11] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732069 (Quiddity) Thanks for the drafts, both! I will add this to Tech News tomorrow, **pending your confirmation** on the wording-tweaks I've made,...
[00:11:03] (PS2) Dzahn: phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889)
[00:20:38] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:27:10] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2060 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:27:10] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2060 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:27:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2060-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:28:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:30:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:35:45] (CR) Dzahn: [V:-1 C:-1] "https://puppet-compiler.wmflabs.org/output/1135841/5260/phab1005.eqiad.wmnet/change.phab1005.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[00:48:53] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:57:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2060-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:09:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10732154 (phaultfinder)
[01:10:31] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842 (owner: TrainBranchBot)
[01:11:16] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:16:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3333 MB (3% inode=98%): /tmp 3333 MB (3% inode=98%): /var/tmp 3333 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[01:30:02] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:40:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:40:20] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:40:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[01:42:39] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2072-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:43:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[02:02:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:07:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2072-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:22:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:28:53] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:21] SRE, Wikimedia-Mailing-lists: Postorius (held and) reported full headers get mangled somewhere in the system - https://phabricator.wikimedia.org/T309492#10732233 (Aklapper) @grin: Could you please answer the last comment? Thanks in advance!
[03:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:07] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[03:47:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:49:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:58:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:04:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:16:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:20:28] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:35:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:36:52] PROBLEM - Check unit status of sync-puppet-volatile on puppetmaster2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:40:30] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:44:39] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391654 (phaultfinder) NEW
[04:46:52] RECOVERY - Check unit status of sync-puppet-volatile on puppetmaster2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:50:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:53] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:22:26] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:30:02] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:35:47] (PS1) Bartosz Dziewoński: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850
[05:35:48] (PS1) Bartosz Dziewoński: Clean up obsolete SUL3 settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1135851
[05:36:44] (CR) Bartosz Dziewoński: "I was surprised to find it wasn't already enabled. I think we just forgot. Is there any reason not to do it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850 (owner: Bartosz Dziewoński)
[05:40:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:43:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[05:57:42] (PS1) Ayounsi: Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560)
[05:58:53] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:13] (PS2) Ayounsi: Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0600)
[06:00:10] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[06:07:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[06:27:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[06:58:38] (PS1) Marostegui: check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854
[06:59:05] (CR) CI reject: [V:-1] check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0700)
[07:02:43] (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[07:17:36] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:19:03] (PS1) Marostegui: db1256: Remove note [puppet] - https://gerrit.wikimedia.org/r/1135856
[07:19:58] (CR) Marostegui: [C:+2] db1256: Remove note [puppet] - https://gerrit.wikimedia.org/r/1135856 (owner: Marostegui)
[07:27:44] PROBLEM - MegaRAID on an-worker1135 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:30:46] (CR) Btullis: [C:+1] airflow-test-k8s: increase the reserved resources for the airflow-test-k8s scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135724 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[07:31:21] (CR) Brouberol: [C:+2] airflow-test-k8s: increase the reserved resources for the airflow-test-k8s scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135724 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[07:35:42] (PS4) Jelto: gitlab: rename thanos object storage parameters [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922)
[07:43:32] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:27] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:44:48] (PS5) Jelto: gitlab: rename thanos object storage parameters [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922)
[07:44:48] (CR) DCausse: [C:+1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: Peter Fischer)
[07:45:02] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[07:45:06] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:46:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:50:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:50:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:51:26] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5261/console" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:52:03] (PS1) Brouberol: airflow-test-k8s: increase the limit ranges to support deploying a bigger scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135891 (https://phabricator.wikimedia.org/T391556)
[07:52:14] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:32] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:08] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:00:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:04:35] (CR) Cathal Mooney: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[08:10:58] (CR) MVernon: [C:+1] "None of the tested hosts have object storage credentials in the PCC output, which I think is expected at this point? Since they don't curr" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:13:44] (PS2) Federico Ceratto: pool.py: In dry-run mode do not monitor connection drain [cookbooks] - https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577)
[08:16:49] (CR) Federico Ceratto: "Applied the change and tested it in dry-run." [cookbooks] - https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: Federico Ceratto)
[08:18:28] (CR) Ayounsi: [C:+2] Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[08:20:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:40:22] (CR) Brouberol: [C:+2] airflow-test-k8s: increase the limit ranges to support deploying a bigger scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135891 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[08:43:44] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10732602 (ayounsi) Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthers the row capacity imbalance.
[08:44:17] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1169.eqiad.wmnet with OS bullseye
[08:44:30] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-worker1169...
[08:44:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye
[08:44:59] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732604 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-worker...
[08:47:17] (CR) Jelto: [V:+1 C:+2] "yes all gitlab hosts have object storage disabled. I'll enable it for one of the replicas soon." [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:47:38] (CR) Jelto: [V:+1 C:+2] gitlab: rename thanos object storage parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:48:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:50:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:34] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:55:54] ops-eqiad, SRE, DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10732617 (ayounsi) Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthers the row capacity imbalance.
[08:56:49] (PS2) Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975)
[08:57:08] (PS1) Btullis: Prep an-worker1169 for reimage [puppet] - https://gerrit.wikimedia.org/r/1135899 (https://phabricator.wikimedia.org/T390169)
[08:57:50] (PS1) Jelto: gitlab: fix Unknown variable: 'object_storage_access_key' [puppet] - https://gerrit.wikimedia.org/r/1135900 (https://phabricator.wikimedia.org/T378922)
[08:59:49] (CR) Btullis: [C:+2] Prep an-worker1169 for reimage [puppet] - https://gerrit.wikimedia.org/r/1135899 (https://phabricator.wikimedia.org/T390169) (owner: Btullis)
[09:00:20] (CR) Jelto: [C:+2] gitlab: fix Unknown variable: 'object_storage_access_key' [puppet] - https://gerrit.wikimedia.org/r/1135900 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[09:00:59] (CR) Filippo Giunchedi: [C:+1] "That's fair yeah, ok let's go with the _filter in metric name as you suggested" [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:02:42] (PS3) Filippo Giunchedi: perf/real_user_monitoring: add rec rules [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:02:51] (CR) Filippo Giunchedi: [C:+1] "See latest PS" [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:03:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:05:28] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1169.eqiad.wmnet with OS bullseye
[09:05:43] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732627 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002...
[09:05:48] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye
[09:05:59] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732630 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1...
[09:14:09] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:15:32] (PS2) Clément Goubert: mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539)
[09:15:49] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:16:22] (CR) Hnowlan: [C:+1] mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:20:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:20:37] (PS3) Btullis: Temporarily exclude mediawikiwiki from the dumps [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:20:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:20:41] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:20:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:20:49] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[09:21:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:22:03] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732673 (Ladsgroup) >>! In T355914#10732069, @Quiddity wrote: > Thanks for the drafts, both! I will add this to Tech News tomorrow, **pending your con...
[09:22:25] (CR) Clément Goubert: [C:+2] mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:24:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[09:25:04] (CR) Tiziano Fogli: [C:+2] perf/real_user_monitoring: add rec rules [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:25:24] (CR) Btullis: [C:+2] Temporarily exclude mediawikiwiki from the dumps [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:27:51] a
[09:27:58] sometimes b too
[09:28:07] it do b like this
[09:28:21] fr
[09:28:45] SRE, SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10732694 (MatthewVernon)
[09:32:06] (CR) Clément Goubert: [C:+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[09:32:14] SRE, SRE-swift-storage, Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10732719 (MatthewVernon)
[09:33:25] (CR) Clément Goubert: Add namespace for zarcillo (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[09:35:02] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:37:28] (PS7) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746
[09:37:52] (CR) CI reject: [V:-1] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:38:53] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:54] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:38:56] (PS8) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746
[09:40:02] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:12] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5263/co" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:40:26] (CR) Elukey: "Thanks for the review Keith!" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:40:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:43:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:46:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1169.eqiad.wmnet with OS bullseye
[09:47:07] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-worker1169...
[09:47:29] (03PS1) 10Slyngshede: IDP-Test: 7.0.10.1 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1135909 [09:48:06] (03PS1) 10Elukey: services: update proton's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135910 [09:48:40] (03PS3) 10Fabfur: haproxy: start staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 [09:48:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:49:28] (03PS4) 10Fabfur: haproxy: start staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 [09:50:10] (03CR) 10Gergő Tisza: [C:03+1] "Yeah, we just forgot." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński) [09:51:00] (03CR) 10Slyngshede: [C:03+2] IDP-Test: 7.0.10.1 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1135909 (owner: 10Slyngshede) [09:51:02] !log slyngshede@dns1004 START - running authdns-update [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (owner: 10Fabfur) [09:53:23] !log slyngshede@dns1004 END - running authdns-update [09:53:30] PROBLEM - Disk space on idp-test2005 is CRITICAL: DISK CRITICAL - free space: / 534MiB (1% inode=95%): /tmp 534MiB (1% inode=95%): /var/tmp 534MiB (1% inode=95%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=idp-test2005&var-datasource=codfw+prometheus/ops [09:53:35] !log slyngshede@dns1004 START - running authdns-update [09:56:01] !log slyngshede@dns1004 END - running authdns-update [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:19] (03CR) 10Vgutierrez: haproxy: start staticize haproxy acls into template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (owner: 10Fabfur) [09:59:53] (03PS1) 10Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) [10:00:02] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:02:03] (03CR) 10CI reject: [V:04-1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [10:03:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:05:10] (03PS1) 10Slyngshede: IDP: CAS 7.0.10.1 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1135913 [10:06:44] (03CR) 
10Ayounsi: [C:03+1] IDP: CAS 7.0.10.1 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1135913 (owner: 10Slyngshede) [10:06:59] (03CR) 10Slyngshede: [C:03+2] IDP: CAS 7.0.10.1 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1135913 (owner: 10Slyngshede) [10:07:18] !log slyngshede@dns1004 START - running authdns-update [10:07:25] !log slyngshede@dns1004 START - running authdns-update [10:09:51] !log slyngshede@dns1004 END - running authdns-update [10:10:19] (03PS1) 10Btullis: Revert "Temporarily exclude mediawikiwiki from the dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1135915 [10:12:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [10:14:10] (03CR) 10Btullis: [C:03+2] Revert "Temporarily exclude mediawikiwiki from the dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1135915 (owner: 10Btullis) [10:15:13] (03PS1) 10Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) [10:16:44] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:17:21] (03CR) 10CI reject: [V:04-1] mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:18:22] (03PS2) 10Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) [10:19:53] (03PS1) 10Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917 [10:20:30] (03PS1) 10Btullis: Temporarily add mediawikiwiki to the skip list for dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839) [10:21:22] (03CR) 10Btullis: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5264/co" [puppet] - 10https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839) (owner: 10Btullis) [10:22:16] (03CR) 10CI reject: [V:04-1] logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917 (owner: 10Filippo Giunchedi) [10:22:37] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1169.eqiad.wmnet [10:23:54] (03PS1) 10Jelto: gitlab: enable object storage on one of the replicas: [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) [10:25:42] (03PS5) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) [10:27:26] (03PS2) 10Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) [10:27:47] (03CR) 10Btullis: [V:03+1 C:03+2] Temporarily add mediawikiwiki to the skip list for dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839) (owner: 10Btullis) [10:27:54] (03CR) 10CI reject: [V:04-1] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [10:27:59] (03PS1) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670) [10:28:51] (03CR) 10Tiziano Fogli: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1135917 (owner: 10Filippo Giunchedi) [10:28:53] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:29:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1169.eqiad.wmnet [10:30:02] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:30:14] (03PS2) 10Jelto: gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) [10:30:26] (03CR) 10CI reject: [V:04-1] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [10:31:25] (03PS1) 10Btullis: Revert "Temporarily put an-worker1169 back into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1135921 [10:31:57] (03PS1) 10Clément Goubert: php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) [10:32:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [10:32:41] (03Abandoned) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [10:33:24] (03CR) 10Btullis: [C:03+2] Revert "Temporarily put an-worker1169 back into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1135921 (owner: 10Btullis) [10:34:00] (03CR) 10Hnowlan: [C:03+1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 
(https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [10:37:08] (03CR) 10Jgiannelos: [C:03+1] services: update proton's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135910 (owner: 10Elukey) [10:37:31] (03PS3) 10Jelto: gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) [10:39:13] (03PS1) 10Btullis: Put an-worker1169 back into service [puppet] - 10https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169) [10:39:31] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:40:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:45] (03PS4) 10Jelto: gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) [10:42:53] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:43:27] (03PS2) 10Btullis: Put an-worker1169 back into service and exclude group 3 [puppet] - 10https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169) [10:45:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on 
k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:59] (03PS3) 10Btullis: Put an-worker1169 back into service and exclude group 3 [puppet] - 10https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169) [10:46:48] (03CR) 10Btullis: [C:03+2] Put an-worker1169 back into service and exclude group 3 [puppet] - 10https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [10:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:51:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732936 (10BTullis) [10:51:51] (03CR) 10Clément Goubert: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:54:09] (03PS3) 10Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) [10:54:23] (03CR) 10Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:54:46] (03CR) 10MVernon: [C:03+1] "LGTM :-)" [puppet] - 
10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0700) [11:00:05] jelto, arnoldokoth, and mutante: gettimeofday() says it's time for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T1100) [11:01:40] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [11:01:59] (03PS6) 10Giuseppe Lavagetto: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:07:13] (03PS7) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) [11:07:20] (03PS2) 10Clément Goubert: mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) [11:08:36] (03CR) 10Fabfur: haproxy: staticize haproxy acls into template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:09:24] (03CR) 10CI reject: [V:04-1] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:12:09] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:12:23] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:12:33] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:13:09] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:13:33] (03CR) 10Vgutierrez: haproxy: staticize haproxy acls into template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:16:09] PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:24:23] RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:25:09] (03PS1) 10Btullis: Prep for new druid hosts to be installed [puppet] - 10https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132) [11:27:23] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:29:17] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single 
for host an-worker1169.eqiad.wmnet [11:29:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733007 (10ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD re... [11:31:33] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:34:33] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:35:07] (03PS8) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) [11:36:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1169.eqiad.wmnet [11:37:18] (03CR) 10CI reject: [V:04-1] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:37:19] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:37:20] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:37:20] ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:38:35] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132) (owner: 10Btullis) [11:39:12] (03PS9) 10Fabfur: haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) [11:39:17] (03CR) 10Btullis: [C:03+2] Prep for new druid hosts to be installed [puppet] - 10https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132) (owner: 10Btullis) [11:40:14] (03PS5) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [11:40:43] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733032 (10BTullis) [11:41:45] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [11:42:00] (03PS2) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [11:42:09] RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:43:04] (03CR) 10CI reject: [V:04-1] airflow: cleanup deployment charts [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [11:43:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733038 (10BTullis) [11:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:02] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [11:45:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733039 (10BTullis) 05Open→03Resolved >>! In T390169#10724690, @ayounsi wrote: > To properly move the server yo... 
[11:45:20] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733044 (10BTullis) [11:45:32] (03PS1) 10Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) [11:48:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733046 (10BTullis) [11:50:35] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733050 (10BTullis) [11:51:14] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733052 (10BTullis) Moving to the milestone, as we have a new column for tracking tasks like this. 
[11:51:21] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733053 (10BTullis) a:05BTullis→03None [11:52:59] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733057 (10BTullis) [11:55:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733065 (10Stevemunene) a:03Stevemunene [11:55:22] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733064 (10Silvan_WMDE) I believe this must have been an infrastructure issue which hasn't occured any mor... [11:56:14] (03CR) 10Hnowlan: [C:03+1] mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert) [11:57:08] (03PS6) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 [12:00:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: refresh all tests in the new IPv6-enabled networks [puppet] - 10https://gerrit.wikimedia.org/r/1135942 (https://phabricator.wikimedia.org/T391325) [12:02:00] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: refresh all tests in the new IPv6-enabled networks [puppet] - 10https://gerrit.wikimedia.org/r/1135942 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [12:05:07] (03PS2) 10Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917 [12:05:13] (03PS4) 10Federico Ceratto: Add namespace for zarcillo 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) [12:05:39] (03CR) 10Federico Ceratto: Add namespace for zarcillo (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [12:08:47] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325) [12:10:52] !log bounce thanos-query thanos-query-frontend thanos-store on titan1* [12:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325) [12:17:20] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [12:20:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:33] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:20:46] (03PS4) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [12:20:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:21:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 
53799 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 14.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:27:33] (03CR) 10CI reject: [V:04-1] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:29:30] (03PS5) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [12:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 14.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:33:30] (03CR) 10Arnaudb: [C:03+1] gitlab: enable object storage on one of the replicas [puppet] - 
https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[12:34:25] (PS6) Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:34:54] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: fix typos in networktests [puppet] - https://gerrit.wikimedia.org/r/1135947 (https://phabricator.wikimedia.org/T391325)
[12:36:00] (PS7) Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:36:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:37:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Switch&var-bgp_neighbor=ssw2-a8-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:37:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:38:10] (PS3) Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - https://gerrit.wikimedia.org/r/1135917
[12:38:10] (PS1) Filippo Giunchedi: grafana: set max_source_resolution=auto for thanos ds [puppet]
- https://gerrit.wikimedia.org/r/1135948 (https://phabricator.wikimedia.org/T390215)
[12:38:20] (PS8) Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:38:24] (CR) Fabfur: haproxy: staticize haproxy acls into template (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[12:39:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733153 (phaultfinder)
[12:40:59] (CR) Federico Ceratto: "Apologies, I misread the code. I added the downtime now." [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[12:41:43] (CR) Tiziano Fogli: [C:+1] "LGTM, tests/knative_activator.yaml will serve as a valid test." [puppet] - https://gerrit.wikimedia.org/r/1135917 (owner: Filippo Giunchedi)
[12:42:45] (PS2) Filippo Giunchedi: grafana: set max_source_resolution=auto for thanos ds [puppet] - https://gerrit.wikimedia.org/r/1135948 (https://phabricator.wikimedia.org/T371102)
[12:43:16] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: codfw1dev: fix typos in networktests [puppet] - https://gerrit.wikimedia.org/r/1135947 (https://phabricator.wikimedia.org/T391325) (owner: Arturo Borrero Gonzalez)
[12:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:48:19] (CR) Vgutierrez: [C:+1] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827
(https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[12:50:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:05] (Abandoned) Ssingh: package_builder: add packages for nginx build [puppet] - https://gerrit.wikimedia.org/r/1135731 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[12:58:07] (PS1) Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: refresh floating VM IP address [puppet] - https://gerrit.wikimedia.org/r/1135949 (https://phabricator.wikimedia.org/T391325)
[12:59:22] (PS1) Hashar: Upgrade and pin flake8 [software] - https://gerrit.wikimedia.org/r/1135950
[13:00:20] (CR) Fabfur: haproxy: staticize haproxy acls into template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[13:03:47] (PS2) Hashar: tox: upgrade and pin flake8 [software] - https://gerrit.wikimedia.org/r/1135950
[13:03:47] (PS1) Hashar: tox: use flake8's extend-exclude [software] - https://gerrit.wikimedia.org/r/1135951
[13:04:46] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733207 (BTullis) a:BTullis→Jclark-ctr >>! In T387142#10727875, @Jclark-ctr wrote: > @btullis handing over to you for updating puppet repo. also to verify that 10...
[13:05:28] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733210 (BTullis) a:BTullis→Jclark-ctr Done. Thanks @Jclark-ctr .
[13:06:32] (CR) Marostegui: [C:+2] "Per our IRC chat" [software] - https://gerrit.wikimedia.org/r/1135951 (owner: Hashar)
[13:06:44] (CR) Marostegui: [C:+2] "Per our IRC chat" [software] - https://gerrit.wikimedia.org/r/1135950 (owner: Hashar)
[13:07:29] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: codfw1dev: networktests: refresh floating VM IP address [puppet] - https://gerrit.wikimedia.org/r/1135949 (https://phabricator.wikimedia.org/T391325) (owner: Arturo Borrero Gonzalez)
[13:07:45] (Merged) jenkins-bot: tox: upgrade and pin flake8 [software] - https://gerrit.wikimedia.org/r/1135950 (owner: Hashar)
[13:07:50] (Merged) jenkins-bot: tox: use flake8's extend-exclude [software] - https://gerrit.wikimedia.org/r/1135951 (owner: Hashar)
[13:08:25] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10733238 (Gehel) Configuration tracked in T391680
[13:08:32] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10733240 (Gehel)
[13:08:46] (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[13:08:55] SRE, SRE Observability, Data-Platform-SRE (2025.03.22 - 2025.04.11): dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794#10733242 (Gehel)
[13:09:41] (CR) Marostegui: [C:+2] check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[13:10:12] (Merged) jenkins-bot: check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[13:12:49] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6)
- https://phabricator.wikimedia.org/T390169#10733254 (Gehel)
[13:13:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:42] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10733274 (Gehel)
[13:13:47] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10733278 (Gehel)
[13:14:10] ops-eqiad, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733297 (Gehel)
[13:16:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:16:21] SRE, Data-Engineering, Data-Engineering-Radar, Infrastructure-Foundations, Data-Platform-SRE (2025-04-12 - 2025-05-02): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10733355 (Gehel)
[13:16:27] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733359 (Gehel)
[13:16:33] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10733361 (Gehel)
[13:16:39] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:16:39] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10733365 (Gehel)
[13:16:46] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10733363 (Gehel)
[13:16:52] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10733367 (Gehel)
[13:16:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10733369 (Gehel)
[13:17:04] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10733373 (Gehel)
[13:17:14] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10733371 (Gehel)
[13:17:20] SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10733376 (Gehel)
[13:17:26] ops-codfw, ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10733381 (Gehel)
[13:17:36] sre-alert-triage, Data-Platform-SRE (2025-04-12 - 2025-05-02): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) -
https://phabricator.wikimedia.org/T382714#10733391 (Gehel)
[13:17:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:17:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:18:02] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Discovery-Search (2025.03.22 - 2025.04.11): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733395 (Gehel)
[13:22:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Change weight for db1180 T390510', diff saved to https://phabricator.wikimedia.org/P74901 and previous config saved to /var/cache/conftool/dbconfig/20250411-132518-marostegui.json
[13:25:22] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510
[13:27:21] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733525 (Gehel)
[13:29:22] (PS6) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[13:33:46] !log reprepro -C component/nginx-ech include bookworm-wikimedia openssl_3.4.1-1+ech2_amd64.changes: T205378
[13:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:50] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[13:40:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:43:26] (CR) Hashar: "recheck after having enabled the debian-glue job: https://gerrit.wikimedia.org/r/c/integration/config/+/1135728" [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:43:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[13:44:35] (CR) CI reject: [V:-1] Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:46:43] (CR) Hashar: "From the build console:" [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:47:00] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[13:47:40] (CR) Ssingh: "Yes, thanks, I am still figuring this out and did a gitlab build which worked so will take it from there.
I may abandon this as well but w" [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:49:31] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 4.276 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:49:41] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:53:09] (PS1) Hashar: ci: add eatmydata to bookworm cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[13:54:46] (CR) Herron: [C:+1] "LGTM!
👍" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[14:00:02] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:07] (PS2) Clément Goubert: hiera: Add zarcillo k8s service on traffic server [puppet] - https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[14:00:33] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:00:37] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[14:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:04:03] (PS2) Ssingh: Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378)
[14:04:19] sre-alert-triage, Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465#10733691 (isarantopoulos) I've deleted 30GB from my home directory. @klausman are there any quick wins to clean up disk space for now? I think purging the h...
[14:05:06] (CR) CI reject: [V:-1] Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[14:05:26] ^^^ on it (ml-lab1001)
[14:05:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1006.eqiad.wmnet with OS bullseye
[14:05:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bullseye
[14:05:56] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733692 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1006.eqiad.wmnet with OS bullseye
[14:05:56] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733693 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye
[14:06:46] sre-alert-triage, Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465#10733710 (klausman) >>! In T391465#10733690, @isarantopoulos wrote: > I've deleted 30GB from my home directory. > @klausman are there any quick wins to clean...
[14:07:51] (CR) Hashar: "I have updated both instances cowbuilder image using:" [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[14:12:43] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[14:17:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:19:59] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733737 (Bugreporter) >last 10 newly created Wikidata Properties Note the issue are only reported in ite...
[14:21:08] SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10733743 (Bugreporter)
[14:37:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:38:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:43:31] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2114.codfw.wmnet with OS bullseye
[14:43:36] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2114
[14:43:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2114
[14:47:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw
(10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:48:38] (PS2) Bking: sre.elasticsearch.rolling-operation: don't use http for dhcp for reimage [cookbooks] - https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[14:49:54] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on releases2003.codfw.wmnet with reason: Bookworm Re-image
[14:52:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:53:28] !log reprepro -C component/nginx-ech remove bookworm-wikimedia libssl3t64: removing libssl3t* since we dropped support for 64-bit time
[14:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:04] (PS1) Bking: cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610)
[14:55:33] (CR) Bking: [C:-1] "Do not merge until the row D non-masters are finished re-imaging."
[puppet] - https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[14:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:56:46] (CR) Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[14:56:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2142']
[14:57:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-worker2142']
[14:57:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:58:08] (CR) Scott French: [C:+1] "Thanks for pre-filling these and replacing the CRLFs!"
[puppet] - https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[14:59:41] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[15:00:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:01:21] RECOVERY - Host wikikube-worker2142 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[15:01:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:02:22] Hello 2142 :D
[15:02:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:03:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[15:03:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:04:06] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733816 (Jhancock.wm) Open→Resolved a:Papaul→Jhancock.wm @Clement_Goubert arrived and replaced. ran provisioning cookbook and it pings now. L...
[15:04:43] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733819 (Clement_Goubert) Thanks for the resuscitation!
[15:05:08] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage
[15:05:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:06:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:08:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage
[15:08:20] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[15:08:53] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:10:58] (CR) Scott French: [C:+1] mw-cron: Add statsd-exporter release [deployment-charts] - https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: Clément Goubert)
[15:12:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:12:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:13:42] (CR) Scott French: [C:+1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[15:13:47] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:19:42] !log homer lsw1-c2-codfw* commit T391341
[15:19:45] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:45] T391341: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341
[15:19:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2142.codfw.wmnet
[15:19:58] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker2142.codfw.wmnet
[15:20:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:21:15] (CR) Scott French: [C:+1] "Looks good! One question:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[15:21:50] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733842 (Jhancock.wm) Open→Resolved a:Papaul→Jhancock.wm...
[15:22:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:24] (CR) Clément Goubert: "I wanted to do that in a later patch, to make a possible revert smaller to review, but I can do it in this one if you prefer."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[15:22:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1006.eqiad.wmnet with OS bullseye
[15:22:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:22:40] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733858 (Jhancock.wm) @bking
[15:22:41] ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1006.eqiad.wmnet with OS bullseye completed: - an-druid1006...
[15:23:04] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2142.codfw.wmnet
[15:23:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2142.codfw.wmnet
[15:23:10] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733862 (ops-monitoring-bot) pool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: None
[15:23:16] ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733863 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker2142.codfw.wmnet completed...
[15:23:17] (PS6) Federico Ceratto: Add zarcillo k8s service [deployment-charts] - https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:23:28] !log reprepro -C component/nginx-ech include bookworm-wikimedia openssl_3.4.1-1+ech3_amd64.changes: T205378
[15:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:31] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[15:23:41] (PS7) Federico Ceratto: Add zarcillo k8s service [deployment-charts] - https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:23:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2114.codfw.wmnet with OS bullseye
[15:24:45] (CR) Scott French: [C:+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[15:24:53] (PS5) Federico Ceratto: Add namespace for zarcillo [deployment-charts] - https://gerrit.wikimedia.org/r/1135696
(https://phabricator.wikimedia.org/T384212) [15:24:53] (03PS8) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) [15:25:21] (03PS5) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [15:26:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-druid1007.eqiad.wmnet with OS bullseye [15:26:11] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye executed with errors: - an... [15:26:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host druid1012.eqiad.wmnet with OS bullseye [15:26:35] (03CR) 10Scott French: [C:03+1] "Sounds good, and no strong preference on my end. Was mainly asking because I thought I might be missing a lingering use case." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [15:26:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host druid1012.eqiad.wmnet with OS bullseye [15:27:00] (03PS6) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) [15:27:00] (03PS9) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) [15:27:18] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733871 (10Jclark-ctr) [15:27:57] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733874 (10Jclark-ctr) [15:28:35] (03CR) 10Clément Goubert: [C:03+1] Add namespace for zarcillo (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733875 (10phaultfinder) [15:30:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:30:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host druid1013.eqiad.wmnet with OS bullseye [15:30:50] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733878 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host druid1013.eqiad.wmnet with OS bullseye [15:31:20] (03PS6) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [15:31:30] (03PS1) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) [15:33:03] (03PS2) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) [15:33:34] (03PS7) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [15:33:40] (03CR) 10Clément Goubert: [C:03+1] "Adding @akosiaris@wikimedia.org to make sure I didn't miss something." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:35:02] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:06] (03PS1) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) [15:37:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) (owner: 10Ladsgroup) [15:37:24] !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech1_amd64.changes: T205378 [15:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:28] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 
[15:38:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage [15:38:15] (03CR) 10CI reject: [V:04-1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan) [15:40:06] (03PS1) 10Clément Goubert: growthexperiments: Disable updatementeedata on s6 [puppet] - 10https://gerrit.wikimedia.org/r/1135988 [15:40:31] (03CR) 10CI reject: [V:04-1] growthexperiments: Disable updatementeedata on s6 [puppet] - 10https://gerrit.wikimedia.org/r/1135988 (owner: 10Clément Goubert) [15:41:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage [15:41:47] (03PS3) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) [15:42:06] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage [15:43:39] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:02] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:45:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage [15:47:55] !log cmooney@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [15:47:59] (03PS4) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [15:48:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [15:48:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [15:48:46] (03PS2) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:49:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2072 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1353, active_shards: 4184, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [15:49:25] 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.97610513739545 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:49:29] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2072 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1678, active_shards: 5031, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, 
delayed_unassigned_shards: 0, number_of [15:49:29] _tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 31140, active_shards_percent_as_number: 99.96026226902444 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:50:09] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2060 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1678, active_shards: 5031, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of [15:50:09] _tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.96026226902444 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:52:43] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:53:19] (03CR) 10Clément Goubert: [C:03+1] "cc'ing people from growth so they're aware" [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) (owner: 10Ladsgroup) [15:54:11] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2060 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1353, active_shards: 4184, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [15:54:11] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 941, active_shards_percent_as_number: 99.97610513739545 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:54:23] (03CR) 10Ladsgroup: [C:03+2] mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) 
(owner: 10Ladsgroup) [15:54:23] (03CR) 10Vgutierrez: [C:04-1] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:54:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733982 (10phaultfinder) [15:55:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:27] FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:55:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:56:43] (03CR) 10STran: [C:03+1] CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae) [16:00:27] RESOLVED: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:01:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:02:47] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10733999 (10MatthewVernon) Tagging @MoritzMuehlenhoff who is clinician next week, for information. 
[16:08:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:08:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1012.eqiad.wmnet with OS bullseye [16:08:57] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host druid1012.eqiad.wmnet with OS bullseye completed: - druid1012 (**PASS*... [16:09:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:09:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1013.eqiad.wmnet with OS bullseye [16:09:07] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host druid1013.eqiad.wmnet with OS bullseye completed: - druid1013 (**WARN*... 
[16:11:48] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007 [16:11:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1007 [16:12:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Switch&var-bgp_neighbor=ssw2-a8-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:13:08] (03PS8) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [16:14:05] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10734021 (10Eevans) [16:17:40] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10734029 (10Eevans) [16:20:33] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:21:56] (03CR) 10Andrew Bogott: [C:03+2] tcpircbot: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135762 (owner: 10Andrew Bogott) [16:21:58] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2085 to cirrussearch2085 [16:22:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:23:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch 
instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:23:42] (03CR) 10Majavah: [C:03+1] wmcs-package-build: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135761 (owner: 10Andrew Bogott) [16:24:10] (03CR) 10Andrew Bogott: [C:03+2] wmcs-package-build: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135761 (owner: 10Andrew Bogott) [16:24:55] (03CR) 10Andrew Bogott: [C:03+2] Remove files and manifests for openstack version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1135756 (owner: 10Andrew Bogott) [16:26:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2085 to cirrussearch2085 - bking@cumin2002" [16:27:20] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on 15 hosts with reason: reimaging/migrating hosts [16:27:44] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2085 to cirrussearch2085 - bking@cumin2002" [16:27:44] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2085 [16:28:32] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2085 [16:29:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2085 to cirrussearch2085 [16:32:49] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache 
cirrussearch2085.codfw.wmnet on all recursors [16:32:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2085.codfw.wmnet on all recursors [16:33:15] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye [16:33:26] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2085 [16:33:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:33:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host cirrussearch2085 [16:33:29] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2085.codfw.wmnet with OS bullseye [16:34:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:34:30] (03CR) 10Hnowlan: "Could you put a little more context either in commit or comment please? It's a bit mysterious without context!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [16:35:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734087 (10phaultfinder) [16:36:27] (03CR) 10RLazarus: [C:03+1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan) [16:40:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:42:45] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye [16:42:54] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cirrussearch2085.codfw.wmnet with OS bullseye [16:44:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye [16:44:18] (03PS3) 10Scott French: Profile::Mediawiki_deployment: add 'dir' field [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) [16:44:23] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:44:27] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2085 [16:44:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:45:38] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007 [16:45:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for 
host an-druid1007 [16:45:50] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [16:46:14] (03CR) 10Scott French: "How about something like this? (see commit message)" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [16:46:31] (03CR) 10Scott French: "And thanks, Ahmon, as well!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [16:46:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:47:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:48:33] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007 [16:48:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1007 [16:49:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:50:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:54:18] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru 
RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:57:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:57:54] (03PS1) 10Bartosz Dziewoński: CentralAuthTokenManager: Log failures for write operations [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) [16:58:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [16:58:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński) [16:59:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2085 - bking@cumin2002" [16:59:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2085 - bking@cumin2002" [16:59:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:08] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2085.codfw.wmnet 72.48.192.10.in-addr.arpa 2.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all 
recursors [16:59:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2085.codfw.wmnet 72.48.192.10.in-addr.arpa 2.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:59:12] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2085 [16:59:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński) [17:00:09] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2085 [17:00:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2085 [17:02:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:29] (03CR) 10Hnowlan: [C:03+1] "Perfect, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [17:03:54] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:04:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:07:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bullseye [17:08:10] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734249 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye [17:08:53] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:15:31] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage [17:15:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734311 (10phaultfinder) [17:16:11] (03PS7) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 
10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [17:16:28] (03PS1) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) [17:18:07] (03CR) 10Dzahn: [C:04-1] "After looking at this some more I think we don't want to change "use_scap3_deployment" since this just switches jenkins deployment to "the" [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [17:19:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage [17:19:23] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage [17:20:47] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 578524208 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:21:02] (03PS2) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) [17:21:13] (03CR) 10Federico Ceratto: "(Rebased)" [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [17:22:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage [17:22:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 151952 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:28:48] (03CR) 10Ssingh: Add zarcillo (aux k8s) CNAME (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) 
[17:30:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734389 (10phaultfinder) [17:32:12] (03CR) 10Dzahn: "want to also do codfw right away? see around line 810 in templates/wmnet. We recently got this for both DCs." [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [17:37:27] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:37:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:37:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1007.eqiad.wmnet with OS bullseye [17:37:59] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734422 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye completed: - an-druid1007... 
[17:38:20] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734423 (10Jclark-ctr) 05Open→03Resolved [17:38:46] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734429 (10Jclark-ctr) [17:38:56] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734431 (10Jclark-ctr) 05Open→03Resolved [17:39:38] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734434 (10Jclark-ctr) [17:39:43] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2085.codfw.wmnet with OS bullseye [17:47:40] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:53:15] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [17:53:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002" [17:53:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:51] (03PS1) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) [18:06:14] (03CR) 10CI reject: [V:04-1] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [18:06:51] (03CR) 10Aleksandar Mastilovic: "Please do not merge/deploy until we're ready to turn Gobblin on Airflow." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [18:07:41] (03PS2) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) [18:22:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:23:15] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734558 (10RobH) The two new optics arrived for this, one spare and one to swap in. >>! In T390766#10730347, @RobH wrote: > @cmooney: So I've figur... [18:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734585 (10phaultfinder) [18:29:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10734586 (10phaultfinder) [18:31:48] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: add 'dir' field [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French) [18:32:27] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:35:48] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns records for new separate routed link in ulsfo - cmooney@cumin1002" [18:35:57] (03PS1) 10Cathal Mooney: Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731) [18:38:01] (03PS1) 10Cathal Mooney: ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731) [18:38:48] (03CR) 10Ssingh: 
[C:03+1] Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney) [18:39:00] (03CR) 10Cathal Mooney: [C:03+2] Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney) [18:39:17] !log cmooney@dns2005 START - running authdns-update [18:39:43] (03CR) 10Cathal Mooney: [C:03+2] ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney) [18:40:40] (03Merged) 10jenkins-bot: ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney) [18:41:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns records for new separate routed link in ulsfo - cmooney@cumin1002" [18:41:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:41:20] !log cmooney@dns2005 END - running authdns-update [18:42:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:44:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734631 (10phaultfinder) [18:45:03] !log remove et-0/0/0 from ae0 LAG bundle on cr3-ulsfo and cr4-ulsfo T390731 [18:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:06] T390731: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731 [18:53:22] RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:57:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:01:14] (03PS3) 10Dwisehaupt: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) [19:03:39] (03CR) 10Dwisehaupt: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:05:26] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734723 (10cmooney) >>! In T390731#10734558, @RobH wrote: > How is best to proceed? Since this is a redundant link can I just enter a remote hand... [19:15:13] (03CR) 10Hashar: "I imagine the `libssl-dev` supporting ECH is in `component/nginx-ech` and since Ia0d3229ac4ab5747c717e08f1d8529ec2cdc21a9 it should be all" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:15:14] (03CR) 10Dwisehaupt: "@jhathaway@wikimedia.org Thanks for the review and addition of the include to clear up the verification tests. 
I've hit a point where PCC " [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:21:17] (03CR) 10Hashar: "recheck with `COMPONENT=component/nginx-ech` ( https://gerrit.wikimedia.org/r/c/integration/config/+/1136001 )" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:24:08] (03CR) 10Hashar: "recheck with the sudo policy amended with `env_keep+="COMPONENT"` ( https://horizon.wikimedia.org/project/sudo/ )." [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734824 (10phaultfinder) [19:24:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 170389960 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:24:55] (03CR) 10Ssingh: "Need to update debian/control here again but leave that to me. Thanks for the help!" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:25:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 54344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:35:02] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:13] (03CR) 10Hashar: "recheck with `export COMPONENT` in the Jenkins job." 
[debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:36:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3643 MB (3% inode=98%): /tmp 3643 MB (3% inode=98%): /var/tmp 3643 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [19:39:56] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2104 to cirrussearch2014 [19:40:18] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:44:42] !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic2105 to cirrussearch2105 [19:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:45:19] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [19:45:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2104 to cirrussearch2014 - bking@cumin2002" [19:47:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1925538552 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:48:01] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2104 to cirrussearch2014 - bking@cumin2002" [19:48:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:48:02] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2014 [19:48:12] !log bking@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2014 [19:48:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2104 to cirrussearch2014 [19:49:31] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2104.codfw.wmnet on all recursors [19:49:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2104.codfw.wmnet on all recursors [19:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734929 (10phaultfinder) [19:50:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 227936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:52:54] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2014.codfw.wmnet on all recursors [19:52:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2014.codfw.wmnet on all recursors [19:53:50] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10734938 (10RobH) a:05RobH→03ayounsi >>! In T390240#10732617, @ayounsi wrote: > Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthe... 
[19:54:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10734945 (10RobH) [19:56:12] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:57:17] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2105 to cirrussearch2105 - ryankemper@cumin2002" [19:57:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2105 to cirrussearch2105 - ryankemper@cumin2002" [19:57:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:24] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2105 [19:57:37] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2105 [19:58:18] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2105 to cirrussearch2105 [19:59:47] (03PS1) 10Bking: temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) [20:00:11] (03CR) 10CI reject: [V:04-1] temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:00:37] (03PS4) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [20:00:47] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 
10Dwisehaupt) [20:01:09] (03PS2) 10Bking: temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) [20:01:50] (03CR) 10JHathaway: "that is sharp corner I helped create, sorry, you need to add:" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [20:02:31] FIRING: [2x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:03:41] (03CR) 10Bking: [C:03+2] temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:03:53] (03CR) 10Bking: [C:03+2] "self-merging to unblock ongoing migration" [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:05:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734976 (10phaultfinder) [20:06:11] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2105.codfw.wmnet with OS bullseye [20:06:23] !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2105 [20:06:47] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [20:07:10] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:07:31] RESOLVED: [2x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:07:59] FIRING: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1076:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:11:03] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fix typo (cirrussearch2014 should be cirrussearch2104) - bking@cumin2002 - T388610" [20:11:06] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [20:11:09] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fix typo (cirrussearch2014 should be cirrussearch2104) - bking@cumin2002 - T388610" [20:12:23] (03CR) 10Dwisehaupt: "Thanks. I have a vague memory of possibly seeing that when first investigating months ago." [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [20:12:46] FIRING: [7x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:12:59] FIRING: [12x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:13:50] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2104.codfw.wmnet on all recursors [20:13:53] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2104.codfw.wmnet on all recursors [20:14:45] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2105 - ryankemper@cumin2002" [20:14:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
cirrussearch2105 - ryankemper@cumin2002" [20:14:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:14:52] !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2105.codfw.wmnet 70.48.192.10.in-addr.arpa 0.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:14:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2105.codfw.wmnet 70.48.192.10.in-addr.arpa 0.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:14:56] !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2105 [20:15:09] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2105 [20:15:10] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2105 [20:17:46] FIRING: [20x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:17:59] FIRING: [19x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:18:54] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:20:33] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:22:46] FIRING: [20x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:22:59] FIRING: [22x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:23:53] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:25:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:25:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2014.codfw.wmnet with OS bullseye [20:26:07] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2014 [20:27:11] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:27:46] FIRING: [29x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:27:59] FIRING: [32x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:31:22] (03CR) 10Hashar: [C:03+1] jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) 
[20:32:09] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage [20:32:46] FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:32:59] FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:35:01] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2014 - bking@cumin2002" [20:35:06] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2014 - bking@cumin2002" [20:35:06] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:06] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2014.codfw.wmnet 69.48.192.10.in-addr.arpa 9.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:35:10] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2014.codfw.wmnet 69.48.192.10.in-addr.arpa 9.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:35:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2014 [20:35:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage [20:36:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3347 MB (3% inode=98%): /tmp 3347 MB (3% inode=98%): /var/tmp 3347 MB (3% 
inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [20:37:46] FIRING: [26x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:37:59] FIRING: [26x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:41:06] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2014 [20:41:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2014 [20:42:08] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:42:46] RESOLVED: [23x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:46:12] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:46:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [20:50:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:56:10] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:56:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:57:12] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:57:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2105.codfw.wmnet with OS bullseye [20:58:37] !log bking@cumin2002 START - Cookbook sre.hosts.rename from cirrussearch2014 to cirrussearch2104 [20:58:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from cirrussearch2014 to cirrussearch2104 [21:01:45] (03PS1) 10Bking: cirrussearch: temporarily add cirrussearch2014 so we can rename [puppet] - 10https://gerrit.wikimedia.org/r/1136019 (https://phabricator.wikimedia.org/T388610) [21:01:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 
(Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:03:09] (03CR) 10Bking: [C:03+2] "self-merging to unblock migration." [puppet] - 10https://gerrit.wikimedia.org/r/1136019 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:10:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:15:41] (03PS1) 10Bking: cirrussearch2014: move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1136020 (https://phabricator.wikimedia.org/T388610) [21:16:10] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:16:14] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:21:03] (03CR) 10Bking: [C:03+2] cirrussearch2014: move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1136020 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:21:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: 
cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:26:20] (03PS1) 10Bking: cirrussearch: Use new insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1136021 (https://phabricator.wikimedia.org/T388610) [21:27:12] (03CR) 10Bking: [C:03+2] cirrussearch: Use new insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1136021 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:28:50] (03PS1) 10JHathaway: keyholder: restart proxy after arming a key [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) [21:29:01] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway) [21:33:47] (03PS1) 10Bking: cirrussearch: add the firewall suffix [puppet] - 10https://gerrit.wikimedia.org/r/1136023 (https://phabricator.wikimedia.org/T388610) [21:34:48] (03CR) 10Bking: [C:03+2] cirrussearch: add the firewall suffix [puppet] - 10https://gerrit.wikimedia.org/r/1136023 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:36:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3416 MB (3% inode=98%): /tmp 3416 MB (3% inode=98%): /var/tmp 3416 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:37:56] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2014.codfw.wmnet with reason: host reimage [21:40:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2014.codfw.wmnet with reason: host reimage [21:54:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cirrussearch2014.codfw.wmnet with OS bullseye [21:57:07] (03PS1) 10Bking: cirrussearch: remove no-longer-existing master-eligibles. [puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610) [22:14:47] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10735259 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than... [22:15:31] 06SRE, 06Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppet JMX mappings - https://phabricator.wikimedia.org/T342253#10735286 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progre... [22:15:35] 06SRE, 06Infrastructure-Foundations, 06serviceops-radar, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741#10735289 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "i... [22:16:26] 06SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120#10735327 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half year... [22:16:42] 06SRE, 06Traffic: Add version flag to purged - https://phabricator.wikimedia.org/T347839#10735334 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half years (see `T380300`). 
[22:21:26] SRE, Infrastructure-Foundations, Patch-For-Review: Monitoring check for nftables - https://phabricator.wikimedia.org/T348499#10735449 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one year (see `T380300`). Feel...
[22:22:32] SRE, SRE-tools, Infrastructure-Foundations, Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855#10735497 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has been "...
[22:23:08] (PS1) Aklapper: phabricator weekly changes email: Lower "in progress" threshold to 1y [puppet] - https://gerrit.wikimedia.org/r/1136028 (https://phabricator.wikimedia.org/T380300)
[22:27:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[22:37:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2105-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:38:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2105-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:42:43] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:43:39] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:45:32] (PS1) Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1136031
[22:47:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[22:48:22] (PS1) Clare Ming: Experimentation Lab: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1136032
[22:55:39] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[22:55:44] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[22:56:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3570 MB (3% inode=98%): /tmp 3570 MB (3% inode=98%): /var/tmp 3570 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:58:57] (CR) Clare Ming: [C:+2] Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1136031 (owner: Clare Ming)
[23:00:17] (Merged) jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1136031 (owner: Clare Ming)
[23:01:53] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[23:02:22] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[23:12:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2105-production-search-omega-codfw is showing memory pressure in the young pool -
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:35:02] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:40:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136035
[23:40:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136035 (owner: TrainBranchBot)
[23:45:02] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136035 (owner: TrainBranchBot)
[23:58:12] (PS2) Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595)