[00:05:40] <wikibugs>	 (PS1) Dzahn: phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889)
[00:06:03] <wikibugs>	 (CR) CI reject: [V:-1] phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[00:09:57] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842
[00:09:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842 (owner: TrainBranchBot)
[00:10:11] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732069 (Quiddity) Thanks for the drafts, both!  I will add this to Tech News tomorrow, **pending your confirmation** on the wording-tweaks I've made,...
[00:11:03] <wikibugs>	 (PS2) Dzahn: phabricator::migration: add scap::target [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889)
[00:20:38] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:27:10] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2060 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:27:10] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2060 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:27:38] <jinxer-wm>	 FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2060-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:28:38] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:30:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:35:45] <wikibugs>	 (CR) Dzahn: [V:-1 C:-1] "https://puppet-compiler.wmflabs.org/output/1135841/5260/phab1005.eqiad.wmnet/change.phab1005.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[00:48:53] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:57:39] <jinxer-wm>	 RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2060-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:09:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10732154 (phaultfinder)
[01:10:31] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135842 (owner: TrainBranchBot)
[01:11:16] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:16:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3333 MB (3% inode=98%): /tmp 3333 MB (3% inode=98%): /var/tmp 3333 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[01:30:02] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:40:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:40:20] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:40:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[01:42:39] <jinxer-wm>	 FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2072-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:43:39] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[02:02:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:07:39] <jinxer-wm>	 RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2072-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:22:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:28:53] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:21] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Postorius (held and) reported full headers get mangled somewhere in the system - https://phabricator.wikimedia.org/T309492#10732233 (Aklapper) @grin: Could you please answer the last comment? Thanks in advance!
[03:45:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:07] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[03:47:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 218, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:49:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:58:12] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:04:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:16:18] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:20:28] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:35:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:36:52] <icinga-wm>	 PROBLEM - Check unit status of sync-puppet-volatile on puppetmaster2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:40:30] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:44:39] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391654 (phaultfinder) NEW
[04:46:52] <icinga-wm>	 RECOVERY - Check unit status of sync-puppet-volatile on puppetmaster2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:50:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:53] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:22:26] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:30:02] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:35:47] <wikibugs>	 (PS1) Bartosz Dziewoński: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850
[05:35:48] <wikibugs>	 (PS1) Bartosz Dziewoński: Clean up obsolete SUL3 settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1135851
[05:36:44] <wikibugs>	 (CR) Bartosz Dziewoński: "I was surprised to find it wasn't already enabled. I think we just forgot. Is there any reason not to do it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850 (owner: Bartosz Dziewoński)
[05:40:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[05:43:39] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[05:57:42] <wikibugs>	 (PS1) Ayounsi: Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560)
[05:58:53] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:13] <wikibugs>	 (PS2) Ayounsi: Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0600)
[06:00:10] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[06:07:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[06:27:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[06:58:38] <wikibugs>	 (PS1) Marostegui: check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854
[06:59:05] <wikibugs>	 (CR) CI reject: [V:-1] check_flags_per_dc.sh: Remove x2 [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0700)
[07:02:43] <wikibugs>	 (CR) Marostegui: "recheck" [software] - https://gerrit.wikimedia.org/r/1135854 (owner: Marostegui)
[07:17:36] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:19:03] <wikibugs>	 (PS1) Marostegui: db1256: Remove note [puppet] - https://gerrit.wikimedia.org/r/1135856
[07:19:58] <wikibugs>	 (CR) Marostegui: [C:+2] db1256: Remove note [puppet] - https://gerrit.wikimedia.org/r/1135856 (owner: Marostegui)
[07:27:44] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1135 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:30:46] <wikibugs>	 (CR) Btullis: [C:+1] airflow-test-k8s: increase the reserved resources for the airflow-test-k8s scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135724 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[07:31:21] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-test-k8s: increase the reserved resources for the airflow-test-k8s scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135724 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[07:35:42] <wikibugs>	 (PS4) Jelto: gitlab: rename thanos object storage parameters [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922)
[07:43:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:27] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:44:48] <wikibugs>	 (PS5) Jelto: gitlab: rename thanos object storage parameters [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922)
[07:44:48] <wikibugs>	 (CR) DCausse: [C:+1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: Peter Fischer)
[07:45:02] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[07:45:06] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:46:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:50:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:50:56] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:51:26] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5261/console" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:52:03] <wikibugs>	 (PS1) Brouberol: airflow-test-k8s: increase the limit ranges to support deploying a bigger scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135891 (https://phabricator.wikimedia.org/T391556)
[07:52:14] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:08] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:00:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:04:35] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[08:10:58] <wikibugs>	 (CR) MVernon: [C:+1] "None of the tested hosts have object storage credentials in the PCC output, which I think is expected at this point? Since they don't curr" [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:13:44] <wikibugs>	 (PS2) Federico Ceratto: pool.py: In dry-run mode do not monitor connection drain [cookbooks] - https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577)
[08:16:49] <wikibugs>	 (CR) Federico Ceratto: "Applied the change and tested it in dry-run." [cookbooks] - https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: Federico Ceratto)
[08:18:28] <wikibugs>	 (CR) Ayounsi: [C:+2] Add eqiad RIPE Atlas Anchor VM to monitoring [puppet] - https://gerrit.wikimedia.org/r/1135852 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[08:20:15] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:40:22] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-test-k8s: increase the limit ranges to support deploying a bigger scheduler [deployment-charts] - https://gerrit.wikimedia.org/r/1135891 (https://phabricator.wikimedia.org/T391556) (owner: Brouberol)
[08:43:44] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10732602 (ayounsi) Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthers the row capacity imbalance.
[08:44:17] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1169.eqiad.wmnet with OS bullseye
[08:44:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732603 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-worker1169...
[08:44:52] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye
[08:44:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732604 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-worker...
[08:47:17] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] "yes all gitlab hosts have object storage disabled. I'll enable it for one of the replicas soon." [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:47:38] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: rename thanos object storage parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:48:53] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:50:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:34] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:55:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10732617 (ayounsi) Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthers the row capacity imbalance.
[08:56:49] <wikibugs>	 (PS2) Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975)
[08:57:08] <wikibugs>	 (PS1) Btullis: Prep an-worker1169 for reimage [puppet] - https://gerrit.wikimedia.org/r/1135899 (https://phabricator.wikimedia.org/T390169)
[08:57:50] <wikibugs>	 (PS1) Jelto: gitlab: fix Unknown variable: 'object_storage_access_key' [puppet] - https://gerrit.wikimedia.org/r/1135900 (https://phabricator.wikimedia.org/T378922)
[08:59:49] <wikibugs>	 (CR) Btullis: [C:+2] Prep an-worker1169 for reimage [puppet] - https://gerrit.wikimedia.org/r/1135899 (https://phabricator.wikimedia.org/T390169) (owner: Btullis)
[09:00:20] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: fix Unknown variable: 'object_storage_access_key' [puppet] - https://gerrit.wikimedia.org/r/1135900 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[09:00:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "That's fair yeah, ok let's go with the _filter in metric name as you suggested" [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:02:42] <wikibugs>	 (PS3) Filippo Giunchedi: perf/real_user_monitoring: add rec rules [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:02:51] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "See latest PS" [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:03:53] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:05:28] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1169.eqiad.wmnet with OS bullseye
[09:05:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732627 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002...
[09:05:48] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1169.eqiad.wmnet with OS bullseye
[09:05:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732630 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1...
[09:14:09] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:15:32] <wikibugs>	 (PS2) Clément Goubert: mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539)
[09:15:49] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:16:22] <wikibugs>	 (CR) Hnowlan: [C:+1] mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:20:08] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:20:37] <wikibugs>	 (PS3) Btullis: Temporarily exclude mediawikiwiki from the dumps [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:20:40] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:20:41] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:20:47] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:20:49] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[09:21:32] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:22:03] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10732673 (Ladsgroup) >>! In T355914#10732069, @Quiddity wrote: > Thanks for the drafts, both!  I will add this to Tech News tomorrow, **pending your con...
[09:22:25] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw:periodic_jobs: Cleanup updatetranslationstats [puppet] - https://gerrit.wikimedia.org/r/1135760 (https://phabricator.wikimedia.org/T388539) (owner: Clément Goubert)
[09:24:05] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1169.eqiad.wmnet with reason: host reimage
[09:25:04] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] perf/real_user_monitoring: add rec rules [puppet] - https://gerrit.wikimedia.org/r/1135684 (https://phabricator.wikimedia.org/T390166) (owner: Tiziano Fogli)
[09:25:24] <wikibugs>	 (CR) Btullis: [C:+2] Temporarily exclude mediawikiwiki from the dumps [puppet] - https://gerrit.wikimedia.org/r/1135734 (https://phabricator.wikimedia.org/T390839) (owner: Xcollazo)
[09:27:51] <godog>	 a
[09:27:58] <godog>	 sometimes b too
[09:28:07] <claime>	 it do b like this
[09:28:21] <godog>	 fr
[09:28:45] <wikibugs>	 SRE, SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10732694 (MatthewVernon)
[09:32:06] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[09:32:14] <wikibugs>	 SRE, SRE-swift-storage, Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10732719 (MatthewVernon)
[09:33:25] <wikibugs>	 (CR) Clément Goubert: Add namespace for zarcillo (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[09:35:02] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:37:28] <wikibugs>	 (PS7) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746
[09:37:52] <wikibugs>	 (CR) CI reject: [V:-1] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:38:53] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:54] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:38:56] <wikibugs>	 (PS8) Elukey: profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - https://gerrit.wikimedia.org/r/1135746
[09:40:02] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:12] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5263/co" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:40:26] <wikibugs>	 (CR) Elukey: "Thanks for the review Keith!" [puppet] - https://gerrit.wikimedia.org/r/1135746 (owner: Elukey)
[09:40:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[09:43:39] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:46:57] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1169.eqiad.wmnet with OS bullseye
[09:47:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-worker1169...
[09:47:29] <wikibugs>	 (PS1) Slyngshede: IDP-Test: 7.0.10.1 upgrade [dns] - https://gerrit.wikimedia.org/r/1135909
[09:48:06] <wikibugs>	 (PS1) Elukey: services: update proton's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1135910
[09:48:40] <wikibugs>	 (PS3) Fabfur: haproxy: start staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827
[09:48:53] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:49:28] <wikibugs>	 (PS4) Fabfur: haproxy: start staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827
[09:50:10] <wikibugs>	 (CR) Gergő Tisza: [C:+1] "Yeah, we just forgot." [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850 (owner: Bartosz Dziewoński)
[09:51:00] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP-Test: 7.0.10.1 upgrade [dns] - https://gerrit.wikimedia.org/r/1135909 (owner: Slyngshede)
[09:51:02] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:36] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135827 (owner: Fabfur)
[09:53:23] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[09:53:30] <icinga-wm>	 PROBLEM - Disk space on idp-test2005 is CRITICAL: DISK CRITICAL - free space: / 534MiB (1% inode=95%): /tmp 534MiB (1% inode=95%): /var/tmp 534MiB (1% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=idp-test2005&var-datasource=codfw+prometheus/ops
[09:53:35] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[09:56:01] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:19] <wikibugs>	 (CR) Vgutierrez: haproxy: start staticize haproxy acls into template (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1135827 (owner: Fabfur)
[09:59:53] <wikibugs>	 (PS1) Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665)
[10:00:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:02:03] <wikibugs>	 (CR) CI reject: [V:-1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[10:03:53] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:05:10] <wikibugs>	 (PS1) Slyngshede: IDP: CAS 7.0.10.1 upgrade [dns] - https://gerrit.wikimedia.org/r/1135913
[10:06:44] <wikibugs>	 (CR) Ayounsi: [C:+1] IDP: CAS 7.0.10.1 upgrade [dns] - https://gerrit.wikimedia.org/r/1135913 (owner: Slyngshede)
[10:06:59] <wikibugs>	 (CR) Slyngshede: [C:+2] IDP: CAS 7.0.10.1 upgrade [dns] - https://gerrit.wikimedia.org/r/1135913 (owner: Slyngshede)
[10:07:18] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[10:07:25] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[10:09:51] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[10:10:19] <wikibugs>	 (PS1) Btullis: Revert "Temporarily exclude mediawikiwiki from the dumps" [puppet] - https://gerrit.wikimedia.org/r/1135915
[10:12:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[10:14:10] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Temporarily exclude mediawikiwiki from the dumps" [puppet] - https://gerrit.wikimedia.org/r/1135915 (owner: Btullis)
[10:15:13] <wikibugs>	 (PS1) Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782)
[10:16:44] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[10:17:21] <wikibugs>	 (CR) CI reject: [V:-1] mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[10:18:22] <wikibugs>	 (PS2) Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782)
[10:19:53] <wikibugs>	 (PS1) Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - https://gerrit.wikimedia.org/r/1135917
[10:20:30] <wikibugs>	 (PS1) Btullis: Temporarily add mediawikiwiki to the skip list for dumps [puppet] - https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839)
[10:21:22] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5264/co" [puppet] - https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839) (owner: Btullis)
[10:22:16] <wikibugs>	 (CR) CI reject: [V:-1] logstash: cast 'error' to object too [puppet] - https://gerrit.wikimedia.org/r/1135917 (owner: Filippo Giunchedi)
[10:22:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1169.eqiad.wmnet
[10:23:54] <wikibugs>	 (PS1) Jelto: gitlab: enable object storage on one of the replicas: [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922)
[10:25:42] <wikibugs>	 (PS5) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670)
[10:27:26] <wikibugs>	 (PS2) Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665)
[10:27:47] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Temporarily add mediawikiwiki to the skip list for dumps [puppet] - https://gerrit.wikimedia.org/r/1135918 (https://phabricator.wikimedia.org/T390839) (owner: Btullis)
[10:27:54] <wikibugs>	 (CR) CI reject: [V:-1] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[10:27:59] <wikibugs>	 (PS1) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670)
[10:28:51] <wikibugs>	 (CR) Tiziano Fogli: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1135917 (owner: Filippo Giunchedi)
[10:28:53] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:29:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1169.eqiad.wmnet
[10:30:02] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:30:14] <wikibugs>	 (PS2) Jelto: gitlab: enable object storage on one of the replicas [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922)
[10:30:26] <wikibugs>	 (CR) CI reject: [V:-1] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[10:31:25] <wikibugs>	 (PS1) Btullis: Revert "Temporarily put an-worker1169 back into insetup mode" [puppet] - https://gerrit.wikimedia.org/r/1135921
[10:31:57] <wikibugs>	 (PS1) Clément Goubert: php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665)
[10:32:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[10:32:41] <wikibugs>	 (Abandoned) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135920 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[10:33:24] <wikibugs>	 (CR) Btullis: [C:+2] Revert "Temporarily put an-worker1169 back into insetup mode" [puppet] - https://gerrit.wikimedia.org/r/1135921 (owner: Btullis)
[10:34:00] <wikibugs>	 (CR) Hnowlan: [C:+1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[10:37:08] <wikibugs>	 (CR) Jgiannelos: [C:+1] services: update proton's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1135910 (owner: Elukey)
[10:37:31] <wikibugs>	 (PS3) Jelto: gitlab: enable object storage on one of the replicas [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922)
[10:39:13] <wikibugs>	 (PS1) Btullis: Put an-worker1169 back into service [puppet] - https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169)
[10:39:31] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[10:40:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:40:45] <wikibugs>	 (PS4) Jelto: gitlab: enable object storage on one of the replicas [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922)
[10:42:53] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[10:43:27] <wikibugs>	 (PS2) Btullis: Put an-worker1169 back into service and exclude group 3 [puppet] - https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169)
[10:45:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:59] <wikibugs>	 (PS3) Btullis: Put an-worker1169 back into service and exclude group 3 [puppet] - https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169)
[10:46:48] <wikibugs>	 (CR) Btullis: [C:+2] Put an-worker1169 back into service and exclude group 3 [puppet] - https://gerrit.wikimedia.org/r/1135925 (https://phabricator.wikimedia.org/T390169) (owner: Btullis)
[10:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:51:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10732936 (BTullis)
[10:51:51] <wikibugs>	 (CR) Clément Goubert: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[10:54:09] <wikibugs>	 (PS3) Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782)
[10:54:23] <wikibugs>	 (CR) Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[10:54:46] <wikibugs>	 (CR) MVernon: [C:+1] "LGTM :-)" [puppet] - https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[11:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T0700)
[11:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: gettimeofday() says it's time for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250411T1100)
[11:01:40] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[11:01:59] <wikibugs>	 (PS6) Giuseppe Lavagetto: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:07:13] <wikibugs>	 (PS7) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670)
[11:07:20] <wikibugs>	 (PS2) Clément Goubert: mw-cron: Add statsd-exporter release [deployment-charts] - https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672)
[11:08:36] <wikibugs>	 (CR) Fabfur: haproxy: staticize haproxy acls into template (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:09:24] <wikibugs>	 (CR) CI reject: [V:-1] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:12:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:12:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:12:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:13:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:13:33] <wikibugs>	 (CR) Vgutierrez: haproxy: staticize haproxy acls into template (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:16:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:24:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:25:09] <wikibugs>	 (PS1) Btullis: Prep for new druid hosts to be installed [puppet] - https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132)
[11:27:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:29:17] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1169.eqiad.wmnet
[11:29:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733007 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD re...
[11:31:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:34:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:35:07] <wikibugs>	 (PS8) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670)
[11:36:56] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1169.eqiad.wmnet
[11:37:18] <wikibugs>	 (CR) CI reject: [V:-1] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:37:19] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:37:20] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1167 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:37:20] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager Btullis Prep for hard drive replacement T390170 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:38:35] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132) (owner: Btullis)
[11:39:12] <wikibugs>	 (PS9) Fabfur: haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670)
[11:39:17] <wikibugs>	 (CR) Btullis: [C:+2] Prep for new druid hosts to be installed [puppet] - https://gerrit.wikimedia.org/r/1135933 (https://phabricator.wikimedia.org/T387132) (owner: Btullis)
[11:40:14] <wikibugs>	 (PS5) AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - https://gerrit.wikimedia.org/r/1135796
[11:40:43] <wikibugs>	 ops-eqiad, SRE, Data-Platform-SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733032 (BTullis)
[11:41:45] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[11:42:00] <wikibugs>	 (PS2) Stevemunene: airflow: cleanup deployment charts [deployment-charts] - https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359)
[11:42:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1167 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:43:04] <wikibugs>	 (CR) CI reject: [V:-1] airflow: cleanup deployment charts [deployment-charts] - https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: Stevemunene)
[11:43:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733038 (BTullis)
[11:45:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:02] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[11:45:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733039 (BTullis) Open→Resolved >>! In T390169#10724690, @ayounsi wrote: > To properly move the server yo...
[11:45:20] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733044 (BTullis)
[11:45:32] <wikibugs>	 (PS1) Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate [puppet] - https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555)
[11:48:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733046 (BTullis)
[11:50:35] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733050 (BTullis)
[11:51:14] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733052 (BTullis) Moving to the milestone, as we have a new column for tracking tasks like this.
[11:51:21] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733053 (10BTullis) a:05BTullis→03None
[11:52:59] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733057 (10BTullis)
[11:55:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733065 (10Stevemunene) a:03Stevemunene
[11:55:22] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733064 (10Silvan_WMDE) I believe this must have been an infrastructure issue which hasn't occurred any mor...
[11:56:14] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert)
[11:57:08] <wikibugs>	 (03PS6) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796
[12:00:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: refresh all tests in the new IPv6-enabled networks [puppet] - 10https://gerrit.wikimedia.org/r/1135942 (https://phabricator.wikimedia.org/T391325)
[12:02:00] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: refresh all tests in the new IPv6-enabled networks [puppet] - 10https://gerrit.wikimedia.org/r/1135942 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[12:05:07] <wikibugs>	 (03PS2) 10Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917
[12:05:13] <wikibugs>	 (03PS4) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212)
[12:05:39] <wikibugs>	 (03CR) 10Federico Ceratto: Add namespace for zarcillo (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[12:08:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325)
[12:10:52] <godog>	 !log bounce thanos-query thanos-query-frontend thanos-store on titan1*
[12:10:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:11] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325)
[12:17:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: codfw1dev: networktests: fix typo in envvars [puppet] - 10https://gerrit.wikimedia.org/r/1135944 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[12:20:23] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:33] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:20:46] <wikibugs>	 (03PS4) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:20:57] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:21:47] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:13] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:22:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 14.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:27:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:29:30] <wikibugs>	 (03PS5) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:32:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 14.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:33:30] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[12:34:25] <wikibugs>	 (03PS6) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:34:54] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: fix typos in networktests [puppet] - 10https://gerrit.wikimedia.org/r/1135947 (https://phabricator.wikimedia.org/T391325)
[12:36:00] <wikibugs>	 (03PS7) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:36:21] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:37:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Switch&var-bgp_neighbor=ssw2-a8-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:37:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:38:10] <wikibugs>	 (03PS3) 10Filippo Giunchedi: logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917
[12:38:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: grafana: set max_source_resolution=auto for thanos ds [puppet] - 10https://gerrit.wikimedia.org/r/1135948 (https://phabricator.wikimedia.org/T390215)
[12:38:20] <wikibugs>	 (03PS8) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665)
[12:38:24] <wikibugs>	 (03CR) 10Fabfur: haproxy: staticize haproxy acls into template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur)
[12:39:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733153 (10phaultfinder)
[12:40:59] <wikibugs>	 (03CR) 10Federico Ceratto: "Apologies, I misread the code. I added the downtime now." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:41:43] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] "LGTM, tests/knative_activator.yaml will serve as a valid test." [puppet] - 10https://gerrit.wikimedia.org/r/1135917 (owner: 10Filippo Giunchedi)
[12:42:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: grafana: set max_source_resolution=auto for thanos ds [puppet] - 10https://gerrit.wikimedia.org/r/1135948 (https://phabricator.wikimedia.org/T371102)
[12:43:16] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: codfw1dev: fix typos in networktests [puppet] - 10https://gerrit.wikimedia.org/r/1135947 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[12:47:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:48:19] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur)
[12:50:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:05] <wikibugs>	 (03Abandoned) 10Ssingh: package_builder: add packages for nginx build [puppet] - 10https://gerrit.wikimedia.org/r/1135731 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[12:58:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: networktests: refresh floating VM IP address [puppet] - 10https://gerrit.wikimedia.org/r/1135949 (https://phabricator.wikimedia.org/T391325)
[12:59:22] <wikibugs>	 (03PS1) 10Hashar: Upgrade and pin flake8 [software] - 10https://gerrit.wikimedia.org/r/1135950
[13:00:20] <wikibugs>	 (03CR) 10Fabfur: haproxy: staticize haproxy acls into template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur)
[13:03:47] <wikibugs>	 (03PS2) 10Hashar: tox: upgrade and pin flake8 [software] - 10https://gerrit.wikimedia.org/r/1135950
[13:03:47] <wikibugs>	 (03PS1) 10Hashar: tox: use flake8's extend-exclude [software] - 10https://gerrit.wikimedia.org/r/1135951
[13:04:46] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733207 (10BTullis) a:05BTullis→03Jclark-ctr >>! In T387142#10727875, @Jclark-ctr wrote: > @btullis handing over to you for updating puppet repo.  also to verify that 10...
[13:05:28] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733210 (10BTullis) a:05BTullis→03Jclark-ctr Done. Thanks @Jclark-ctr .
[13:06:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] "Per our IRC chat" [software] - 10https://gerrit.wikimedia.org/r/1135951 (owner: 10Hashar)
[13:06:44] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] "Per our IRC chat" [software] - 10https://gerrit.wikimedia.org/r/1135950 (owner: 10Hashar)
[13:07:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: codfw1dev: networktests: refresh floating VM IP address [puppet] - 10https://gerrit.wikimedia.org/r/1135949 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[13:07:45] <wikibugs>	 (03Merged) 10jenkins-bot: tox: upgrade and pin flake8 [software] - 10https://gerrit.wikimedia.org/r/1135950 (owner: 10Hashar)
[13:07:50] <wikibugs>	 (03Merged) 10jenkins-bot: tox: use flake8's extend-exclude [software] - 10https://gerrit.wikimedia.org/r/1135951 (owner: 10Hashar)
[13:08:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10733238 (10Gehel) Configuration tracked in T391680
[13:08:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10733240 (10Gehel)
[13:08:46] <wikibugs>	 (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/1135854 (owner: 10Marostegui)
[13:08:55] <wikibugs>	 06SRE, 06SRE Observability, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794#10733242 (10Gehel)
[13:09:41] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] check_flags_per_dc.sh: Remove x2 [software] - 10https://gerrit.wikimedia.org/r/1135854 (owner: 10Marostegui)
[13:10:12] <wikibugs>	 (03Merged) 10jenkins-bot: check_flags_per_dc.sh: Remove x2 [software] - 10https://gerrit.wikimedia.org/r/1135854 (owner: 10Marostegui)
[13:12:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10733254 (10Gehel)
[13:13:39] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:13:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10733274 (10Gehel)
[13:13:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10733278 (10Gehel)
[13:14:10] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10733297 (10Gehel)
[13:16:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:16:21] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10733355 (10Gehel)
[13:16:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10733359 (10Gehel)
[13:16:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10733361 (10Gehel)
[13:16:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:16:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10733365 (10Gehel)
[13:16:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10733363 (10Gehel)
[13:16:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10733367 (10Gehel)
[13:16:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10733369 (10Gehel)
[13:17:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10733373 (10Gehel)
[13:17:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10733371 (10Gehel)
[13:17:20] <wikibugs>	 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Bring relforge100[89] into production - https://phabricator.wikimedia.org/T389957#10733376 (10Gehel)
[13:17:26] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10733381 (10Gehel)
[13:17:36] <wikibugs>	 07sre-alert-triage, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10733391 (10Gehel)
[13:17:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:17:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:18:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 10Discovery-Search (2025.03.22 - 2025.04.11): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733395 (10Gehel)
[13:22:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:3 (Core: ssw2-a8-codfw:ethernet-1/33 {#10693_12295-4}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:25:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Change weight for db1180 T390510', diff saved to https://phabricator.wikimedia.org/P74901 and previous config saved to /var/cache/conftool/dbconfig/20250411-132518-marostegui.json
[13:25:22] <stashbot>	 T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510
[13:27:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 10Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733525 (10Gehel)
[13:29:22] <wikibugs>	 (03PS6) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[13:33:46] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia openssl_3.4.1-1+ech2_amd64.changes: T205378
[13:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:50] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[13:40:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[13:43:26] <wikibugs>	 (03CR) 10Hashar: "recheck after having enabled the debian-glue job: https://gerrit.wikimedia.org/r/c/integration/config/+/1135728" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:43:39] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[13:44:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:46:43] <wikibugs>	 (03CR) 10Hashar: "From the build console:" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:47:00] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[13:47:40] <wikibugs>	 (03CR) 10Ssingh: "Yes, thanks, I am still figuring this out and did a gitlab build which worked so will take it from there. I may abandon this as well but w" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:47:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:49:31] <icinga-wm>	 RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 4.276 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:49:41] <icinga-wm>	 RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:53:09] <wikibugs>	 (03PS1) 10Hashar: ci: add eatmydata to bookworm cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[13:54:46] <wikibugs>	 (03CR) 10Herron: [C:03+1] "LGTM! 👍" [puppet] - 10https://gerrit.wikimedia.org/r/1135746 (owner: 10Elukey)
[14:00:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:07] <wikibugs>	 (03PS2) 10Clément Goubert: hiera: Add zarcillo k8s service on traffic server [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:00:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:00:37] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:04:03] <wikibugs>	 (03PS2) 10Ssingh: Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378)
[14:04:19] <wikibugs>	 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465#10733691 (10isarantopoulos) I've deleted 30GB from my home directory.  @klausman are there any quick wins to clean up disk space for now?  I think purging the h...
[14:05:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:05:26] <klausman>	 ^^^ on it (ml-lab1001)
[14:05:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1006.eqiad.wmnet with OS bullseye
[14:05:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bullseye
[14:05:56] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1006.eqiad.wmnet with OS bullseye
[14:05:56] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733693 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye
[14:06:46] <wikibugs>	 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465#10733710 (10klausman) >>! In T391465#10733690, @isarantopoulos wrote: > I've deleted 30GB from my home directory.  > @klausman are there any quick wins to clean...
[14:07:51] <wikibugs>	 (03CR) 10Hashar: "I have updated both instances cowbuilder image using:" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar)
[14:12:43] <icinga-wm>	 RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[14:17:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:19:59] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10733737 (10Bugreporter) >last 10 newly created Wikidata Properties Note the issue are only reported in ite...
[14:21:08] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10733743 (10Bugreporter)
[14:37:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:38:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:43:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2114.codfw.wmnet with OS bullseye
[14:43:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2114
[14:43:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2114
[14:47:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:48:38] <wikibugs>	 (03PS2) 10Bking: sre.elasticsearch.rolling-operation: don't use http for dhcp for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper)
[14:49:54] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on releases2003.codfw.wmnet with reason: Bookworm Re-image
[14:52:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:53:28] <sukhe>	 !log reprepro -C component/nginx-ech remove bookworm-wikimedia libssl3t64: removing libssl3t* since we dropped support for 64-bit time
[14:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:04] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610)
[14:55:33] <wikibugs>	 (03CR) 10Bking: [C:04-1] "Do not merge until the row D non-masters are finished re-imaging." [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:56:46] <wikibugs>	 (03CR) 10Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[14:56:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2142']
[14:57:12] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-worker2142']
[14:57:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:58:08] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks for pre-filling these and replacing the CRLFs!" [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[14:59:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[15:00:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:01:21] <icinga-wm>	 RECOVERY - Host wikikube-worker2142 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[15:01:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:02:22] <claime>	 Hello 2142 :D
[15:02:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:03:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2114.codfw.wmnet with reason: host reimage
[15:03:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:04:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733816 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm @Clement_Goubert  arrived and replaced. ran provisioning cookbook and it pings now. L...
[15:04:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733819 (10Clement_Goubert) Thanks for the resuscitation!
[15:05:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage
[15:05:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:06:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:08:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1006.eqiad.wmnet with reason: host reimage
[15:08:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[15:08:53] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:10:58] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert)
[15:12:09] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:12:43] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:13:42] <wikibugs>	 (03CR) 10Scott French: [C:03+1] scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[15:13:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:19:42] <claime>	 !log homer lsw1-c2-codfw* commit T391341
[15:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:45] <stashbot>	 T391341: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341
[15:19:57] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2142.codfw.wmnet
[15:19:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker2142.codfw.wmnet
[15:20:20] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:21:15] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Looks good! One question:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[15:21:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 10Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733842 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm...
[15:22:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:24] <wikibugs>	 (03CR) 10Clément Goubert: "I wanted to do that in a later patch, to make a possible revert smaller to review, but I can do it in this one if you prefer." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[15:22:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:22:36] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1006.eqiad.wmnet with OS bullseye
[15:22:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:22:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 10Discovery-Search (2025.04.11 - 2025.05.02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10733858 (10Jhancock.wm) @bking
[15:22:41] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1006.eqiad.wmnet with OS bullseye completed: - an-druid1006...
[15:23:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2142.codfw.wmnet
[15:23:06] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2142.codfw.wmnet
[15:23:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733862 (10ops-monitoring-bot) pool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: None
[15:23:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10733863 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker2142.codfw.wmnet completed...
[15:23:17] <wikibugs>	 (03PS6) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:23:28] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia openssl_3.4.1-1+ech3_amd64.changes: T205378
[15:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:31] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[15:23:41] <wikibugs>	 (03PS7) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:23:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2114.codfw.wmnet with OS bullseye
[15:24:45] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:24:53] <wikibugs>	 (03PS5) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212)
[15:24:53] <wikibugs>	 (03PS8) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:25:21] <wikibugs>	 (03PS5) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[15:26:04] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-druid1007.eqiad.wmnet with OS bullseye
[15:26:11] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye executed with errors: - an...
[15:26:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host druid1012.eqiad.wmnet with OS bullseye
[15:26:35] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Sounds good, and no strong preference on my end. Was mainly asking because I thought I might be missing a lingering use case." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[15:26:40] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host druid1012.eqiad.wmnet with OS bullseye
[15:27:00] <wikibugs>	 (03PS6) 10Federico Ceratto: Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212)
[15:27:00] <wikibugs>	 (03PS9) 10Federico Ceratto: Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212)
[15:27:18] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10733871 (10Jclark-ctr)
[15:27:57] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733874 (10Jclark-ctr)
[15:28:35] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Add namespace for zarcillo (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:29:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733875 (10phaultfinder)
[15:30:34] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:30:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host druid1013.eqiad.wmnet with OS bullseye
[15:30:50] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10733878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host druid1013.eqiad.wmnet with OS bullseye
[15:31:20] <wikibugs>	 (03PS6) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[15:31:30] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510)
[15:33:03] <wikibugs>	 (03PS2) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510)
[15:33:34] <wikibugs>	 (03PS7) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[15:33:40] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Adding @akosiaris@wikimedia.org to make sure I didn't miss something." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:35:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:06] <wikibugs>	 (03PS1) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212)
[15:37:12] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) (owner: 10Ladsgroup)
[15:37:24] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech1_amd64.changes: T205378
[15:37:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:28] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[15:38:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage
[15:38:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan)
[15:40:06] <wikibugs>	 (03PS1) 10Clément Goubert: growthexperiments: Disable updatementeedata on s6 [puppet] - 10https://gerrit.wikimedia.org/r/1135988
[15:40:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] growthexperiments: Disable updatementeedata on s6 [puppet] - 10https://gerrit.wikimedia.org/r/1135988 (owner: 10Clément Goubert)
[15:41:38] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1012.eqiad.wmnet with reason: host reimage
[15:41:47] <wikibugs>	 (03PS3) 10Ladsgroup: mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510)
[15:42:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage
[15:43:39] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:45:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:02] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[15:45:51] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1013.eqiad.wmnet with reason: host reimage
[15:47:55] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002"
[15:47:59] <wikibugs>	 (03PS4) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115
[15:48:01] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002"
[15:48:01] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:48:21] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway)
[15:48:46] <wikibugs>	 (03PS2) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:49:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2072 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1353, active_shards: 4184, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.97610513739545 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:49:29] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2072 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, discovered_master: True, active_primary_shards: 1678, active_shards: 5031, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 31140, active_shards_percent_as_number: 99.96026226902444 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:50:09] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2060 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1678, active_shards: 5031, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.96026226902444 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:52:43] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:53:19] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "cc'ing people from growth so they're aware" [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) (owner: 10Ladsgroup)
[15:54:11] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2060 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 59, number_of_data_nodes: 59, discovered_master: True, active_primary_shards: 1353, active_shards: 4184, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 941, active_shards_percent_as_number: 99.97610513739545 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:54:23] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mediawiki: Absent updatementeedata jobs [puppet] - 10https://gerrit.wikimedia.org/r/1135983 (https://phabricator.wikimedia.org/T390510) (owner: 10Ladsgroup)
[15:54:23] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:54:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10733982 (10phaultfinder)
[15:55:21] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:55:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[15:55:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:56:43] <wikibugs>	 (03CR) 10STran: [C:03+1] CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae)
[16:00:27] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2060:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:01:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:02:47] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10733999 (10MatthewVernon) Tagging @MoritzMuehlenhoff who is clinician next week, for information.
[16:08:52] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:08:52] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1012.eqiad.wmnet with OS bullseye
[16:08:57] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host druid1012.eqiad.wmnet with OS bullseye completed: - druid1012 (**PASS*...
[16:09:02] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:09:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1013.eqiad.wmnet with OS bullseye
[16:09:07] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host druid1013.eqiad.wmnet with OS bullseye completed: - druid1013 (**WARN*...
[16:11:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007
[16:11:55] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1007
[16:12:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-codfw and ssw2-a8-codfw (10.192.254.15) - group Switch - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr1-codfw:9804&var-bgp_group=Switch&var-bgp_neighbor=ssw2-a8-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:13:08] <wikibugs>	 (03PS8) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192)
[16:14:05] <wikibugs>	 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10734021 (10Eevans)
[16:17:40] <wikibugs>	 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10734029 (10Eevans)
[16:20:33] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:21:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] tcpircbot: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135762 (owner: 10Andrew Bogott)
[16:21:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2085 to cirrussearch2085
[16:22:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:23:39] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:23:42] <wikibugs>	 (03CR) 10Majavah: [C:03+1] wmcs-package-build: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135761 (owner: 10Andrew Bogott)
[16:24:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] wmcs-package-build: remove an unnecessary 'global' [puppet] - 10https://gerrit.wikimedia.org/r/1135761 (owner: 10Andrew Bogott)
[16:24:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Remove files and manifests for openstack version 'bobcat' [puppet] - 10https://gerrit.wikimedia.org/r/1135756 (owner: 10Andrew Bogott)
[16:26:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2085 to cirrussearch2085 - bking@cumin2002"
[16:27:20] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on 15 hosts with reason: reimaging/migrating hosts
[16:27:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2085 to cirrussearch2085 - bking@cumin2002"
[16:27:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2085
[16:28:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2085
[16:29:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2085 to cirrussearch2085
[16:32:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2085.codfw.wmnet on all recursors
[16:32:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2085.codfw.wmnet on all recursors
[16:33:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye
[16:33:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2085
[16:33:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:33:28] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host cirrussearch2085
[16:33:29] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2085.codfw.wmnet with OS bullseye
[16:34:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:34:30] <wikibugs>	 (03CR) 10Hnowlan: "Could you put a little more context either in commit or comment please? It's a bit mysterious without context!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[16:35:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734087 (10phaultfinder)
[16:36:27] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan)
[16:40:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:42:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye
[16:42:54] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cirrussearch2085.codfw.wmnet with OS bullseye
[16:44:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2085.codfw.wmnet with OS bullseye
[16:44:18] <wikibugs>	 (03PS3) 10Scott French: Profile::Mediawiki_deployment: add 'dir' field [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761)
[16:44:23] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:44:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2085
[16:44:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:45:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007
[16:45:38] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-druid1007
[16:45:50] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[16:46:14] <wikibugs>	 (03CR) 10Scott French: "How about something like this? (see commit message)" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[16:46:31] <wikibugs>	 (03CR) 10Scott French: "And thanks, Ahmon, as well!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[16:46:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[16:47:06] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:48:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007
[16:48:47] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1007
[16:49:41] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:50:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:25] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:54:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[16:57:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:57:54] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: CentralAuthTokenManager: Log failures for write operations [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784)
[16:58:26] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński)
[16:58:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński)
[16:59:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2085 - bking@cumin2002"
[16:59:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2085 - bking@cumin2002"
[16:59:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:59:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2085.codfw.wmnet 72.48.192.10.in-addr.arpa 2.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:59:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2085.codfw.wmnet 72.48.192.10.in-addr.arpa 2.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:59:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2085
[16:59:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński)
[17:00:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2085
[17:00:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2085
[17:02:23] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:02:29] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "Perfect, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[17:03:54] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[17:04:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:07:13] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:08:01] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1007.eqiad.wmnet with OS bullseye
[17:08:10] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734249 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye
[17:08:53] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[17:15:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage
[17:15:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734311 (10phaultfinder)
[17:16:11] <wikibugs>	 (03PS7) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[17:16:28] <wikibugs>	 (03PS1) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595)
[17:18:07] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "After looking at this some more I think we don't want to change "use_scap3_deployment" since this just switches jenkins deployment to "the" [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[17:19:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2085.codfw.wmnet with reason: host reimage
[17:19:23] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage
[17:20:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 578524208 and 36 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:21:02] <wikibugs>	 (03PS2) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212)
[17:21:13] <wikibugs>	 (03CR) 10Federico Ceratto: "(Rebased)" [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[17:22:30] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1007.eqiad.wmnet with reason: host reimage
[17:22:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 151952 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:28:48] <wikibugs>	 (03CR) 10Ssingh: Add zarcillo (aux k8s) CNAME (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[17:30:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734389 (10phaultfinder)
[17:32:12] <wikibugs>	 (03CR) 10Dzahn: "want to also do codfw right away? see around line 810 in templates/wmnet. We recently got this for both DCs." [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[17:37:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:37:46] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:37:46] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1007.eqiad.wmnet with OS bullseye
[17:37:59] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734422 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-druid1007.eqiad.wmnet with OS bullseye completed: - an-druid1007...
[17:38:20] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734423 (10Jclark-ctr) 05Open→03Resolved
[17:38:46] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734429 (10Jclark-ctr)
[17:38:56] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10734431 (10Jclark-ctr) 05Open→03Resolved
[17:39:38] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[67] - https://phabricator.wikimedia.org/T387142#10734434 (10Jclark-ctr)
[17:39:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2085.codfw.wmnet with OS bullseye
[17:47:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[17:53:15] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002"
[17:53:21] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add test server IP dns nokia lab - cmooney@cumin1002"
[17:53:21] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:05:51] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249)
[18:06:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[18:06:51] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: "Please do not merge/deploy until we're ready to turn Gobblin on Airflow." [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic)
[18:07:41] <wikibugs>	 (03PS2) 10Aleksandar Mastilovic: Remove support for systemd Gobblin [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249)
[18:22:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[18:23:15] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734558 (10RobH) The two new optics arrived for this, one spare and one to swap in.  >>! In T390766#10730347, @RobH wrote: > @cmooney: So I've figur...
[18:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734585 (10phaultfinder)
[18:29:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10734586 (10phaultfinder)
[18:31:48] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: add 'dir' field [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) (owner: 10Scott French)
[18:32:27] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:35:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns records for new separate routed link in ulsfo - cmooney@cumin1002"
[18:35:57] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731)
[18:38:01] <wikibugs>	 (03PS1) 10Cathal Mooney: ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731)
[18:38:48] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney)
[18:39:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add new include statement for netbox-generated dns snippet [dns] - 10https://gerrit.wikimedia.org/r/1135998 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney)
[18:39:17] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[18:39:43] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney)
[18:40:40] <wikibugs>	 (03Merged) 10jenkins-bot: ulsfo: enable OSPF on separate link between CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1135999 (https://phabricator.wikimedia.org/T390731) (owner: 10Cathal Mooney)
[18:41:02] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns records for new separate routed link in ulsfo - cmooney@cumin1002"
[18:41:02] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:41:20] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[18:42:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[18:44:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734631 (10phaultfinder)
[18:45:03] <topranks>	 !log remove et-0/0/0 from ae0 LAG bundle on cr3-ulsfo and cr4-ulsfo T390731
[18:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:06] <stashbot>	 T390731: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731
[18:53:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:57:43] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:01:14] <wikibugs>	 (03PS3) 10Dwisehaupt: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715)
[19:03:39] <wikibugs>	 (03CR) 10Dwisehaupt: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[19:05:26] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10734723 (10cmooney) >>! In T390731#10734558, @RobH wrote: > How is best to proceed?  Since this is a redundant link can I just enter a remote hand...
[19:15:13] <wikibugs>	 (03CR) 10Hashar: "I imagine the `libssl-dev` supporting ECH is in `component/nginx-ech` and since Ia0d3229ac4ab5747c717e08f1d8529ec2cdc21a9 it should be all" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:15:14] <wikibugs>	 (03CR) 10Dwisehaupt: "@jhathaway@wikimedia.org Thanks for the review and addition of the include to clear up the verification tests. I've hit a point where PCC " [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[19:21:17] <wikibugs>	 (03CR) 10Hashar: "recheck with `COMPONENT=component/nginx-ech` ( https://gerrit.wikimedia.org/r/c/integration/config/+/1136001 )" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:24:08] <wikibugs>	 (03CR) 10Hashar: "recheck with the sudo policy amended with `env_keep+="COMPONENT"` ( https://horizon.wikimedia.org/project/sudo/ )." [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734824 (10phaultfinder)
[19:24:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 170389960 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:24:55] <wikibugs>	 (03CR) 10Ssingh: "Need to update debian/control here again but leave that to me. Thanks for the help!" [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:25:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 54344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:35:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:13] <wikibugs>	 (03CR) 10Hashar: "recheck with `export COMPONENT` in the Jenkins job." [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:36:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3643 MB (3% inode=98%): /tmp 3643 MB (3% inode=98%): /var/tmp 3643 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[19:39:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2104 to cirrussearch2014
[19:40:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:44:42] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.rename from elastic2105 to cirrussearch2105
[19:45:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:45:19] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[19:45:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2104 to cirrussearch2014 - bking@cumin2002"
[19:47:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1925538552 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:48:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2104 to cirrussearch2014 - bking@cumin2002"
[19:48:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:48:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2014
[19:48:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2014
[19:48:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2104 to cirrussearch2014
[19:49:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2104.codfw.wmnet on all recursors
[19:49:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2104.codfw.wmnet on all recursors
[19:49:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734929 (10phaultfinder)
[19:50:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 227936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:52:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2014.codfw.wmnet on all recursors
[19:52:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2014.codfw.wmnet on all recursors
[19:53:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10734938 (10RobH) a:05RobH→03ayounsi >>! In T390240#10732617, @ayounsi wrote: > Please hold on. Netops just discovered it and we're not sure D6 the best choice network-wise as it furthe...
[19:54:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10734945 (10RobH)
[19:56:12] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:57:17] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2105 to cirrussearch2105 - ryankemper@cumin2002"
[19:57:23] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2105 to cirrussearch2105 - ryankemper@cumin2002"
[19:57:23] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:57:24] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2105
[19:57:37] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2105
[19:58:18] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2105 to cirrussearch2105
[19:59:47] <wikibugs>	 (03PS1) 10Bking: temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610)
[20:00:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:00:37] <wikibugs>	 (03PS4) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[20:00:47] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[20:01:09] <wikibugs>	 (03PS2) 10Bking: temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610)
[20:01:50] <wikibugs>	 (03CR) 10JHathaway: "that is sharp corner I helped create, sorry, you need to add:" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[20:02:31] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:03:41] <wikibugs>	 (03CR) 10Bking: [C:03+2] temporarily add cirrussearch2014 as a host [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:03:53] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging to unblock ongoing migration" [puppet] - 10https://gerrit.wikimedia.org/r/1136013 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:05:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10734976 (10phaultfinder)
[20:06:11] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2105.codfw.wmnet with OS bullseye
[20:06:23] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2105
[20:06:47] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox
[20:07:10] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:07:31] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:07:59] <jinxer-wm>	 FIRING: [4x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1076:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:11:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fix typo (cirrussearch2014 should be cirrussearch2104) - bking@cumin2002 - T388610"
[20:11:06] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[20:11:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fix typo (cirrussearch2014 should be cirrussearch2104) - bking@cumin2002 - T388610"
[20:12:23] <wikibugs>	 (03CR) 10Dwisehaupt: "Thanks. I have a vague memory of possibly seeing that when first investigating months ago." [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[20:12:46] <jinxer-wm>	 FIRING: [7x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:12:59] <jinxer-wm>	 FIRING: [12x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:13:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2104.codfw.wmnet on all recursors
[20:13:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2104.codfw.wmnet on all recursors
[20:14:45] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2105 - ryankemper@cumin2002"
[20:14:51] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2105 - ryankemper@cumin2002"
[20:14:51] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:14:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2105.codfw.wmnet 70.48.192.10.in-addr.arpa 0.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:14:55] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2105.codfw.wmnet 70.48.192.10.in-addr.arpa 0.7.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:14:56] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2105
[20:15:09] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2105
[20:15:10] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2105
[20:17:46] <jinxer-wm>	 FIRING: [20x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:17:59] <jinxer-wm>	 FIRING: [19x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:18:54] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:20:33] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:22:46] <jinxer-wm>	 FIRING: [20x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:22:59] <jinxer-wm>	 FIRING: [22x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1057:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:23:53] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:25:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:25:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2014.codfw.wmnet with OS bullseye
[20:26:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2014
[20:27:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:27:46] <jinxer-wm>	 FIRING: [29x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:27:59] <jinxer-wm>	 FIRING: [32x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:31:22] <wikibugs>	 (03CR) 10Hashar: [C:03+1] jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[20:32:09] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage
[20:32:46] <jinxer-wm>	 FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:32:59] <jinxer-wm>	 FIRING: [33x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:35:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2014 - bking@cumin2002"
[20:35:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2014 - bking@cumin2002"
[20:35:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2014.codfw.wmnet 69.48.192.10.in-addr.arpa 9.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:35:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2014.codfw.wmnet 69.48.192.10.in-addr.arpa 9.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:35:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2014
[20:35:15] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2105.codfw.wmnet with reason: host reimage
[20:36:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3347 MB (3% inode=98%): /tmp 3347 MB (3% inode=98%): /var/tmp 3347 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[20:37:46] <jinxer-wm>	 FIRING: [26x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:37:59] <jinxer-wm>	 FIRING: [26x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:41:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2014
[20:41:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2014
[20:42:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:42:46] <jinxer-wm>	 RESOLVED: [23x] SystemdUnitCrashLoop: logstash.service crashloop on elastic1053:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:46:12] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:46:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[20:50:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:56:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:56:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:57:12] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:57:16] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2105.codfw.wmnet with OS bullseye
[20:58:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from cirrussearch2014 to cirrussearch2104
[20:58:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from cirrussearch2014 to cirrussearch2104
[21:01:45] <wikibugs>	 (03PS1) 10Bking: cirrussearch: temporarily add cirrussearch2014 so we can rename [puppet] - 10https://gerrit.wikimedia.org/r/1136019 (https://phabricator.wikimedia.org/T388610)
[21:01:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:03:09] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging to unblock migration." [puppet] - 10https://gerrit.wikimedia.org/r/1136019 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:10:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:15:41] <wikibugs>	 (03PS1) 10Bking: cirrussearch2014: move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1136020 (https://phabricator.wikimedia.org/T388610)
[21:16:10] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:16:14] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:16:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:21:03] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch2014: move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1136020 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:21:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:26:20] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Use new insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1136021 (https://phabricator.wikimedia.org/T388610)
[21:27:12] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Use new insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1136021 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:28:50] <wikibugs>	 (03PS1) 10JHathaway: keyholder: restart proxy after arming a key [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711)
[21:29:01] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway)
[21:33:47] <wikibugs>	 (03PS1) 10Bking: cirrussearch: add the firewall suffix [puppet] - 10https://gerrit.wikimedia.org/r/1136023 (https://phabricator.wikimedia.org/T388610)
[21:34:48] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: add the firewall suffix [puppet] - 10https://gerrit.wikimedia.org/r/1136023 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:36:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3416 MB (3% inode=98%): /tmp 3416 MB (3% inode=98%): /var/tmp 3416 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[21:37:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2014.codfw.wmnet with reason: host reimage
[21:40:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2014.codfw.wmnet with reason: host reimage
[21:54:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2014.codfw.wmnet with OS bullseye
[21:57:07] <wikibugs>	 (03PS1) 10Bking: cirrussearch: remove no-longer-existing master-eligibles. [puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610)
[22:14:47] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10735259 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than...
[22:15:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppet JMX mappings - https://phabricator.wikimedia.org/T342253#10735286 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progre...
[22:15:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops-radar, 10Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741#10735289 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "i...
[22:16:26] <wikibugs>	 06SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120#10735327 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half year...
[22:16:42] <wikibugs>	 06SRE, 06Traffic: Add version flag to purged - https://phabricator.wikimedia.org/T347839#10735334 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half years (see `T380300`).
[22:21:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Monitoring check for nftables - https://phabricator.wikimedia.org/T348499#10735449 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one year (see `T380300`). Feel...
[22:22:32] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855#10735497 (10Aklapper) 05In progress→03Open Resetting task status from "In Progress" to "Open" as this task has been "...
[22:23:08] <wikibugs>	 (03PS1) 10Aklapper: phabricator weekly changes email: Lower "in progress" threshold to 1y [puppet] - 10https://gerrit.wikimedia.org/r/1136028 (https://phabricator.wikimedia.org/T380300)
[22:27:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[22:37:38] <jinxer-wm>	 FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2105-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[22:38:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2105-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:42:43] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:43:39] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:45:32] <wikibugs>	 (03PS1) 10Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136031
[22:47:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[22:48:22] <wikibugs>	 (03PS1) 10Clare Ming: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136032
[22:55:39] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[22:55:44] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[22:56:17] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3570 MB (3% inode=98%): /tmp 3570 MB (3% inode=98%): /var/tmp 3570 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:58:57] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136031 (owner: 10Clare Ming)
[23:00:17] <wikibugs>	 (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136031 (owner: 10Clare Ming)
[23:01:53] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[23:02:22] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[23:12:39] <jinxer-wm>	 RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2105-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[23:35:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:40:10] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136035
[23:40:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136035 (owner: 10TrainBranchBot)
[23:45:02] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:51:34] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136035 (owner: 10TrainBranchBot)
[23:58:12] <wikibugs>	 (03PS2) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595)