[00:02:18] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[00:02:22] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[00:10:14] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136823
[00:10:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136823 (owner: 10TrainBranchBot)
[00:10:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P75083 and previous config saved to /var/cache/conftool/dbconfig/20250416-001156-fceratto.json
[00:27:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75084 and previous config saved to /var/cache/conftool/dbconfig/20250416-002703-fceratto.json
[00:27:07] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[00:27:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2221.codfw.wmnet with reason: Maintenance
[00:27:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75085 and previous config saved to /var/cache/conftool/dbconfig/20250416-002725-fceratto.json
[00:27:47] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136823 (owner: 10TrainBranchBot)
[00:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:43:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75086 and previous config saved to /var/cache/conftool/dbconfig/20250416-004338-fceratto.json
[00:43:42] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[00:58:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P75087 and previous config saved to /var/cache/conftool/dbconfig/20250416-005846-fceratto.json
[01:13:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P75088 and previous config saved to /var/cache/conftool/dbconfig/20250416-011353-fceratto.json
[01:21:36] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/0fa72902e0aab988e2631df2617f26171681e532532aebd7feb2130a6edd4519/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:29:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75089 and previous config saved to /var/cache/conftool/dbconfig/20250416-012901-fceratto.json
[01:29:05] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:29:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2222.codfw.wmnet with reason: Maintenance
[01:29:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75090 and previous config saved to /var/cache/conftool/dbconfig/20250416-012924-fceratto.json
[01:41:36] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:45:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75091 and previous config saved to /var/cache/conftool/dbconfig/20250416-014529-fceratto.json
[01:45:34] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:53:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:58:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:00:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P75092 and previous config saved to /var/cache/conftool/dbconfig/20250416-020036-fceratto.json
[02:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[02:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:12:58] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 41.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:15:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P75093 and previous config saved to /var/cache/conftool/dbconfig/20250416-021544-fceratto.json
[02:16:16] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2103 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: green, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1678, active_shards: 5033, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_
[02:16:16] <icinga-wm>	 tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:16:16] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2103 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 60, number_of_data_nodes: 60, discovered_master: True, active_primary_shards: 1354, active_shards: 4185, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 3, delayed_unassigned_shards: 0, number_of_pending
[02:16:16] <icinga-wm>	 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92836676217765 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:23:27] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2103:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[02:30:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75094 and previous config saved to /var/cache/conftool/dbconfig/20250416-023052-fceratto.json
[02:30:56] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:43:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:50:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:50:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[03:52:50] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:42:50] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:45:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:06] <icinga-wm>	 PROBLEM - Restbase root url on restbase1029 is CRITICAL: connect to address 10.64.16.173 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[05:38:16] <volans>	 !log installing spicerack v10.1.0 on cumin2002
[05:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T0600)
[06:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:09:28] <volans>	 !log installing spicerack v10.1.0 on cumin1002
[06:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:20:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10746393 (10Lena_WMDE) @MatthewVernon works as expected, thank you! :)
[06:23:06] <wikibugs>	 (03PS1) 10Volans: __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835
[06:23:06] <wikibugs>	 (03PS1) 10Volans: Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836
[06:23:06] <wikibugs>	 (03PS1) 10Volans: I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837
[06:25:05] <wikibugs>	 (03PS1) 10Volans: I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838
[06:25:05] <wikibugs>	 (03PS1) 10Volans: ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839
[06:25:05] <wikibugs>	 (03PS1) 10Volans: Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840
[06:25:06] <wikibugs>	 (03PS1) 10Volans: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841
[06:25:06] <wikibugs>	 (03PS1) 10Volans: DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842
[06:25:08] <wikibugs>	 (03PS1) 10Volans: DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843
[06:25:12] <wikibugs>	 (03PS1) 10Volans: Observabil. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136844
[06:31:20] <wikibugs>	 (03PS1) 10Fabfur: cache: add termination status to haproxy log format [puppet] - 10https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454)
[06:37:00] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) (owner: 10Fabfur)
[06:37:07] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10746400 (10BCornwall) So far so good in the first 8 hours of uptime! We'll let it simmer overnight and see how it fares.
[06:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:46:18] <wikibugs>	 (03PS2) 10Fabfur: cache: add termination state to haproxy log format [puppet] - 10https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454)
[06:57:46] <wikibugs>	 (03CR) 10Elukey: [C:03+1] sre.hosts.reimage: check dbctl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[06:59:06] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:50] <elukey>	 !log powercycle ml-serve2007 - OEM event registered in getsel (seems DIMM-related)
[07:02:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:06] <wikibugs>	 (03PS17) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T390251)
[07:05:47] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[07:05:50] <icinga-wm>	 RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[07:05:52] <wikibugs>	 (03PS18) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T391457)
[07:06:30] <wikibugs>	 (03PS1) 10Kevin Bazira: eventstreams: expose RRLA event stream publicly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179)
[07:06:34] <icinga-wm>	 RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:06:41] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746407 (10Jelto) >>! In T378922#10743705, @MatthewVernon wrote: > Looking at the Ceph metrics, it seems the packages were fewer l...
[07:06:57] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746410 (10Jelto)
[07:11:50] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:22:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10746432 (10MatthewVernon)
[07:24:45] <wikibugs>	 (03Abandoned) 10Fabfur: cache: add termination state to haproxy log format [puppet] - 10https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) (owner: 10Fabfur)
[07:26:31] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:26:47] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:27:16] <wikibugs>	 (03PS1) 10MVernon: admin: add kcoleman to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861)
[07:36:16] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[07:39:39] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861) (owner: 10MVernon)
[07:42:11] <wikibugs>	 (03CR) 10MVernon: [C:03+2] admin: add kcoleman to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861) (owner: 10MVernon)
[07:43:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:49:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10746512 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @KColeman-WMF this is done for you now (but I'd allow...
[07:50:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746521 (10Gehel)
[07:50:32] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:50:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[07:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:48] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[08:02:48] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[08:10:55] <wikibugs>	 06SRE, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10746546 (10Gehel) p:05Triage→03High
[08:12:06] <wikibugs>	 06SRE, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10746549 (10Gehel) 05Open→03Resolved a:03brouberol Archiva is still being used, so we should still keep an eye on it. Cleanup done by @brouberol, we should be good for a while.
[08:16:02] <akosiaris>	 !log destroy the "main" helmfile releases for mw-wikifunctions. The service is now being powered by the single version MediaWiki HTTP routing solution releases, this is a cleanup.
[08:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Observabil. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136844 (owner: 10Volans)
[08:30:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] scap: Stop updating main mw-wikifunctions release [puppet] - 10https://gerrit.wikimedia.org/r/1136749 (owner: 10Alexandros Kosiaris)
[08:30:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Remove the main release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136748 (owner: 10Alexandros Kosiaris)
[08:32:17] <wikibugs>	 (03Merged) 10jenkins-bot: mw-wikifunctions: Remove the main release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136748 (owner: 10Alexandros Kosiaris)
[08:39:52] <wikibugs>	 (03PS4) 10Volans: Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[08:42:22] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589)
[08:44:18] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[08:44:23] <wikibugs>	 (03CR) 10Volans: [C:04-1] "I've rebased this one to resolve the rebase conflicts given the recent homer changes." [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[08:45:04] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[08:45:19] <wikibugs>	 (03PS2) 10Volans: Observabil. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136844
[08:45:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:30] <wikibugs>	 (03CR) 10Volans: [C:03+2] "Thanks for the review" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136844 (owner: 10Volans)
[08:46:12] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]]
[08:46:15] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[08:51:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi)
[08:52:09] <wikibugs>	 (03Merged) 10jenkins-bot: Observabil. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136844 (owner: 10Volans)
[08:58:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:58:06] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[08:59:06] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[08:59:34] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174)
[09:00:10] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez)
[09:02:10] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world failed: <CalledProcessError> Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'write-values', '--output-file-template', '/tmp/tmpsh_tee3p']' returned non-zero exit status 3. (scap version: 4.153.0) (duration: 15m 58s)
[09:02:59] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]]
[09:05:46] <Amir1>	 I'm retrying again
[09:07:31] <wikibugs>	 (03PS1) 10Ladsgroup: Change default thumbnail size to 250px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914)
[09:07:39] <elukey>	 anything that broke related to the registry issue?
[09:07:42] <elukey>	 or other things?
[09:07:52] <elukey>	 lemme know in case :D
[09:08:54] <Amir1>	 https://www.irccloud.com/pastebin/cipVY2Nf/
[09:08:56] <Amir1>	 elukey: 
[09:09:42] <elukey>	 ok never seen this before, and it looks really weird
[09:10:24] <Amir1>	 yeah
[09:10:26] <elukey>	 it doesn't seem related to the registry though, but scap running helmfile in the wrong way
[09:10:32] <Amir1>	 it couldn't even roll back
[09:10:32] <elukey>	 +1 to retry
[09:12:23] <wikibugs>	 (03CR) 10Klausman: [C:03+1] role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[09:12:50] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[09:15:02] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:15:06] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[09:15:15] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[09:15:23] <vgutierrez>	 !log repooling cp4047 - T387238
[09:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:30] <stashbot>	 T387238: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238
[09:17:27] <wikibugs>	 (03PS37) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[09:17:47] <wikibugs>	 (03CR) 10Majavah: [C:03+1] "Fine with me, I can merge/deploy as long as Francesco does not have any objections" [puppet] - 10https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: 10Ladsgroup)
[09:18:54] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve-ctrl1001.eqiad.wmnet with OS bookworm
[09:19:30] <wikibugs>	 (CR) FNegri: [C:+1] "Fine with me!" [puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: Ladsgroup)
[09:20:26] <wikibugs>	 (CR) Majavah: [C:+2] openstack: wikireplica_dns: Add termstore aliases for s8 [puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: Ladsgroup)
[09:22:04] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] (duration: 19m 05s)
[09:22:08] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[09:22:08] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746665 (Jelto)
[09:22:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914) (owner: Ladsgroup)
[09:22:32] <wikibugs>	 (CR) Kamila Součková: [C:-1] "I have a "Chesterton's fence" feeling about this. This seems reasonable for when you're developing, but for the submit checks on Gerrit I " [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar)
[09:22:53] <Amir1>	 elukey: It worked the second time *shrugs*
[09:23:07] <wikibugs>	 (Merged) jenkins-bot: Change default thumbnail size to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914) (owner: Ladsgroup)
[09:23:14] <elukey>	 Amir1: it felt that you were upset, I'd have done the same
[09:23:31] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]]
[09:23:31] <Amir1>	 I'm always upset :D
[09:23:35] <stashbot>	 T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914
[09:23:48] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:14] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:15] <wikibugs>	 (CR) Kamila Součková: [C:-1] "Just to clarify, I -1'd it because I want to hear someone else's opinion on this, I can be persuaded :-)" [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar)
[09:24:47] <elukey>	 Amir1: nah :D
[09:28:13] <wikibugs>	 (CR) Volans: [C:-1] "Forgot to ask, where are the queries? I don't see them in homer/public" [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi)
[09:29:38] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746694 (Jelto) I've triggered a backup on the GitLab replica, which has been switched to object storage. The new backup runtime...
[09:31:14] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10746698 (MatthewVernon) >>! In T391544#10745829, @Eevans wrote: > Cassandra's JBOD is pretty dumb in this r...
[09:31:57] <TheresNoTime>	 jouncebot: nowandnext
[09:31:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 28 minute(s)
[09:31:57] <jouncebot>	 In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000)
[09:32:55] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage
[09:35:09] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:35:13] <stashbot>	 T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914
[09:36:02] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage
[09:36:26] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[09:37:33] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Release campaignEvents extension to azwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: Mhorsey)
[09:39:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746721 (Stevemunene) a:Gehel→Stevemunene
[09:39:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746722 (Gehel) a:Stevemunene→None
[09:42:54] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136965
[09:43:07] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]] (duration: 19m 35s)
[09:43:22] <stashbot>	 T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914
[09:54:27] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet with OS bookworm
[09:57:29] <wikibugs>	 (PS1) Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970
[09:58:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[09:59:40] <wikibugs>	 (CR) CI reject: [V:-1] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970 (owner: Michael Große)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000)
[10:00:44] <wikibugs>	 (PS2) Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033)
[10:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:01:25] <wikibugs>	 (PS2) Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970
[10:02:16] <wikibugs>	 (PS41) Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb)
[10:02:16] <wikibugs>	 (CR) Federico Ceratto: "Updated using new features from Spicerack" [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb)
[10:02:19] <wikibugs>	 (PS3) Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510)
[10:03:06] <wikibugs>	 (PS42) Federico Ceratto: sre.mysql.upgrade: Switch to Host, apt-get and mysql helpers [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb)
[10:04:45] <MichaelG_WMF>	 jouncebot: nowandnext
[10:04:45] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000)
[10:04:45] <jouncebot>	 In 0 hour(s) and 55 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100)
[10:06:06] <wikibugs>	 (CR) Hnowlan: [C:+2] switchdc: clarify inputs for moving active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[10:06:38] <MichaelG_WMF>	 I'm looking into getting an urgent puppet script for mentorship reenabled. To that end, I would like to run the script as a test against testwiki. Is there an issue with that?
[10:06:52] <MichaelG_WMF>	 As in, with doing that now?
[10:07:01] <wikibugs>	 (PS3) FNegri: openstack: Tidy up wmcs-wikireplica-dns script [puppet] - https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953)
[10:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:09:04] <hnowlan>	 MichaelG_WMF: no issue with the timing - is your patch restoring the jobs for T391695? 
[10:09:04] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[10:09:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056 (cmooney) NEW p:Triage→Medium
[10:09:39] <MichaelG_WMF>	 hnowlan: yes, that is the goal
[10:10:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746764 (cmooney)
[10:10:41] <hnowlan>	 MichaelG_WMF: that's probably fine (ccing Amir1 for awareness) 
[10:11:07] <wikibugs>	 (Abandoned) Hnowlan: httpbb: use k8s jobrunners for healthchecking [puppet] - https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) (owner: Hnowlan)
[10:11:18] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[10:12:06] <wikibugs>	 (Merged) jenkins-bot: switchdc: clarify inputs for moving active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[10:12:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746767 (cmooney)
[10:13:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:51] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:14:29] <wikibugs>	 (Abandoned) Hnowlan: deployment: switch deploy servers to eqiad [puppet] - https://gerrit.wikimedia.org/r/1127074 (https://phabricator.wikimedia.org/T385155) (owner: Hnowlan)
[10:15:39] <MichaelG_WMF>	 hnowlan: Amir1: running the script worked without error, we should be able to reenable it I hope. Who exactly should I talk to for this? The Wiki says "talk to SRE"
[10:16:09] <wikibugs>	 (CR) Jgiannelos: [C:-1] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: Hnowlan)
[10:17:03] <MichaelG_WMF>	 !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.25$ mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose #T391695
[10:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:07] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[10:17:12] <Amir1>	 I will enable it soon
[10:17:23] <MichaelG_WMF>	 Amir1 <3
[10:18:39] <logmsgbot>	 volans@cumin2002 downtime (PID 4108428) is awaiting input
[10:18:47] <volans>	 elukey ^^^ yay
[10:19:09] <Amir1>	 MichaelG_WMF: while I get to a pc, can you try running it on frwiki and ruwiki and record how long it took?
[10:19:35] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nupwiki (T390714)
[10:19:38] <stashbot>	 T390714: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714
[10:19:40] <MichaelG_WMF>	 Amir1 can do
[10:19:41] <wikibugs>	 (CR) Hashar: "> for the submit checks on Gerrit I think we do actually want the diff against master, as that's what you're merging into." [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar)
[10:19:46] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database nupwiki (T390714)
[10:20:01] <Amir1>	 Thanks!
[10:20:12] <MichaelG_WMF>	 (though not sure where to find the actual slow queries log and how to read it)
[10:20:45] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[10:21:03] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:21:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75096 and previous config saved to /var/cache/conftool/dbconfig/20250416-102110-fceratto.json
[10:21:14] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:23:21] <MichaelG_WMF>	 !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki frwiki --verbose #T391695
[10:23:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:24] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[10:24:33] <elukey>	 volans: ah nice!
[10:25:25] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746807 (MatthewVernon) That fits with what I see from bucket stats: gitlab-packages has 3,938 objects and 195GB, gitlab-artifac...
[10:26:29] * MichaelG_WMF makes note to self: add some (--verbose) output while running to updateMenteeData.php -- looking at a shell that shows ~nothing is not great
[10:29:13] <MichaelG_WMF>	 Amir1: frwiki took 4m21s or 261 seconds. now running on ruwiki
[10:29:50] <MichaelG_WMF>	 !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki ruwiki --verbose #T391695
[10:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:54] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[10:30:40] <wikibugs>	 (PS1) Volans: spicerack: enable IRC notification on user input [puppet] - https://gerrit.wikimedia.org/r/1136973
[10:32:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75097 and previous config saved to /var/cache/conftool/dbconfig/20250416-103236-fceratto.json
[10:32:41] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:32:54] <wikibugs>	 (PS3) Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033)
[10:33:41] <wikibugs>	 (CR) Volans: "Tested on cumin2002, it notified me correctly:" [puppet] - https://gerrit.wikimedia.org/r/1136973 (owner: Volans)
[10:33:57] <MichaelG_WMF>	 Amir1: ruwiki was 173 seconds
[10:34:15] <wikibugs>	 (CR) Volans: [C:-1] "Forgot they are not yet merged, for reference are in Ia3ff62de353a2f2d2a48498b6d6ed96743fb3ffd" [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi)
[10:34:42] <MichaelG_WMF>	 though from our metrics, I expect enwiki to be one of those that runs a really looong time
[10:35:23] <wikibugs>	 (CR) Hnowlan: "Thanks for the review! My list of changes was based on the initial list in the ticket, good catches." [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: Hnowlan)
[10:37:24] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746861 (cmooney)
[10:37:48] <claime>	 yo MichaelG_WMF, could you try and run it using mwscript-k8s? That'd give us confidence for when we migrate it to mw-cron
[10:38:07] <wikibugs>	 (CR) Kamila Součková: [C:-1] "You're correct, but my worry is about a chain of commits and master diverging. (Sorry, I should have mentioned that explicitly.) In most c" [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar)
[10:38:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746862 (cmooney)
[10:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:38:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746864 (cmooney)
[10:40:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746877 (cmooney)
[10:40:22] <MichaelG_WMF>	 claime: can I do that now? I know there used to be an issue around that because I only have restricted access and not full deployment access
[10:40:38] <claime>	 ah, idk, unsure
[10:40:49] <claime>	 I can run it if you give me an invoc and an ok
[10:41:30] <claime>	 https://phabricator.wikimedia.org/T378429 guess not
[10:43:14] <MichaelG_WMF>	 claime: This should finish in about 11 seconds and be generally low-risk: `/srv/mediawiki/php-1.44.0-wmf.25$ mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose`
[10:43:21] <wikibugs>	 (CR) Kamila Součková: [C:-1] "I would feel a lot more comfortable with this change if we also added a new `check_master` task that replicates the old behaviour. It's pr" [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar)
[10:43:22] <claime>	 cool
[10:46:33] <claime>	 Done. Took 9 seconds.
[10:46:39] <claime>	 Ran on php 8.1 inside k8s
[10:47:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P75098 and previous config saved to /var/cache/conftool/dbconfig/20250416-104744-fceratto.json
[10:48:27] <MichaelG_WMF>	 Nice!
[10:50:31] <wikibugs>	 (PS1) Abijeet Patro: Add channel for ContentTranslation logging [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311)
[10:52:23] <logmsgbot>	 !log cgoubert@deploy1003 Started scap build-images: (no justification provided)
[10:52:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[10:52:58] <Amir1>	 I need to check the load on db1180 and some other things and then let you know 
[10:54:35] <wikibugs>	 (CR) Jgiannelos: [C:+1] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: Hnowlan)
[10:56:44] <claime>	 jouncebot: nowandnext
[10:56:44] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000)
[10:56:44] <jouncebot>	 In 0 hour(s) and 3 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100)
[10:57:51] <wikibugs>	 (PS1) Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[10:57:53] <wikibugs>	 (PS1) Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925)
[10:58:00] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap build-images: (no justification provided) (duration: 05m 36s)
[10:58:28] <wikibugs>	 (CR) Elukey: "Please check that my assumptions are correct :)" [puppet] - https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: Elukey)
[11:00:04] <jouncebot>	 mvolz: That opportune time for a Services – Citoid / Zotero deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100).
[11:02:33] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) (owner: FNegri)
[11:02:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P75099 and previous config saved to /var/cache/conftool/dbconfig/20250416-110252-fceratto.json
[11:03:39] <wikibugs>	 (CR) Elukey: "I have zero experience in this template, it looks good but I'd rely on Filippo's input to be honest :(" [puppet] - https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: Herron)
[11:04:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:04:48] <wikibugs>	 (PS2) Clément Goubert: php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665)
[11:05:19] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[11:05:21] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5308/co" [puppet] - https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: Elukey)
[11:05:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:05:38] <wikibugs>	 (PS1) Cathal Mooney: Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725)
[11:06:04] <claime>	 !log Rebuilding php base images to pick up 1135922 - T391665
[11:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:08] <stashbot>	 T391665: Move mwscript wrapper from base image to copy on build - https://phabricator.wikimedia.org/T391665
[11:06:12] <wikibugs>	 (PS2) Abijeet Patro: Add channel for ContentTranslation logging [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311)
[11:09:21] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[11:09:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:10:21] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Move mwscript wrapper from base image to copy on build - T391665
[11:11:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:15:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:18:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75100 and previous config saved to /var/cache/conftool/dbconfig/20250416-111759-fceratto.json
[11:18:03] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:18:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[11:18:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75101 and previous config saved to /var/cache/conftool/dbconfig/20250416-111822-fceratto.json
[11:19:37] <wikibugs>	 (CR) Hnowlan: [C:+2] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: Hnowlan)
[11:20:23] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747041 (Ladsgroup) Open→Resolved
[11:21:09] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: Hnowlan)
[11:21:23] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/922c734ba2d3515515e7e0c69be9fcf04f1bc210092cb07b58fc3729e51d4cd6/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[11:26:03] <wikibugs>	 (CR) Nikerabbit: [C:+1] Add channel for ContentTranslation logging [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) (owner: Abijeet Patro)
[11:26:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:27:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:29:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75102 and previous config saved to /var/cache/conftool/dbconfig/20250416-112948-fceratto.json
[11:29:52] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:30:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10747081 (cmooney) >>! In T392007#10745165, @Jclark-ctr wrote: > @RobH  we have 1 free cross connect circuit id 21996480.  but have plenty of room for additional p...
[11:32:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:36:57] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [homer/public] - https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[11:37:02] <wikibugs>	 (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[11:37:37] <jelto>	 !log temporarily disable query sites on miscweb vms - T350793
[11:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:40] <stashbot>	 T350793: move query.wikidata.org to kubernetes - https://phabricator.wikimedia.org/T350793
[11:37:45] <claime>	 26 minutes for a full image push but IT WENT THROUGH.
[11:40:35] <wikibugs>	 (PS1) Volans: doc: expand logging documentation [software/spicerack] - https://gerrit.wikimedia.org/r/1136984
[11:41:15] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136985
[11:41:23] <icinga-wm>	 RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[11:41:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:41:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:42:41] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:-1] "I like this idea! Thanks for working on this." [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[11:43:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:44:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P75103 and previous config saved to /var/cache/conftool/dbconfig/20250416-114455-fceratto.json
[11:45:12] <jinxer-wm>	 FIRING: ProbeDown: Service miscweb2003:443 has failed probes (http_query_scholarly_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:26] <jelto>	 ^ miscweb alert is expected, I'll silence this
[11:46:31] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[11:47:03] <wikibugs>	 (Merged) jenkins-bot: Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[11:48:21] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136965 (owner: PipelineBot)
[11:48:21] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1135706 (owner: PipelineBot)
[11:48:21] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136366 (owner: PipelineBot)
[11:48:21] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136791 (owner: PipelineBot)
[11:48:22] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136789 (owner: PipelineBot)
[11:48:43] <hnowlan>	 restbase1045-b is actually down - but also not in the cluster? 
[11:48:46] <wikibugs>	 (CR) Tacsipacsi: [C:+1] "In T391297#10737100, it was highlighted that this is a regression (caused by I39d1d1f45c017e6522f71979c8ad70ae2b00c333). Given this, I’m f" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[11:48:46] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136985 (owner: PipelineBot)
[11:49:34] <hnowlan>	 oh, restbase1045-b is possibly yet to be bootstrapped cc urandom 
[11:50:15] <wikibugs>	 (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136985 (owner: PipelineBot)
[11:50:38] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1127760 (owner: PipelineBot)
[11:50:39] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1133976 (owner: PipelineBot)
[11:50:39] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132619 (owner: PipelineBot)
[11:50:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:50:39] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132653 (owner: PipelineBot)
[11:50:40] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1131799 (owner: PipelineBot)
[11:50:41] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1123332 (owner: PipelineBot)
[11:50:45] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1126707 (owner: PipelineBot)
[11:50:49] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1125137 (owner: PipelineBot)
[11:50:53] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1124441 (owner: PipelineBot)
[11:50:57] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1114365 (owner: PipelineBot)
[11:51:01] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1112740 (owner: PipelineBot)
[11:51:05] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1092223 (owner: PipelineBot)
[11:51:09] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1100203 (owner: PipelineBot)
[11:51:13] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1105689 (owner: PipelineBot)
[11:51:17] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1111688 (owner: PipelineBot)
[11:51:21] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1088280 (owner: PipelineBot)
[11:51:24] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:51:25] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1083795 (owner: PipelineBot)
[11:51:29] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1082760 (owner: PipelineBot)
[11:51:33] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1077698 (owner: PipelineBot)
[11:51:37] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1079997 (owner: PipelineBot)
[11:51:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:51:41] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1066744 (owner: PipelineBot)
[11:51:42] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aphlict2001.codfw.wmnet with reason: Bookworm Re-image
[11:51:45] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1068755 (owner: PipelineBot)
[11:51:49] <wikibugs>	 (Abandoned) Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1070272 (owner: PipelineBot)
[11:51:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:51:53] <wikibugs>	 (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[11:52:01] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:52:01] <wikibugs>	 (PS1) Cyndywikime: Growth: Configure higher Impact Module edit limits for testwiki pilot [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599)
[11:52:31] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host aphlict2001.codfw.wmnet with OS bookworm
[11:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:57:12] <wikibugs>	 (PS1) Slyngshede: Netbox: simplify query [alerts] - https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694)
[11:57:31] <wikibugs>	 (PS1) Cathal Mooney: WMCS: fix typo in updated cloud-in policy [homer/public] - https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725)
[11:57:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:57:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:58:07] <wikibugs>	 (PS2) Cyndywikime: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599)
[11:59:13] <wikibugs>	 (CR) Cathal Mooney: [C:+2] WMCS: fix typo in updated cloud-in policy [homer/public] - https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[11:59:50] <wikibugs>	 (Merged) jenkins-bot: WMCS: fix typo in updated cloud-in policy [homer/public] - https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[12:00:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P75104 and previous config saved to /var/cache/conftool/dbconfig/20250416-120002-fceratto.json
[12:00:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:00:54] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Move mwscript wrapper from base image to copy on build - T391665 (duration: 50m 43s)
[12:00:57] <stashbot>	 T391665: Move mwscript wrapper from base image to copy on build - https://phabricator.wikimedia.org/T391665
[12:04:30] <wikibugs>	 (PS1) Jgiannelos: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033)
[12:04:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:05:11] <wikibugs>	 (PS4) Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510)
[12:05:17] <wikibugs>	 (CR) Ladsgroup: [C:+2] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) (owner: Michael Große)
[12:05:21] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) (owner: Michael Große)
[12:05:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:05:41] <wikibugs>	 (PS2) Jgiannelos: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033)
[12:05:42] <wikibugs>	 (CR) CI reject: [V:-1] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: Jgiannelos)
[12:05:45] <wikibugs>	 (PS2) Slyngshede: Netbox: simplify query [alerts] - https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694)
[12:05:49] <wikibugs>	 (PS1) Cathal Mooney: WMCS: fix typo in updated cloud-in policy #2 [homer/public] - https://gerrit.wikimedia.org/r/1136993 (https://phabricator.wikimedia.org/T364725)
[12:06:15] <wikibugs>	 (PS2) Hnowlan: trafficserver: route various miscellaneous pcs services to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1136676 (https://phabricator.wikimedia.org/T385033)
[12:06:32] <wikibugs>	 (CR) Cathal Mooney: [C:+2] WMCS: fix typo in updated cloud-in policy #2 [homer/public] - https://gerrit.wikimedia.org/r/1136993 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[12:07:10] <wikibugs>	 (Merged) jenkins-bot: WMCS: fix typo in updated cloud-in policy #2 [homer/public] - https://gerrit.wikimedia.org/r/1136993 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[12:07:10] <wikibugs>	 (PS3) Cyndywikime: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599)
[12:08:02] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[12:09:25] <wikibugs>	 (PS1) Clément Goubert: php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665)
[12:09:40] <wikibugs>	 (CR) Hnowlan: [C:+1] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: Jgiannelos)
[12:10:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:11:11] <wikibugs>	 (Abandoned) Clément Goubert: growthexperiments: Disable updatementeedata on s6 [puppet] - https://gerrit.wikimedia.org/r/1135988 (owner: Clément Goubert)
[12:11:32] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[12:11:33] <wikibugs>	 (CR) Jgiannelos: [C:+2] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: Jgiannelos)
[12:11:35] <wikibugs>	 (CR) Clément Goubert: [C:+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[12:12:45] <wikibugs>	 (PS4) Cyndywikime: Growth: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599)
[12:13:05] <wikibugs>	 (Merged) jenkins-bot: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: Jgiannelos)
[12:13:33] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:13:42] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:14:15] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:14:26] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:14:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:15:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75106 and previous config saved to /var/cache/conftool/dbconfig/20250416-121509-fceratto.json
[12:15:14] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:15:26] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[12:15:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75107 and previous config saved to /var/cache/conftool/dbconfig/20250416-121532-fceratto.json
[12:17:29] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:17:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75108 and previous config saved to /var/cache/conftool/dbconfig/20250416-121742-fceratto.json
[12:17:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:18:34] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:36] <wikibugs>	 (PS1) Cathal Mooney: WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283)
[12:19:16] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:54] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:20:02] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Exim
[12:20:06] <icinga-wm>	 PROBLEM - SSH on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:20:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [homer/public] - https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:20:43] <wikibugs>	 (CR) Cathal Mooney: [C:+2] WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:21:14] <wikibugs>	 (Merged) jenkins-bot: WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:23:11] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[12:23:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:23:39] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[12:23:50] <wikibugs>	 (PS1) Cathal Mooney: WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283)
[12:24:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:25:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-updateMenteeData-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:26:30] <wikibugs>	 (PS1) Hnowlan: deployment_server: ignore overlayfs when checking disk space [puppet] - https://gerrit.wikimedia.org/r/1136997
[12:26:31] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me, thank you!" [cookbooks] - https://gerrit.wikimedia.org/r/1136841 (owner: Volans)
[12:26:34] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:27:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:27:27] <wikibugs>	 (CR) Cathal Mooney: [C:+2] WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:27:54] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:27:59] <wikibugs>	 (Merged) jenkins-bot: WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: Cathal Mooney)
[12:32:46] <wikibugs>	 (CR) Cyndywikime: "This patch is ready for review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: Cyndywikime)
[12:32:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P75109 and previous config saved to /var/cache/conftool/dbconfig/20250416-123248-fceratto.json
[12:34:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:36:51] <wikibugs>	 (PS1) Vgutierrez: varnish: Add basic edge uniques handling [puppet] - https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411)
[12:37:56] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aphlict2001.codfw.wmnet with OS bookworm
[12:38:16] <wikibugs>	 (CR) CI reject: [V:-1] varnish: Add basic edge uniques handling [puppet] - https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[12:38:51] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747323 (Jdforrester-WMF)
[12:39:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[12:41:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:43:36] <wikibugs>	 (CR) Clément Goubert: [C:+1] deployment_server: ignore overlayfs when checking disk space [puppet] - https://gerrit.wikimedia.org/r/1136997 (owner: Hnowlan)
[12:43:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[12:47:44] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:47:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P75111 and previous config saved to /var/cache/conftool/dbconfig/20250416-124755-fceratto.json
[12:47:56] <icinga-wm>	 RECOVERY - SSH on lists1004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:47:58] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[12:48:06] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:48:22] <wikibugs>	 (CR) Volans: [C:+2] CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136841 (owner: Volans)
[12:48:24] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:48:26] <wikibugs>	 (PS2) Volans: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136841
[12:48:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:50:16] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:51:08] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] etcd: replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1129177 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[12:52:35] <wikibugs>	 (PS1) Cathal Mooney: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725)
[12:55:27] <wikibugs>	 (PS2) Cathal Mooney: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725)
[12:57:36] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [homer/public] - https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[12:57:59] <wikibugs>	 (CR) Cathal Mooney: [C:+2] WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[12:58:05] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] webperf: Move `php_admin_flag engine on` from subdir to docroot [puppet] - https://gerrit.wikimedia.org/r/1130211 (owner: Krinkle)
[12:58:32] <wikibugs>	 (Merged) jenkins-bot: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: Cathal Mooney)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1300).
[13:00:04] <jouncebot>	 HouseOfM and tto: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:33] <tto>	 Greetings!
[13:00:38] <wikibugs>	 (CR) Hnowlan: [C:+2] deployment_server: ignore overlayfs when checking disk space [puppet] - https://gerrit.wikimedia.org/r/1136997 (owner: Hnowlan)
[13:00:43] <Lucas_WMDE>	 o/
[13:01:05] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Netbox: simplify query [alerts] - https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[13:01:26] <Lucas_WMDE>	 I can deploy, but I wouldn’t mind if someone else does it
[13:01:50] <HouseOfM>	 o/ greetings
[13:03:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75112 and previous config saved to /var/cache/conftool/dbconfig/20250416-130303-fceratto.json
[13:03:07] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:03:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[13:03:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75113 and previous config saved to /var/cache/conftool/dbconfig/20250416-130326-fceratto.json
[13:03:37] <wikibugs>	 (CR) Filippo Giunchedi: alertmanager: update irc template for pyrra slo alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: Herron)
[13:03:42] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "+1 as this restores a `strtolower()` that was already present prior to I39d1d1f45c." [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[13:04:12] <Lucas_WMDE>	 alright, I can deploy
[13:04:21] <tto>	 :o thanks Lucas_WMDE!
[13:04:22] <Lucas_WMDE>	 and I’ll start with tto’s change since HouseOfM’s still has an open comment
[13:04:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo)
[13:04:44] <HouseOfM>	 it does? I hadn't seen that! thx
[13:04:47] <wikibugs>	 (03PS1) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073)
[13:05:01] <Lucas_WMDE>	 yeah, I looked at it earlier today
[13:05:19] <Lucas_WMDE>	 haven’t had the time yet to fully confirm but I think all the core-Permissions.php changes are unnecessary
[13:05:27] <Lucas_WMDE>	 since CampaignEvents configures a group by default now
[13:05:31] <wikibugs>	 (03Merged) 10jenkins-bot: search-redirect: fix case-sensitivity of project name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo)
[13:05:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75114 and previous config saved to /var/cache/conftool/dbconfig/20250416-130536-fceratto.json
[13:05:47] <Lucas_WMDE>	 IIUC the only core-Permissions.php entries that are left related to CampaignEvents are for nonstandard situations, like test wikis where all users should have those permissions
[13:05:58] <wikibugs>	 (03PS2) 10Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661)
[13:06:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]]
[13:06:04] <stashbot>	 T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297
[13:06:10] <wikibugs>	 (03CR) 10Jelto: [C:03+2] make helm3 alternative entry dependent on helm [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto)
[13:10:04] <wikibugs>	 (03PS3) 10Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805)
[13:10:31] <HouseOfM>	 You are correct @Lucas_WMDE I've made the relevant change
[13:10:39] <Lucas_WMDE>	 nice
[13:10:52] <Lucas_WMDE>	 we can see what the userrights API reports on mwdebug :)
[13:10:53] <wikibugs>	 (03PS1) 10Ssingh: utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007
[13:11:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh)
[13:12:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-2] "The current behaviour is deliberately chosen for a reason: we want to know the full diff compared to what is in production right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar)
[13:15:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:16:55] <wikibugs>	 (03PS2) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[13:16:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 wargo, lucaswerkmeister-wmde: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:17:03] <stashbot>	 T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297
[13:17:27] <wikibugs>	 (03PS2) 10Ssingh: utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007
[13:17:42] <Lucas_WMDE>	 tto: please test with WikimediaDebug :)
[13:18:09] <Lucas_WMDE>	 (I assume it should still work for docroot/wwwportal stuff)
[13:18:24] <wikibugs>	 (03CR) 10Tiziano Fogli: profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:18:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:18:49] <godog>	 !log bounce thanos on titan100* - overload
[13:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:56] <tto>	 OK, will do...
[13:19:38] <Lucas_WMDE>	 https://www.wikipedia.org/search-redirect.php?language=de&search=Test&family=Wiktionary seems to work for me (redirects to Wikipedia currently but Wiktionary with -H 'X-Wikimedia-Debug: backend=k8s-mwdebug')
[13:20:16] <tto>	 Can confirm working on k8s-mwdebug
[13:20:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 wargo, lucaswerkmeister-wmde: Continuing with sync
[13:20:27] <Lucas_WMDE>	 nice, thanks!
[13:20:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:20:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P75115 and previous config saved to /var/cache/conftool/dbconfig/20250416-132043-fceratto.json
[13:21:54] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:22:20] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:23:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:24:40] <godog>	 !log finish rollout of thanos 0.38 to prometheus* - T383966
[13:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:44] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[13:26:52] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks:
[13:26:52] <icinga-wm>	 er_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:26:54] <wikibugs>	 (03PS2) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[13:27:01] <wikibugs>	 (03CR) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[13:28:05] <wikibugs>	 (03CR) 10Mhorsey: Release campaignEvents extension to azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey)
[13:28:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]] (duration: 22m 55s)
[13:29:00] <stashbot>	 T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297
[13:29:28] <wikibugs>	 (03PS3) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[13:29:46] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:29:56] <tto>	 Lucas_WMDE I can confirm this is now working in production!
[13:29:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:30:00] <tto>	 Thanks for your assistance as ever
[13:31:03] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411)
[13:31:04] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:31:22] <tto>	 Goodnight all
[13:32:05] <elukey>	 Lucas_WMDE: o/ all good with the deployments so far right?
[13:32:40] <wikibugs>	 (03PS4) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[13:32:42] <Lucas_WMDE>	 elukey: yup
[13:33:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey)
[13:33:46] <Lucas_WMDE>	 I’m also in a meeting now, so might be a bit slow to respond to messages
[13:33:47] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:33:50] <Lucas_WMDE>	 hopefully the deployment will go smoothly
[13:34:15] <wikibugs>	 (03Merged) 10jenkins-bot: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey)
[13:34:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]]
[13:34:42] <stashbot>	 T390805: Enable CampaignEvents Extension on azwiki - https://phabricator.wikimedia.org/T390805
[13:34:53] <elukey>	 yep yep, ping me if needed
[13:35:02] <elukey>	 it seems that the 5 minutes delay is working
[13:35:29] <wikibugs>	 (03CR) 10Hashar: "> The current behaviour is deliberately chosen for a reason: we want to know the full diff compared to what is in production right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar)
[13:35:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P75116 and previous config saved to /var/cache/conftool/dbconfig/20250416-133552-fceratto.json
[13:38:59] <Lucas_WMDE>	 ah, the sleep is hidden in build-and-push-container-images ^^
[13:39:07] <wikibugs>	 (03CR) 10Elukey: [C:03+1] __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans)
[13:39:52] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans)
[13:40:00] <wikibugs>	 (03CR) 10Elukey: [C:03+1] I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans)
[13:41:25] <Lucas_WMDE>	 yay, sleep done
[13:43:36] <elukey>	 it should tell you something in the scap log though
[13:43:46] <elukey>	 there is also a "Sorry" :D
[13:43:47] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:43:54] <Lucas_WMDE>	 elukey: that’s only in the output file
[13:44:00] <Lucas_WMDE>	 13:34:59 Started build-and-push-container-images
[13:44:00] <Lucas_WMDE>	 13:34:59 K8s images build/push output redirected to /home/lucaswerkmeister-wmde/scap-image-build-and-push-log
[13:44:00] <Lucas_WMDE>	 13:41:07 Finished build-and-push-container-images (duration: 06m 08s)
[13:44:08] <Lucas_WMDE>	 and once I looked at that file I saw the “sorry”
[13:44:19] <elukey>	 ahhh right right
[13:44:36] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "we need to return `X-Cache: hostname int` and `X-Cache-Status: int-tls` here as well" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur)
[13:44:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:45:02] <stashbot>	 T390805: Enable CampaignEvents Extension on azwiki - https://phabricator.wikimedia.org/T390805
[13:45:04] <Lucas_WMDE>	 HouseOfM: please test :)
[13:45:32] <Lucas_WMDE>	 user rights look promising to me fwiw
[13:45:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[13:46:58] <HouseOfM>	 LGTM :)
[13:47:01] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Continuing with sync
[13:47:02] <Lucas_WMDE>	 yay
[13:47:45] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[13:48:23] <wikibugs>	 (03CR) 10Elukey: [C:03+1] I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans)
[13:49:11] <wikibugs>	 (03CR) 10Herron: [C:03+1] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[13:50:36] <wikibugs>	 07sre-alert-triage, 06SRE Observability, 06Traffic: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T392091 (10LSobanski) 03NEW
[13:50:37] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10747701 (10Jgreen)
[13:50:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75117 and previous config saved to /var/cache/conftool/dbconfig/20250416-135059-fceratto.json
[13:51:03] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:51:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:51:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75118 and previous config saved to /var/cache/conftool/dbconfig/20250416-135121-fceratto.json
[13:51:58] <jelto>	 !log "Imported helm311 3.11.3-4 to bullseye-wikimedia and bookworm-wikimedia - T387548"
[13:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:02] <stashbot>	 T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548
[13:52:04] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[13:52:08] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[13:53:17] <wikibugs>	 (03PS2) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073)
[13:53:27] <wikibugs>	 (03CR) 10Fabfur: "Do you mean on every error request? In this case it's better to provide a separate configuration that will apply to every error generated " [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur)
[13:53:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094 (10cmooney) 03NEW p:05Triage→03Low
[13:53:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur)
[13:53:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]] (duration: 19m 09s)
[13:53:51] <stashbot>	 T390805: Enable CampaignEvents Extension on azwiki - https://phabricator.wikimedia.org/T390805
[13:54:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[13:54:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[13:54:51] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Enable row B for OpenSearch migration. [puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610)
[13:55:04] <wikibugs>	 (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[13:55:07] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1045.eqiad.wmnet with reason: Bootstrapping — T389423
[13:55:09] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:55:10] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[13:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:32] <HouseOfM>	 tysm Lucas_WMDE.
[13:56:36] <wikibugs>	 (03PS3) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073)
[13:57:35] <claime>	 Lucas_WMDE: yeah, I didn't have time to patch scap to add the logging in there, only to the build script, sorry
[13:57:43] <wikibugs>	 (03CR) 10Vgutierrez: "every response generated by haproxy needs to be flagged as `int-tls`" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur)
[13:57:48] <Lucas_WMDE>	 np ^^
[13:58:27] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#10747776 (10Jgreen)
[13:58:33] <wikibugs>	 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10747777 (10Jgreen)
[13:58:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10747774 (10Jgreen)
[13:58:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[13:58:47] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400)
[14:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:01:39] <wikibugs>	 (03PS3) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[14:01:44] <wikibugs>	 (03CR) 10Volans: [C:03+2] __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans)
[14:01:48] <wikibugs>	 (03PS2) 10Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925)
[14:02:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75119 and previous config saved to /var/cache/conftool/dbconfig/20250416-140228-fceratto.json
[14:02:32] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:03:34] <wikibugs>	 (03PS1) 10Jelto: make helm3 alternative entry dependent on helm [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548)
[14:04:15] <wikibugs>	 (03PS2) 10Volans: I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837
[14:04:23] <wikibugs>	 (03CR) 10Volans: [C:03+2] I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans)
[14:04:31] <wikibugs>	 (03CR) 10Jelto: "similar change for `helm317`" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto)
[14:04:47] <wikibugs>	 07Puppet: Add PATCH method to Wmflib::HTTP::Method - https://phabricator.wikimedia.org/T392096 (10Fabfur) 03NEW
[14:04:59] <wikibugs>	 (03PS3) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925)
[14:05:15] <wikibugs>	 (03CR) 10Elukey: [C:03+1] ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839 (owner: 10Volans)
[14:05:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron)
[14:05:41] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans)
[14:06:02] <wikibugs>	 (03PS4) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925)
[14:06:19] <wikibugs>	 (03CR) 10Elukey: [C:03+1] DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans)
[14:06:51] <wikibugs>	 (03CR) 10Elukey: [C:03+1] DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans)
[14:07:03] <wikibugs>	 (03CR) 10Herron: alertmanager: update irc template for pyrra slo alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron)
[14:07:30] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] cirrussearch: Enable row B for OpenSearch migration. [puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:08:00] <wikibugs>	 (03CR) 10DCausse: [C:03+1] cirrussearch: Enable row B for OpenSearch migration. [puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:08:03] <wikibugs>	 (03PS3) 10Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925)
[14:08:03] <wikibugs>	 (03PS4) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[14:08:10] <wikibugs>	 (03Merged) 10jenkins-bot: __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans)
[14:08:25] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Enable row B for OpenSearch migration. [puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:11:05] <wikibugs>	 (03Merged) 10jenkins-bot: I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans)
[14:11:48] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[14:13:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012
[14:14:50] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013
[14:14:56] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012 (https://phabricator.wikimedia.org/T391408)
[14:15:30] <wikibugs>	 (03CR) 10Herron: "would there be a downside to pushing this even further to say 30+ days essentially to run forcemerge only on the hdd nodes?" [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi)
[14:15:48] <wikibugs>	 (03CR) 10Volans: [C:03+2] I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans)
[14:16:05] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - brouberol@cumin2002 - T388610
[14:16:08] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:16:43] <wikibugs>	 (03CR) 10Krinkle: "This was committed anonymously in  Thu 14 Mar 2024 without a change-id." [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[14:17:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[14:17:25] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2071.codfw.wmnet on all recursors
[14:17:29] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2071.codfw.wmnet on all recursors
[14:17:29] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2099.codfw.wmnet on all recursors
[14:17:32] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2099.codfw.wmnet on all recursors
[14:17:33] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2101.codfw.wmnet on all recursors
[14:17:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P75120 and previous config saved to /var/cache/conftool/dbconfig/20250416-141735-fceratto.json
[14:17:36] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2101.codfw.wmnet on all recursors
[14:18:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[14:18:30] <wikibugs>	 (03CR) 10Krinkle: "I don't know what volatile means in this context but https://gerrit.wikimedia.org/r/q/project:operations/puppet+message:%22puppetserver::volati" [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[14:20:43] <wikibugs>	 (03PS5) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[14:20:44] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925)
[14:21:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[14:21:47] <wikibugs>	 (03Merged) 10jenkins-bot: I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans)
[14:22:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012 (https://phabricator.wikimedia.org/T391408) (owner: 10Arturo Borrero Gonzalez)
[14:22:06] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech3_amd64.changes: T205378
[14:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:10] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[14:23:28] <wikibugs>	 (03PS5) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[14:24:53] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[14:26:20] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[14:26:26] <wikibugs>	 (03CR) 10BBlack: [C:03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh)
[14:26:42] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[14:26:46] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:26:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[14:26:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh)
[14:27:27] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:27:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:27:58] <wikibugs>	 (03PS2) 10Elukey: profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925)
[14:27:58] <wikibugs>	 (03PS6) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[14:28:34] <icinga-wm>	 PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[14:29:29] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5310/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[14:29:55] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:31:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron)
[14:32:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P75121 and previous config saved to /var/cache/conftool/dbconfig/20250416-143242-fceratto.json
[14:33:09] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[14:33:51] <claime>	 jouncebot: nowandnext
[14:33:51] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400)
[14:33:52] <jouncebot>	 In 2 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1700)
[14:37:36] <wikibugs>	 (03CR) 10FNegri: [C:03+2] openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) (owner: 10FNegri)
[14:37:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748105 (10cmooney)
[14:38:18] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "Confirmed with @claime on IRC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:38:21] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:38:57] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[14:39:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748126 (10cmooney) These are the two for codfw: ` ip route add vrf vrf-cloudgw blackhole 172.16.128.0/17 metric 9999 ip route add vrf vrf-cloudgw blackhole 2a02:...
[14:39:12] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey)
[14:40:17] <godog>	 jouncebot: now and next
[14:40:17] <jouncebot>	 For the next 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400)
[14:40:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[14:40:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170)
[14:40:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[14:41:09] <wikibugs>	 (03PS2) 10Majavah: Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542
[14:41:47] <elukey>	 taavi: \o/ o/
[14:41:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah)
[14:41:58] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah)
[14:42:31] <wikibugs>	 (03Merged) 10jenkins-bot: Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah)
[14:44:57] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:45:17] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:47:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75122 and previous config saved to /var/cache/conftool/dbconfig/20250416-144750-fceratto.json
[14:47:54] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:48:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:48:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:04-2] "> That was deemed a problem in T387781" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar)
[14:49:38] <godog>	 jouncebot: now and next
[14:49:38] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400)
[14:50:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "tbh I don't know, though off the top of my head I don't see why not, except maybe forcemerge performance on hdd might be costly? we'll nee" [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi)
[14:51:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Add zarcillo k8s service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:52:39] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:52:47] <wikibugs>	 (03PS1) 10FNegri: wikireplicas: maintain-views should not create _p db [puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105)
[14:53:06] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:53:14] <wikibugs>	 (03PS6) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:53:19] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:53:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:53:37] <wikibugs>	 (03CR) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:53:40] <wikibugs>	 (03PS7) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:53:50] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:54:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:54:14] <wikibugs>	 (03PS8) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:55:01] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[14:56:26] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira)
[14:57:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[14:57:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75123 and previous config saved to /var/cache/conftool/dbconfig/20250416-145718-fceratto.json
[14:57:22] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:59:01] <wikibugs>	 (03PS3) 10Scott French: PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536)
[14:59:01] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French)
[14:59:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75124 and previous config saved to /var/cache/conftool/dbconfig/20250416-145928-fceratto.json
[15:00:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:00:55] <wikibugs>	 (03PS9) 10Brouberol: Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis)
[15:01:11] <wikibugs>	 (03PS7) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350)
[15:01:57] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10748293 (10fnegri) Thanks @Jclark-ctr, do you think there is a way to disable the sensor so that it will not trigger the alert? We could also sile...
[15:02:10] <wikibugs>	 (03PS1) 10Kamila Součková: Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020
[15:02:29] <wikibugs>	 (03PS2) 10Kamila Součková: Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020
[15:03:41] <wikibugs>	 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10748298 (10fgiunchedi)
[15:03:49] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:04:20] <wikibugs>	 (03PS5) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:04:22] <wikibugs>	 (03PS1) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378)
[15:05:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10748300 (10RobH) 05Open→03Stalled Please note this is stalled while the evaluation of D6 is performed.  , please see T392007 and...
[15:05:19] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[15:06:38] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+2] Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:09] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:10:34] <wikibugs>	 (03PS1) 10Ssingh: utils/type65: fix typo s/bas64/base64 [dns] - 10https://gerrit.wikimedia.org/r/1137022
[15:13:38] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[15:14:09] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] "Fixing typo, no code change." [dns] - 10https://gerrit.wikimedia.org/r/1137022 (owner: 10Ssingh)
[15:14:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P75125 and previous config saved to /var/cache/conftool/dbconfig/20250416-151438-fceratto.json
[15:14:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10748338 (10phaultfinder)
[15:14:41] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[15:15:09] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020 (owner: 10Kamila Součková)
[15:16:31] <wikibugs>	 (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans)
[15:17:14] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[15:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-updateMenteeData-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:26] <wikibugs>	 (03Merged) 10jenkins-bot: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans)
[15:26:40] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey)
[15:27:48] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, Andrew!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira)
[15:27:54] <wikibugs>	 (03CR) 10Ssingh: [C:04-2] "DO NOT MERGE until April 24, week of deploy." [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:29:16] <wikibugs>	 (03Merged) 10jenkins-bot: eventstreams: expose RRLA event stream publicly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira)
[15:29:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P75126 and previous config saved to /var/cache/conftool/dbconfig/20250416-152945-fceratto.json
[15:30:42] <wikibugs>	 (03PS1) 10Herron: Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027
[15:32:52] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[15:32:56] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[15:32:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron)
[15:33:24] <wikibugs>	 (03PS1) 10Dwisehaupt: hiera: acme_chief: add community-crm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715)
[15:34:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10748444 (10phaultfinder)
[15:34:50] <wikibugs>	 (03CR) 10Dwisehaupt: "Here is the acmechief stanza I believe we need. It is using community-crm instead of the crm role since that is the public name of the ser" [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:35:50] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:20] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] hiera: acme_chief: add community-crm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[15:37:29] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: connect to address 10.192.0.18 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[15:42:46] <hnowlan>	 MichaelG_WMF: looks like there might be some failures for mediawiki_job_growthexperiments-updateMenteeData-s1.service
[15:43:17] <MichaelG_WMF>	 hnowlan: meh. Where do you see them?
[15:44:08] <hnowlan>	 MichaelG_WMF: there have been one or two SystemdUnitFailed messages in here, at 15:20 and 12:25. Haven't looked further
[15:44:38] * MichaelG_WMF scrolls up
[15:44:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75127 and previous config saved to /var/cache/conftool/dbconfig/20250416-154452-fceratto.json
[15:44:56] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:45:08] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[15:45:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[15:45:14] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[15:45:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75128 and previous config saved to /var/cache/conftool/dbconfig/20250416-154515-fceratto.json
[15:46:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2070 to cirrussearch2070
[15:46:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:47:27] <MichaelG_WMF>	 hnowlan: I'm seeing it now, thanks. Though haven't found them yet in logstash
[15:48:19] <wikibugs>	 (03PS1) 10Hnowlan: mw:periodic_job:kubernetes: fail when job name in kubernetes is too long [puppet] - 10https://gerrit.wikimedia.org/r/1137029
[15:49:42] <wikibugs>	 (03PS5) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782)
[15:50:11] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Revert^2 "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772 (owner: 10Ssingh)
[15:51:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2070 to cirrussearch2070 - bking@cumin2002"
[15:52:27] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133563 (owner: 10Ncmonitor)
[15:52:54] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133563 (owner: 10Ncmonitor)
[15:53:27] <wikibugs>	 (03PS2) 10Hnowlan: mw:periodic_job:kubernetes: fail when job name in kubernetes is too long [puppet] - 10https://gerrit.wikimedia.org/r/1137029
[15:53:32] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[15:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:55] <wikibugs>	 (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan)
[15:54:04] <MichaelG_WMF>	 hnowlan: or Amir1: any ideas for how to debug this? `systemctl list-units --state=failed` is not listing the unit
[15:54:37] <claime>	 MichaelG_WMF: lemme take a look
[15:54:48] <MichaelG_WMF>	 claime: thanks!
[15:54:50] <Amir1>	 I think you can find logs in /var/log/maint-name
[15:55:05] <wikibugs>	 (03PS2) 10Ssingh: Revert^2 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772
[15:55:06] <MichaelG_WMF>	 yes, looked at that, contains nothing helpful
[15:55:17] <wikibugs>	 (03CR) 10Ssingh: "Updated to durum3003 so the DEfO folks in IE can test." [puppet] - 10https://gerrit.wikimedia.org/r/1136772 (owner: 10Ssingh)
[15:55:29] <MichaelG_WMF>	 only that the job started for enwiki, but no error message or anything of the sort
[15:56:09] <Amir1>	 there is this
[15:56:12] <Amir1>	 https://www.irccloud.com/pastebin/EcbdjS5B/
[15:56:19] <claime>	  Main PID: 31703 (code=exited, status=0/SUCCESS)
[15:56:27] <claime>	 it worked correctly
[15:56:34] <claime>	 sudo systemctl status mediawiki_job_growthexperiments-updateMenteeData-s1.service
[15:56:38] <claime>	 [...]
[15:56:38] <Amir1>	 > Apr 16 15:18:28 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-updateMenteeData-s1.service: Current command vanished from the unit file, execution of the command list won't be resumed.
[15:56:43] <claime>	 Apr 16 15:55:16 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: enwiki:  Done. Took 2416 seconds.
[15:56:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75129 and previous config saved to /var/cache/conftool/dbconfig/20250416-155655-fceratto.json
[15:56:59] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:58:18] <MichaelG_WMF>	 claime: ok, when I looked minutes ago, the success message wasn't there yet XD
[15:58:36] <MichaelG_WMF>	 but then why the error messages here about the systemd unit having failed?
[15:58:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2070 to cirrussearch2070 - bking@cumin2002"
[15:58:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:58:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070
[15:58:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2070
[15:59:06] <MichaelG_WMF>	 Also, the thing posted by Amir1 sounds strange
[15:59:08] <MichaelG_WMF>	 > Apr 16 15:18:28 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-updateMenteeData-s1.service: Current command vanished from the unit file, execution of the command list won't be resumed.
[15:59:19] <wikibugs>	 (03PS1) 10Ahmon Dancy: spiderpig: Set global_cert_name on deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945)
[15:59:22] <Amir1>	 I have not seen this before
[15:59:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2070 to cirrussearch2070
[15:59:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet on all recursors
[15:59:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2070.codfw.wmnet on all recursors
[15:59:41] <claime>	 Me neither
[16:00:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2070.codfw.wmnet with OS bullseye
[16:00:14] <wikibugs>	 (03PS2) 10Ahmon Dancy: spiderpig: Set global_cert_name on deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945)
[16:00:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2070
[16:00:27] <wikibugs>	 (03CR) 10Elukey: [C:03+1] spicerack: enable IRC notification on user input [puppet] - 10https://gerrit.wikimedia.org/r/1136973 (owner: 10Volans)
[16:00:53] <wikibugs>	 (03CR) 10Elukey: [C:03+1] doc: expand logging documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136984 (owner: 10Volans)
[16:00:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10748625 (10VRiley-WMF) I have sent an email to them requesting an update on this. Awaiting response.
[16:01:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:01:30] <wikibugs>	 (CR) Ahmon Dancy: "This finalizes a change that was lurking on deployment-puppetserver-1.deployment-prep." [puppet] - https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[16:01:43] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - https://gerrit.wikimedia.org/r/1137027 (owner: Herron)
[16:02:04] <wikibugs>	 (PS1) Dwisehaupt: hiera: acme_chief: move community-crm to crm2001 [puppet] - https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715)
[16:03:21] <wikibugs>	 (CR) JHathaway: [C:+2] hiera: acme_chief: move community-crm to crm2001 [puppet] - https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[16:03:29] <wikibugs>	 (CR) JHathaway: [C:+2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[16:04:26] <wikibugs>	 (PS3) Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661)
[16:04:44] <wikibugs>	 (PS3) Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029
[16:04:46] <wikibugs>	 (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/1136604/5311/" [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[16:05:12] <wikibugs>	 (CR) CI reject: [V:-1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029 (owner: Hnowlan)
[16:05:37] <wikibugs>	 (CR) Filippo Giunchedi: "As discussed at the meeting, pushed forcemerge to 30d" [puppet] - https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: Filippo Giunchedi)
[16:05:57] <wikibugs>	 (PS4) Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029
[16:06:22] <wikibugs>	 (CR) CI reject: [V:-1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029 (owner: Hnowlan)
[16:06:59] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Shorten CronJob names [deployment-charts] - https://gerrit.wikimedia.org/r/1137033
[16:07:08] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply
[16:07:28] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[16:07:40] <wikibugs>	 (PS5) Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029
[16:07:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2070 - bking@cumin2002"
[16:07:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2070 - bking@cumin2002"
[16:07:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:07:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:07:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:07:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070
[16:08:23] <sukhe>	 !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1136772"'
[16:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:10] <wikibugs>	 (CR) Ssingh: [C:+2] Revert^2 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - https://gerrit.wikimedia.org/r/1136772 (owner: Ssingh)
[16:10:09] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137029 (owner: Hnowlan)
[16:10:12] <sukhe>	 !log stopping bird on durum3003 to temporarily disable advertising of anycast IPs
[16:10:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:29] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: testing ECH
[16:12:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P75132 and previous config saved to /var/cache/conftool/dbconfig/20250416-161202-fceratto.json
[16:13:14] <wikibugs>	 (CR) Hnowlan: [C:+1] mediawiki: Shorten CronJob names [deployment-charts] - https://gerrit.wikimedia.org/r/1137033 (owner: Clément Goubert)
[16:13:50] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Shorten CronJob names [deployment-charts] - https://gerrit.wikimedia.org/r/1137033 (owner: Clément Goubert)
[16:15:33] <wikibugs>	 (CR) Scott French: [C:+1] mediawiki: Shorten CronJob names [deployment-charts] - https://gerrit.wikimedia.org/r/1137033 (owner: Clément Goubert)
[16:15:45] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:15:53] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:16:04] <sukhe>	 ^ expected, host is depooled
[16:16:20] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Shorten CronJob names [deployment-charts] - https://gerrit.wikimedia.org/r/1137033 (owner: Clément Goubert)
[16:16:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:17:15] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:21] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:27] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:27] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:33] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:35] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:17:35] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:18:01] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[16:18:21] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cirrussearch2070
[16:18:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070
[16:18:34] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cirrussearch2070
[16:18:34] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.10
[16:18:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:20:48] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.10 (duration: 03m 20s)
[16:21:50] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:21:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=93) for host cirrussearch2070
[16:21:55] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2070.codfw.wmnet with OS bullseye
[16:22:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2095.codfw.wmnet on all recursors
[16:22:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2095.codfw.wmnet on all recursors
[16:22:13] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029 (owner: Hnowlan)
[16:22:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2110.codfw.wmnet on all recursors
[16:22:25] <wikibugs>	 (PS1) Fabfur: wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096)
[16:22:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2110.codfw.wmnet on all recursors
[16:22:45] <wikibugs>	 (CR) Hnowlan: [C:+2] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - https://gerrit.wikimedia.org/r/1137029 (owner: Hnowlan)
[16:23:01] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[16:23:04] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[16:24:15] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 379664 seconds left:Certificate *.wikimania.com valid until 2025-05-20 06:53:14 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:21] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 211598 seconds left:Certificate wikipedia.com valid until 2025-05-29 22:00:27 +0000 (expires in 43 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:27] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 204332 seconds left:Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:27] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 323492 seconds left:Certificate *.wikispecies.net valid until 2025-05-20 04:52:46 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:33] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 326306 seconds left:Certificate wikimedia.is valid until 2025-06-05 06:20:49 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:35] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-8 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimediacommons.uk has 190344 seconds left:Certificate wikimediacommons.uk valid until 2025-07-01 19:46:04 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:24:35] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-7 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.ro has 235764 seconds left:Certificate wikipedia.ro valid until 2025-07-01 19:44:46 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:25:01] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 318178 seconds left:Certificate *.wikipedia.bg valid until 2025-06-07 02:21:46 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Ncredir
[16:25:54] <wikibugs>	 (CR) Ssingh: [C:+1] wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) (owner: Fabfur)
[16:26:47] <wikibugs>	 (CR) BCornwall: [C:+1] wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) (owner: Fabfur)
[16:27:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P75133 and previous config saved to /var/cache/conftool/dbconfig/20250416-162709-fceratto.json
[16:28:00] <wikibugs>	 (CR) BCornwall: [C:+1] wdqs-internal: remove disc records [dns] - https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: Ryan Kemper)
[16:31:35] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:32:08] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[16:32:17] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:32:31] <wikibugs>	 (PS1) Dwisehaupt: Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1137037
[16:32:42] <wikibugs>	 (PS4) Fabfur: cache,haproxy: allowed methods check and set response headers [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073)
[16:32:55] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[16:32:56] <wikibugs>	 (CR) CI reject: [V:-1] Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1137037 (owner: Dwisehaupt)
[16:33:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2070.codfw.wmnet with OS bullseye
[16:33:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2070
[16:33:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:33:57] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:34:13] <wikibugs>	 (PS2) Dwisehaupt: Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715)
[16:34:29] <wikibugs>	 (CR) Dwisehaupt: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[16:34:35] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.919 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:34:47] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:35:07] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:35:20] <wikibugs>	 (CR) JHathaway: [C:+2] Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[16:35:45] <wikibugs>	 SRE, Data-Platform-SRE (2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10748841 (Dzahn) How about notifications for next time?
[16:36:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:36:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:36:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:36:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070
[16:36:12] <wikibugs>	 (PS6) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[16:36:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2070
[16:36:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2070
[16:36:35] <icinga-wm>	 PROBLEM - Host kafka-logging2005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:47] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[16:37:34] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[16:38:16] <wikibugs>	 (CR) Raymond Ndibe: "tested by execing into `toolforge-control-plane` on lima-kilo and everything works as expected. the index is tracking things properly and " [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[16:42:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75135 and previous config saved to /var/cache/conftool/dbconfig/20250416-164216-fceratto.json
[16:42:20] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:42:32] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[16:46:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:46:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:48:39] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:48:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:50:16] <wikibugs>	 (PS1) Clément Goubert: mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - https://gerrit.wikimedia.org/r/1137038
[16:51:11] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[16:51:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75136 and previous config saved to /var/cache/conftool/dbconfig/20250416-165118-fceratto.json
[16:51:22] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:53:59] <wikibugs>	 (CR) Scott French: [C:+1] mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - https://gerrit.wikimedia.org/r/1137038 (owner: Clément Goubert)
[16:54:05] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - https://gerrit.wikimedia.org/r/1137038 (owner: Clément Goubert)
[16:55:22] <wikibugs>	 (CR) Majavah: "Does this mean we should remove the `--drop` option from the script too?" [puppet] - https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: FNegri)
[16:56:25] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - https://gerrit.wikimedia.org/r/1137038 (owner: Clément Goubert)
[16:58:07] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.11
[16:58:39] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:14] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol1005.eqiad.wmnet - https://phabricator.wikimedia.org/T391413#10748937 (VRiley-WMF)
[16:59:28] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[16:59:44] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol1005.eqiad.wmnet - https://phabricator.wikimedia.org/T391413#10748943 (VRiley-WMF) Open→Resolved a:VRiley-WMF This is completed
[16:59:56] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] "This can be auto-picked up by the weekly rebuild, or we can do a full build tomorrow." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1700)
[17:00:32] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.11 (duration: 03m 02s)
[17:00:39] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] "Actually I don't think that's even needed since the image itself isn't changing, just the repo files." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[17:03:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75137 and previous config saved to /var/cache/conftool/dbconfig/20250416-170305-fceratto.json
[17:03:15] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:05:06] <wikibugs>	 (CR) Kevin Bazira: [C:+2] "archiving the deployment steps here: https://phabricator.wikimedia.org/P75134" [deployment-charts] - https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[17:07:31] <wikibugs>	 (CR) Herron: [C:+1] logstash: restore forcemerge in curator [puppet] - https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: Filippo Giunchedi)
[17:07:36] <wikibugs>	 (CR) FNegri: "I think that can still be useful, if we have to drop an entire wiki from clouddbs. I'm not sure if that ever happened, and what is the cur" [puppet] - https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: FNegri)
[17:09:11] <wikibugs>	 (PS6) Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782)
[17:09:26] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[17:09:31] <wikibugs>	 (CR) Clément Goubert: [C:+1] PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: Scott French)
[17:09:45] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:09:55] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:13:39] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:15:15] <wikibugs>	 (PS7) Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782)
[17:16:13] <wikibugs>	 (CR) Vgutierrez: [C:+1] cache,haproxy: allowed methods check and set response headers [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[17:16:46] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[17:17:09] <wikibugs>	 (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[17:18:07] <wikibugs>	 (CR) Fabfur: cache,haproxy: allowed methods check and set response headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[17:18:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P75138 and previous config saved to /var/cache/conftool/dbconfig/20250416-171813-fceratto.json
[17:18:36] <wikibugs>	 (CR) Fabfur: "Do you want to split this part into a separate MR?" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[17:21:50] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[17:25:20] <wikibugs>	 (CR) Hnowlan: [C:+2] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: Hnowlan)
[17:28:19] <wikibugs>	 (PS1) Bking: cirrussearch: fix row B regex [puppet] - https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610)
[17:29:19] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: fix row B regex [puppet] - https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[17:29:27] <wikibugs>	 (CR) Bking: [C:+2] "self-merging to prevent failed reimage" [puppet] - https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[17:33:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:33:11] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:33:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P75139 and previous config saved to /var/cache/conftool/dbconfig/20250416-173320-fceratto.json
[17:33:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage
[17:37:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage
[17:40:50] <wikibugs>	 (CR) Vgutierrez: [C:+1] "not really needed given it's the first time we start actively responding in the `tls` frontend" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[17:42:10] <wikibugs>	 (PS1) Bking: WIP: run puppet/restart ferm across DC after reimage [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610)
[17:44:48] <wikibugs>	 (PS1) Ssingh: Revert^3 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - https://gerrit.wikimedia.org/r/1137046
[17:45:08] <wikibugs>	 (CR) Ssingh: "Context: this worked but since it's a long weekend, we are reverting and will deploy again next week." [puppet] - https://gerrit.wikimedia.org/r/1137046 (owner: Ssingh)
[17:48:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75140 and previous config saved to /var/cache/conftool/dbconfig/20250416-174828-fceratto.json
[17:48:32] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:48:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:48:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[17:50:22] <wikibugs>	 (CR) Ssingh: [C:+2] Revert^3 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - https://gerrit.wikimedia.org/r/1137046 (owner: Ssingh)
[17:51:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 36431240 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:52:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5862000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:53:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:55:10] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bookworm
[17:58:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2148.codfw.wmnet with reason: Maintenance
[17:58:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[17:58:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75142 and previous config saved to /var/cache/conftool/dbconfig/20250416-175842-fceratto.json
[17:58:46] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:58:46] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:58:54] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:59:47] <sukhe>	 ^ expected
[17:59:49] <sukhe>	 reimaging
[17:59:51] <dduvall>	 James_F or Reedy: is there a fix in the works for https://phabricator.wikimedia.org/T392086 ?
[18:00:05] <jouncebot>	 dduvall and brennen: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1800)
[18:00:53] <brennen>	 o/
[18:01:29] <dduvall>	 brennen: howdy o/
[18:01:58] <dduvall>	 brennen: currently unsure if we can roll due to https://phabricator.wikimedia.org/T392086
[18:02:40] * brennen nods
[18:05:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:05:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2070.codfw.wmnet with OS bullseye
[18:08:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[18:09:02] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[18:09:38] <icinga-wm>	 RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 4.619 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[18:09:40] <icinga-wm>	 RECOVERY - Restbase root url on restbase1029 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[18:10:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:11:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75144 and previous config saved to /var/cache/conftool/dbconfig/20250416-181105-fceratto.json
[18:11:11] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:19:03] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage
[18:22:39] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage
[18:23:59] <dduvall>	 brennen: k. that task is no longer a blocker/UBN. rolling
[18:24:11] <brennen>	 ack, godspeed
[18:25:15] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220)
[18:25:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot)
[18:26:06] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot)
[18:26:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P75145 and previous config saved to /var/cache/conftool/dbconfig/20250416-182613-fceratto.json
[18:27:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:29:13] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10749402 (10VRiley-WMF)
[18:30:49] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10749408 (10VRiley-WMF) ms-fe1015 Rack E8 U 21 Port 17 CableID 240707900054  ms-fe1016 Rack F8 U 22 Port 17 CableID 240707900052
[18:37:26] <wikibugs>	 (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[18:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:40:56] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:41:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P75146 and previous config saved to /var/cache/conftool/dbconfig/20250416-184121-fceratto.json
[18:41:41] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10749423 (10Eevans) >>! In T391544#10746698, @MatthewVernon wrote: >>>! In T391544#10745829, @Eevans wrote: >>...
[18:41:46] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:41:57] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3003.esams.wmnet with OS bookworm
[18:42:52] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.25  refs T386220
[18:42:56] <stashbot>	 T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220
[18:44:10] <sukhe>	 !log re-enable puppet on A:durum
[18:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:51] <wikibugs>	 (03CR) 10Eevans: [C:03+1] DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans)
[18:53:02] <wikibugs>	 (03PS1) 10Ssingh: secret: rename ech-durum.pem [labs/private] - 10https://gerrit.wikimedia.org/r/1137051
[18:54:36] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] secret: rename ech-durum.pem [labs/private] - 10https://gerrit.wikimedia.org/r/1137051 (owner: 10Ssingh)
[18:56:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75147 and previous config saved to /var/cache/conftool/dbconfig/20250416-185628-fceratto.json
[18:56:32] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:56:44] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: Maintenance
[18:56:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75148 and previous config saved to /var/cache/conftool/dbconfig/20250416-185651-fceratto.json
[19:06:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2063 to cirrussearch2063
[19:06:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:07:34] <icinga-wm>	 PROBLEM - Disk space on an-worker1116 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 202800 MB (5% inode=99%): /var/lib/hadoop/data/m 222976 MB (5% inode=99%): /var/lib/hadoop/data/b 245413 MB (6% inode=99%): /var/lib/hadoop/data/c 223851 MB (5% inode=99%): /var/lib/hadoop/data/k 156034 MB (4% inode=99%): /var/lib/hadoop/data/i 184038 MB (4% inode=99%): /var/lib/hadoop/data/h 125055 MB (3% inode=99%): /var/lib/hadoop/data
[19:07:34] <icinga-wm>	 4 MB (5% inode=99%): /var/lib/hadoop/data/j 152553 MB (4% inode=99%): /var/lib/hadoop/data/d 156819 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops
[19:08:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75149 and previous config saved to /var/cache/conftool/dbconfig/20250416-190823-fceratto.json
[19:08:27] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:09:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10749567 (10VRiley-WMF) I received this as a response today   "After reviewing the debug logs and thermal data, we did not uncover any new information. It appears that the issue is self-correcting until it...
[19:10:25] <wikibugs>	 (03PS1) 10Vgutierrez: wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411)
[19:14:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[19:18:59] <wikibugs>	 (03PS1) 10Cwhite: logstash: drop out_request field [puppet] - 10https://gerrit.wikimedia.org/r/1137057 (https://phabricator.wikimedia.org/T390215)
[19:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:21:34] <swfrench-wmf>	 ^ parsoidtest1001 one shouldn't be there anymore - I'll take a look
[19:23:09] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: drop out_request field [puppet] - 10https://gerrit.wikimedia.org/r/1137057 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[19:23:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2063 to cirrussearch2063 - bking@cumin2002"
[19:23:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2063 to cirrussearch2063 - bking@cumin2002"
[19:23:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:23:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P75150 and previous config saved to /var/cache/conftool/dbconfig/20250416-192330-fceratto.json
[19:23:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2063
[19:23:35] <swfrench-wmf>	 dduvall: brennen: any objections if I sneak in a non-deploy (--stop-before-sync) scap run to pick up a make-container-image change?
[19:23:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2063
[19:24:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2063 to cirrussearch2063
[19:24:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2063.codfw.wmnet on all recursors
[19:24:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2063.codfw.wmnet on all recursors
[19:24:33] <brennen>	 swfrench-wmf: no objections here
[19:25:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2063.codfw.wmnet with OS bullseye
[19:25:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2063
[19:25:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:26:51] <swfrench-wmf>	 brennen: great, thank you!
[19:30:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2063 - bking@cumin2002"
[19:30:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2063 - bking@cumin2002"
[19:30:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:30:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2063.codfw.wmnet 108.16.192.10.in-addr.arpa 8.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:30:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2063.codfw.wmnet 108.16.192.10.in-addr.arpa 8.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:30:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2063
[19:30:23] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Test stop-before-sync scap run to pick up make-container-image changes - T390251
[19:30:28] <stashbot>	 T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[19:30:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2063
[19:30:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2063
[19:30:58] <logmsgbot>	 !log swfrench@deploy1003 Stopping before sync operations
[19:33:45] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[19:34:39] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[19:38:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P75151 and previous config saved to /var/cache/conftool/dbconfig/20250416-193838-fceratto.json
[19:40:01] <wikibugs>	 (03PS1) 10Dzahn: aptrepo: add jenkins to bookworm section in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127)
[19:44:53] <icinga-wm>	 PROBLEM - Disk space on an-worker1163 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 256348 MB (6% inode=99%): /var/lib/hadoop/data/b 237762 MB (6% inode=99%): /var/lib/hadoop/data/j 146035 MB (3% inode=99%): /var/lib/hadoop/data/l 165764 MB (4% inode=99%): /var/lib/hadoop/data/h 171807 MB (4% inode=99%): /var/lib/hadoop/data/i 122642 MB (3% inode=99%): /var/lib/hadoop/data/k 117285 MB (3% inode=99%): https://wikitech.wik
[19:44:53] <icinga-wm>	 rg/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1163&var-datasource=eqiad+prometheus/ops
[19:45:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage
[19:48:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage
[19:48:39] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:50:13] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Reading the docs, this seems like a reasonable change and should do as the commit message says." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853) (owner: 10DCausse)
[19:50:20] <wikibugs>	 (03PS2) 10Hashar: Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron)
[19:50:59] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Thank you for the cc: and I feel sorry it did not improve the current situation 😢" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron)
[19:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75152 and previous config saved to /var/cache/conftool/dbconfig/20250416-195345-fceratto.json
[19:54:02] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: Maintenance
[19:54:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75153 and previous config saved to /var/cache/conftool/dbconfig/20250416-195408-fceratto.json
[19:54:11] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:57:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe)
[19:57:44] <wikibugs>	 (03PS2) 10Scott French: P:parsoid::mediawiki: use installed PHP versions for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485)
[19:59:34] <wikibugs>	 (03PS1) 10Cwhite: logstash: also remove outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2000). Please do the needful.
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:01:31] <wikibugs>	 (03CR) 10Herron: [C:03+1] logstash: also remove outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[20:01:47] <icinga-wm>	 PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 138654 MB (3% inode=99%): /var/lib/hadoop/data/f 209556 MB (5% inode=99%): /var/lib/hadoop/data/j 119935 MB (3% inode=99%): /var/lib/hadoop/data/m 121225 MB (3% inode=99%): /var/lib/hadoop/data/h 200345 MB (5% inode=99%): /var/lib/hadoop/data/k 110428 MB (2% inode=99%): /var/lib/hadoop/data/e 166921 MB (4% inode=99%): /var/lib/hadoop/data
[20:01:47] <icinga-wm>	 0 MB (6% inode=99%): /var/lib/hadoop/data/b 209089 MB (5% inode=99%): /var/lib/hadoop/data/d 144571 MB (3% inode=99%): /var/lib/hadoop/data/i 141450 MB (3% inode=99%): /var/lib/hadoop/data/l 159229 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[20:01:55] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] aptrepo: add jenkins to bookworm section in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[20:02:28] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: also remove outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[20:03:39] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75154 and previous config saved to /var/cache/conftool/dbconfig/20250416-200437-fceratto.json
[20:04:41] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:06:07] <wikibugs>	 (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[20:09:11] <wikibugs>	 (03PS1) 10Cwhite: logstash: expand conditional [puppet] - 10https://gerrit.wikimedia.org/r/1137065 (https://phabricator.wikimedia.org/T390215)
[20:12:41] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[20:13:25] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: expand conditional [puppet] - 10https://gerrit.wikimedia.org/r/1137065 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[20:15:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2063.codfw.wmnet with OS bullseye
[20:19:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P75155 and previous config saved to /var/cache/conftool/dbconfig/20250416-201943-fceratto.json
[20:20:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2077 to cirrussearch2077
[20:20:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:22:02] <wikibugs>	 (03CR) 10Dzahn: "This all makes sense to me and looks good just the PHP versions it looks up in Hiera are still 7.4 (installed) and 7.2 (absented). Looking" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[20:25:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2077 to cirrussearch2077 - bking@cumin2002"
[20:26:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2077 to cirrussearch2077 - bking@cumin2002"
[20:26:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:26:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2077
[20:26:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2077
[20:27:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10749908 (10Eevans) >>! In T391903#10743696, @Jclark-ctr wrote: > @Eevans  This server is out of Warranty  We have  used drives from recently Decom servers please advise when and if you would like to replace....
[20:27:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2077 to cirrussearch2077
[20:27:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2077.codfw.wmnet on all recursors
[20:27:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2077.codfw.wmnet on all recursors
[20:28:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2077.codfw.wmnet with OS bullseye
[20:28:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2077
[20:28:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:33:37] <icinga-wm>	 PROBLEM - Disk space on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/i 143716 MB (3% inode=99%): /var/lib/hadoop/data/k 113100 MB (3% inode=99%): /var/lib/hadoop/data/h 192074 MB (5% inode=99%): /var/lib/hadoop/data/l 183653 MB (4% inode=99%): /var/lib/hadoop/data/e 217515 MB (5% inode=99%): /var/lib/hadoop/data/j 138981 MB (3% inode=99%): /var/lib/hadoop/data/c 130013 MB (3% inode=99%): https://wikitech.wik
[20:33:37] <icinga-wm>	 rg/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1088&var-datasource=eqiad+prometheus/ops
[20:33:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2077 - bking@cumin2002"
[20:33:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2077 - bking@cumin2002"
[20:33:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:33:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2077.codfw.wmnet 125.16.192.10.in-addr.arpa 5.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:33:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2077.codfw.wmnet 125.16.192.10.in-addr.arpa 5.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[20:33:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2077
[20:34:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2077
[20:34:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2077
[20:34:24] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Turn off Gobblin test jobs (all at once). [puppet] - 10https://gerrit.wikimedia.org/r/1137067 (https://phabricator.wikimedia.org/T390249)
[20:34:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P75156 and previous config saved to /var/cache/conftool/dbconfig/20250416-203450-fceratto.json
[20:42:34] <wikibugs>	 (03CR) 10Scott French: "Excellent question!" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[20:43:35] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:44:33] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:46:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[20:48:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage
[20:49:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10750017 (10Eevans) > sessionstore1006: > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN?  It //does// need to stay in row D, yes.  If the IP/V...
[20:49:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75157 and previous config saved to /var/cache/conftool/dbconfig/20250416-204957-fceratto.json
[20:50:01] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:50:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance
[20:51:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:52:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage
[20:53:31] <wikibugs>	 (CR) Dzahn: [C:+1] "after you pointed out to me that you are overriding the versions at the host name level in Hiera.. NEVERMIND :) lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: Scott French)
[20:55:49] <wikibugs>	 (CR) Scott French: [C:+2] P:parsoid::mediawiki: use installed PHP versions for auto-restart [puppet] - https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: Scott French)
[20:56:24] <wikibugs>	 (Abandoned) Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - https://gerrit.wikimedia.org/r/1097535 (https://phabricator.wikimedia.org/T379333) (owner: Ryan Kemper)
[20:56:46] <wikibugs>	 (Abandoned) Ryan Kemper: cirrus: (WIP) support rename elastic->cirrussearch [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[20:57:28] <wikibugs>	 (PS1) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[20:57:52] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:57:56] <wikibugs>	 (CR) CI reject: [V:-1] cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:58:17] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs-update-lag: don't count wdqs-categories lag [puppet] - https://gerrit.wikimedia.org/r/1133554 (owner: Ryan Kemper)
[20:58:39] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:59:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2204.codfw.wmnet with reason: Maintenance
[20:59:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T391056)', diff saved to https://phabricator.wikimedia.org/P75158 and previous config saved to /var/cache/conftool/dbconfig/20250416-205907-fceratto.json
[20:59:11] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:59:16] <wikibugs>	 (PS2) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[20:59:58] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2100)
[21:01:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T391056)', diff saved to https://phabricator.wikimedia.org/P75159 and previous config saved to /var/cache/conftool/dbconfig/20250416-210128-fceratto.json
[21:01:32] <wikibugs>	 (PS1) Reedy: specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086)
[21:03:39] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10750114 (Eevans) > aqs1022  > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN?  It //can// go anywhere in row D —or— anywhere in...
[21:05:21] <wikibugs>	 (CR) Ryan Kemper: [C:+2] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[21:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:05:38] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10750121 (Eevans) > restbase1045  > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN?   Yes.  > [] (service owner) What hosts can t...
[21:06:23] <wikibugs>	 (PS1) Jforrester: wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - https://gerrit.wikimedia.org/r/1137075
[21:06:23] <wikibugs>	 (PS1) Jforrester: wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080)
[21:07:17] <James_F>	 Reedy: If you want to deploy that ^^ please go ahead, we're in services land only today.
[21:07:28] <wikibugs>	 (CR) Ecarg: [C:+2] wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - https://gerrit.wikimedia.org/r/1137075 (owner: Jforrester)
[21:07:41] <wikibugs>	 (CR) Reedy: [C:+2] specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086) (owner: Reedy)
[21:07:45] <Reedy>	 Cheers
[21:09:14] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - https://gerrit.wikimedia.org/r/1137075 (owner: Jforrester)
[21:09:43] <wikibugs>	 (CR) Ecarg: [C:+2] wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080) (owner: Jforrester)
[21:11:10] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080) (owner: Jforrester)
[21:11:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2077.codfw.wmnet with OS bullseye
[21:13:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2079 to cirrussearch2079
[21:13:25] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:13:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:14:00] <logmsgbot>	 !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:15:36] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:16:32] <logmsgbot>	 !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:16:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P75160 and previous config saved to /var/cache/conftool/dbconfig/20250416-211634-fceratto.json
[21:16:48] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:17:47] <logmsgbot>	 !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:18:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2079 to cirrussearch2079 - bking@cumin2002"
[21:20:34] <wikibugs>	 (Merged) jenkins-bot: specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086) (owner: Reedy)
[21:21:43] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]]
[21:21:47] <stashbot>	 T392086: PHP Warning: Array to string conversion / RuntimeException: PCRE failure on Special:PasswordReset - https://phabricator.wikimedia.org/T392086
[21:25:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2079 to cirrussearch2079 - bking@cumin2002"
[21:25:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:25:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2079
[21:26:37] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:26:42] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[21:26:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2079
[21:27:38] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2079 to cirrussearch2079
[21:27:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2079.codfw.wmnet on all recursors
[21:27:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2079.codfw.wmnet on all recursors
[21:27:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2079.codfw.wmnet with OS bullseye
[21:28:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2079
[21:30:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:30:36] <wikibugs>	 (PS3) Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:31:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P75161 and previous config saved to /var/cache/conftool/dbconfig/20250416-213141-fceratto.json
[21:33:30] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]] (duration: 11m 47s)
[21:33:33] <stashbot>	 T392086: PHP Warning: Array to string conversion / RuntimeException: PCRE failure on Special:PasswordReset - https://phabricator.wikimedia.org/T392086
[21:34:06] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: restore forcemerge in curator [puppet] - https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: Filippo Giunchedi)
[21:34:47] <wikibugs>	 (CR) Cwhite: [C:+2] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - https://gerrit.wikimedia.org/r/1137027 (owner: Herron)
[21:37:05] <wikibugs>	 (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:39:03] <wikibugs>	 (PS3) Bking: cirrussearch: prepare for eqiad migration [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610)
[21:39:15] <wikibugs>	 (PS4) Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:39:53] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:41:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2079 - bking@cumin2002"
[21:41:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2079 - bking@cumin2002"
[21:41:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:41:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2079.codfw.wmnet 128.16.192.10.in-addr.arpa 8.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:41:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2079.codfw.wmnet 128.16.192.10.in-addr.arpa 8.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:41:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2079
[21:41:33] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/745496 (owner: JMeybohm)
[21:41:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2079
[21:41:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2079
[21:46:28] <wikibugs>	 (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:46:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T391056)', diff saved to https://phabricator.wikimedia.org/P75162 and previous config saved to /var/cache/conftool/dbconfig/20250416-214648-fceratto.json
[21:46:52] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:47:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: Maintenance
[21:47:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75163 and previous config saved to /var/cache/conftool/dbconfig/20250416-214710-fceratto.json
[21:49:29] <wikibugs>	 (PS5) Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:51:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:53:39] <jinxer-wm>	 FIRING: ProbeDown: Service restbase1045-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1045-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:55:33] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2.*
[21:56:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage
[21:58:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75164 and previous config saved to /var/cache/conftool/dbconfig/20250416-215804-fceratto.json
[21:58:08] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:59:03] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2200)
[22:00:05] <jouncebot>	 aude: A patch you scheduled for Web Team deployment window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[22:01:05] <aude>	 deploying updates to the chart renderer service in a few minutes
[22:01:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage
[22:02:40] <wikibugs>	 (PS5) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115
[22:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:09:38] <wikibugs>	 (PS1) Aude: Update chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027)
[22:13:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P75165 and previous config saved to /var/cache/conftool/dbconfig/20250416-221311-fceratto.json
[22:14:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10750370 (phaultfinder)
[22:16:09] <wikibugs>	 (CR) Seddon: "Deployment approved." [deployment-charts] - https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: Aude)
[22:16:22] <wikibugs>	 (CR) Seddon: [C:+1] Update chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: Aude)
[22:16:58] <wikibugs>	 (CR) Aude: [C:+2] Update chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: Aude)
[22:18:31] <wikibugs>	 (Merged) jenkins-bot: Update chart-renderer service [deployment-charts] - https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: Aude)
[22:19:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10750374 (RobH) Open→Stalled a:Kappakayala→RobH Please note this needs to be stalled as it turns out we may not use D6 for frack.  Please take no further...
[22:19:42] <wikibugs>	 (CR) JHathaway: "thanks for giving it a try @ltoscano@wikimedia.org. Also, thanks for spotting the `Gemfile.lock` issue, the path was wrong, it should be `" [puppet] - https://gerrit.wikimedia.org/r/1135115 (owner: JHathaway)
[22:20:47] <logmsgbot>	 !log aude@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[22:21:24] <logmsgbot>	 !log aude@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[22:26:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2079.codfw.wmnet with OS bullseye
[22:27:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:28:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P75166 and previous config saved to /var/cache/conftool/dbconfig/20250416-222818-fceratto.json
[22:32:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:36:10] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610
[22:36:14] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:43:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75167 and previous config saved to /var/cache/conftool/dbconfig/20250416-224325-fceratto.json
[22:43:29] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[22:43:42] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2226.codfw.wmnet with reason: Maintenance
[22:43:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance
[22:44:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75168 and previous config saved to /var/cache/conftool/dbconfig/20250416-224405-fceratto.json
[22:46:12] <logmsgbot>	 !log aude@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[22:46:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75169 and previous config saved to /var/cache/conftool/dbconfig/20250416-224627-fceratto.json
[22:46:45] <logmsgbot>	 !log aude@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[22:48:39] <jinxer-wm>	 RESOLVED: ProbeDown: Service restbase1045-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1045-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:49:17] <logmsgbot>	 !log aude@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[22:49:49] <logmsgbot>	 !log aude@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[22:53:22] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10750415 (bking) Thanks @Jhancock.wm ! Will try and reimage now.
[22:54:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye
[22:54:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091
[22:54:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091
[22:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:59:07] <wikibugs>	 (PS1) Bking: cirrussearch: Add newly-reimaged hosts to conftool [puppet] - https://gerrit.wikimedia.org/r/1137086 (https://phabricator.wikimedia.org/T388610)
[23:00:47] <wikibugs>	 (PS2) Bking: cirrussearch: Add newly-reimaged hosts to conftool [puppet] - https://gerrit.wikimedia.org/r/1137086 (https://phabricator.wikimedia.org/T388610)
[23:01:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P75170 and previous config saved to /var/cache/conftool/dbconfig/20250416-230134-fceratto.json
[23:09:08] <wikibugs>	 (PS1) BryanDavis: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142)
[23:10:56] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1043.eqiad.wmnet
[23:10:56] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1043.eqiad.wmnet
[23:11:03] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1044.eqiad.wmnet
[23:11:03] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1044.eqiad.wmnet
[23:11:11] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1045.eqiad.wmnet
[23:11:11] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1045.eqiad.wmnet
[23:14:07] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:15:04] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1028.eqiad.wmnet with reason: Decommissioning — T389423
[23:15:07] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[23:15:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:15:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091
[23:15:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091
[23:16:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P75171 and previous config saved to /var/cache/conftool/dbconfig/20250416-231641-fceratto.json
[23:16:51] <urandom>	 !log decommissioning restbase1028/Cassandra — T389423
[23:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:41] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:31:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75172 and previous config saved to /var/cache/conftool/dbconfig/20250416-233148-fceratto.json
[23:31:53] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:31:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2238.codfw.wmnet with reason: Maintenance
[23:32:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75173 and previous config saved to /var/cache/conftool/dbconfig/20250416-233200-fceratto.json
[23:33:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:33:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091
[23:33:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091
[23:33:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:34:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye
[23:34:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091
[23:34:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091
[23:40:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137089
[23:40:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137089 (owner: TrainBranchBot)
[23:42:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75174 and previous config saved to /var/cache/conftool/dbconfig/20250416-234221-fceratto.json
[23:42:25] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:45:47] <icinga-wm>	 PROBLEM - Disk space on an-worker1114 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 156818 MB (4% inode=99%): /var/lib/hadoop/data/k 245919 MB (6% inode=99%): /var/lib/hadoop/data/h 247407 MB (6% inode=99%): /var/lib/hadoop/data/b 146976 MB (3% inode=99%): /var/lib/hadoop/data/d 202739 MB (5% inode=99%): /var/lib/hadoop/data/f 233204 MB (6% inode=99%): /var/lib/hadoop/data/i 215232 MB (5% inode=99%): /var/lib/hadoop/data
[23:45:47] <icinga-wm>	 6 MB (4% inode=99%): /var/lib/hadoop/data/l 164780 MB (4% inode=99%): /var/lib/hadoop/data/c 242294 MB (6% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1114&var-datasource=eqiad+prometheus/ops
[23:52:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1137089 (owner: TrainBranchBot)
[23:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:57:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P75175 and previous config saved to /var/cache/conftool/dbconfig/20250416-235728-fceratto.json