[00:02:18] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [00:02:22] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [00:10:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136823 [00:10:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136823 (owner: TrainBranchBot) [00:10:58] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P75083 and previous config saved to /var/cache/conftool/dbconfig/20250416-001156-fceratto.json [00:27:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75084 and previous config saved to /var/cache/conftool/dbconfig/20250416-002703-fceratto.json [00:27:07] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:27:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2221.codfw.wmnet with reason: Maintenance [00:27:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75085 and previous config saved to /var/cache/conftool/dbconfig/20250416-002725-fceratto.json [00:27:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136823 (owner: TrainBranchBot) [00:36:50] FIRING: 
KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:43:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75086 and previous config saved to /var/cache/conftool/dbconfig/20250416-004338-fceratto.json [00:43:42] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:58:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P75087 and previous config saved to /var/cache/conftool/dbconfig/20250416-005846-fceratto.json [01:13:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P75088 and previous config saved to /var/cache/conftool/dbconfig/20250416-011353-fceratto.json [01:21:36] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/0fa72902e0aab988e2631df2617f26171681e532532aebd7feb2130a6edd4519/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:29:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T391056)', diff saved to https://phabricator.wikimedia.org/P75089 and previous config saved to /var/cache/conftool/dbconfig/20250416-012901-fceratto.json [01:29:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:29:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db2222.codfw.wmnet with reason: Maintenance [01:29:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75090 and previous config saved to /var/cache/conftool/dbconfig/20250416-012924-fceratto.json [01:41:36] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:45:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75091 and previous config saved to /var/cache/conftool/dbconfig/20250416-014529-fceratto.json [01:45:34] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:53:39] FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:58:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [02:00:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P75092 and previous config saved to /var/cache/conftool/dbconfig/20250416-020036-fceratto.json [02:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has 
errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [02:12:58] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 41.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:13:39] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:25] FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P75093 and previous config saved to /var/cache/conftool/dbconfig/20250416-021544-fceratto.json [02:16:16] RECOVERY - OpenSearch health check for shards on 9600 on cirrussearch2103 is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: green, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1678, active_shards: 5033, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_ [02:16:16] tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:16:16] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2103 is OK: OK - elasticsearch status production-search-codfw: 
cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 60, number_of_data_nodes: 60, discovered_master: True, active_primary_shards: 1354, active_shards: 4185, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 3, delayed_unassigned_shards: 0, number_of_pending [02:16:16] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.92836676217765 https://wikitech.wikimedia.org/wiki/Search%23Administration [02:23:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2103:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [02:30:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75094 and previous config saved to /var/cache/conftool/dbconfig/20250416-023052-fceratto.json [02:30:56] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:43:39] FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:50:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:50:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:52:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:42:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:45:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:06] PROBLEM - Restbase root url on restbase1029 is CRITICAL: connect to address 10.64.16.173 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [05:38:16] !log installing spicerack v10.1.0 on cumin2002 [05:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T0600) [06:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:09:28] !log installing spicerack v10.1.0 on cumin1002 [06:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:39] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:46] SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10746393 (Lena_WMDE) @MatthewVernon works as expected, thank you! :) [06:23:06] (PS1) Volans: __title__: remove when it's just the __doc__ [cookbooks] - https://gerrit.wikimedia.org/r/1136835 [06:23:06] (PS1) Volans: Data Platform cookbooks: use base argument parser [cookbooks] - https://gerrit.wikimedia.org/r/1136836 [06:23:06] (PS1) Volans: I/F cookbooks: use base argument parser [cookbooks] - https://gerrit.wikimedia.org/r/1136837 [06:25:05] (PS1) Volans: I/F cookbooks: use the parser tuning attributes [cookbooks] - https://gerrit.wikimedia.org/r/1136838 [06:25:05] (PS1) Volans: ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136839 [06:25:05] (PS1) Volans: Traffic cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136840 [06:25:06] (PS1) Volans: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136841 [06:25:06] (PS1) Volans: DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136842 [06:25:08] (PS1) Volans: DataPers. cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136843 [06:25:12] (PS1) Volans: Observabil. 
cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136844 [06:31:20] (PS1) Fabfur: cache: add termination status to haproxy log format [puppet] - https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) [06:37:00] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) (owner: Fabfur) [06:37:07] ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10746400 (BCornwall) So far so good in the first 8 hours of uptime! We'll let it simmer overnight and see how it fares. [06:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:46:18] (PS2) Fabfur: cache: add termination state to haproxy log format [puppet] - https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) [06:57:46] (CR) Elukey: [C:+1] sre.hosts.reimage: check dbctl (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: Volans) [06:59:06] (CR) Volans: [C:+2] sre.hosts.reimage: check dbctl [cookbooks] - https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: Volans) [07:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:02:50] !log powercycle ml-serve2007 - OEM event registered in getsel (seems DIMM-related) [07:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:06] (PS17) Elukey: services: enable ingress for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T390251) [07:05:47] (Merged) jenkins-bot: sre.hosts.reimage: check dbctl [cookbooks] - https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: Volans) [07:05:50] RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [07:05:52] (PS18) Elukey: services: enable ingress for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T391457) [07:06:30] (PS1) Kevin Bazira: eventstreams: expose RRLA event stream publicly [deployment-charts] - https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) [07:06:34] RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 38, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:06:41] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746407 (Jelto) >>! In T378922#10743705, @MatthewVernon wrote: > Looking at the Ceph metrics, it seems the packages were fewer l... [07:06:57] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10746410 (Jelto) [07:11:50] RESOLVED: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:22:07] SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10746432 (MatthewVernon) [07:24:45] (Abandoned) Fabfur: cache: add termination state to haproxy log format [puppet] - https://gerrit.wikimedia.org/r/1136845 (https://phabricator.wikimedia.org/T387454) (owner: Fabfur) [07:26:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:26:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:27:16] (PS1) MVernon: admin: add kcoleman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861) [07:36:16] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [07:39:39] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861) (owner: MVernon) [07:42:11] (CR) MVernon: [C:+2] admin: add kcoleman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1136936 (https://phabricator.wikimedia.org/T391861) (owner: MVernon) [07:43:39] FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:49:06] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10746512 (MatthewVernon) Open→Resolved a:MatthewVernon @KColeman-WMF this is done for you now (but I'd allow... [07:50:31] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746521 (Gehel) [07:50:32] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:50:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:48] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [08:02:48] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [08:10:55] SRE, Data-Platform-SRE 
(2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10746546 (Gehel) p:Triage→High [08:12:06] SRE, Data-Platform-SRE (2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10746549 (Gehel) Open→Resolved a:brouberol Archiva is still being used, so we should still keep an eye on it. Cleanup done by @brouberol, we should be good for a while. [08:16:02] !log destroy the "main" helmfile releases for mw-wikifunctions. The service is now being powered by the single version MediaWiki HTTP routing solution releases, this is a cleanup. [08:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:21] (CR) Filippo Giunchedi: [C:+1] Observabil. cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136844 (owner: Volans) [08:30:33] (CR) Alexandros Kosiaris: [C:+2] scap: Stop updating main mw-wikifunctions release [puppet] - https://gerrit.wikimedia.org/r/1136749 (owner: Alexandros Kosiaris) [08:30:51] (CR) Alexandros Kosiaris: [C:+2] mw-wikifunctions: Remove the main release [deployment-charts] - https://gerrit.wikimedia.org/r/1136748 (owner: Alexandros Kosiaris) [08:32:17] (Merged) jenkins-bot: mw-wikifunctions: Remove the main release [deployment-charts] - https://gerrit.wikimedia.org/r/1136748 (owner: Alexandros Kosiaris) [08:39:52] (PS4) Volans: Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi) [08:42:22] (PS1) Ladsgroup: Bump thumbnail steps to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589) [08:44:18] (CR) Ladsgroup: [C:+2] Bump thumbnail steps to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [08:44:23] (CR) Volans: 
[C:-1] "I've rebased this one to resolve the rebase conflicts given the recent homer changes." [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi) [08:45:04] (Merged) jenkins-bot: Bump thumbnail steps to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/1136963 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup) [08:45:19] (PS2) Volans: Observabil. cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136844 [08:45:25] FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:30] (CR) Volans: [C:+2] "Thanks for the review" [cookbooks] - https://gerrit.wikimedia.org/r/1136844 (owner: Volans) [08:46:12] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] [08:46:15] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [08:51:48] (CR) CI reject: [V:-1] Move fetch_device_interfaces from wmf-netbox to core Homer [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi) [08:52:09] (Merged) jenkins-bot: Observabil. 
cookbooks: use the parser tuning attrs [cookbooks] - https://gerrit.wikimedia.org/r/1136844 (owner: Volans) [08:58:02] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:58:06] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [08:59:06] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [08:59:34] (PS2) Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) [09:00:10] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez) [09:02:10] !log ladsgroup@deploy1003 sync-world failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'write-values', '--output-file-template', '/tmp/tmpsh_tee3p']' returned non-zero exit status 3. (scap version: 4.153.0) (duration: 15m 58s) [09:02:59] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] [09:05:46] I'm retrying again [09:07:31] (PS1) Ladsgroup: Change default thumbnail size to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914) [09:07:39] anything that broke related to the registry issue? [09:07:42] or other things? 
[09:07:52] lemme know in case :D [09:08:54] https://www.irccloud.com/pastebin/cipVY2Nf/ [09:08:56] elukey: [09:09:42] ok never seen this before, and it looks really weird [09:10:24] yeah [09:10:26] it doesn't seem related to the registry though, but scap running helmfile in the wrong way [09:10:32] it couldn't even roll back [09:10:32] +1 to retry [09:12:23] (CR) Klausman: [C:+1] role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: Elukey) [09:12:50] (CR) Elukey: [C:+2] role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: Elukey) [09:15:02] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:15:06] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [09:15:15] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [09:15:23] !log repooling cp4047 - T387238 [09:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:30] T387238: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238 [09:17:27] (PS37) Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb) [09:17:47] (CR) Majavah: [C:+1] "Fine with me, I can merge/deploy as long as Francesco does not have any objections" [puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: Ladsgroup) [09:18:54] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve-ctrl1001.eqiad.wmnet with OS bookworm [09:19:30] (CR) FNegri: [C:+1] "Fine with me!" 
[puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: Ladsgroup) [09:20:26] (CR) Majavah: [C:+2] openstack: wikireplica_dns: Add termstore aliases for s8 [puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: Ladsgroup) [09:22:04] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136963|Bump thumbnail steps to 100% (T360589)]] (duration: 19m 05s) [09:22:08] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [09:22:08] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746665 (Jelto) [09:22:24] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914) (owner: Ladsgroup) [09:22:32] (CR) Kamila Součková: [C:-1] "I have a "Chesterton's fence" feeling about this. 
This seems reasonable for when you're developing, but for the submit checks on Gerrit I " [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar) [09:22:53] elukey: It worked the second time *shrugs* [09:23:07] (Merged) jenkins-bot: Change default thumbnail size to 250px [mediawiki-config] - https://gerrit.wikimedia.org/r/1136964 (https://phabricator.wikimedia.org/T355914) (owner: Ladsgroup) [09:23:14] Amir1: it felt that you were upset, I'd have done the same [09:23:31] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]] [09:23:31] I'm always upset :D [09:23:35] T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 [09:23:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:15] (CR) Kamila Součková: [C:-1] "Just to clarify, I -1'd it because I want to hear someone else's opinion on this, I can be persuaded :-)" [deployment-charts] - https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: Hashar) [09:24:47] Amir1: nah :D [09:28:13] (CR) Volans: [C:-1] "Forgot to ask, where are the queries? I don't see them in homer/public" [software/homer] - https://gerrit.wikimedia.org/r/1124437 (owner: Ayounsi) [09:29:38] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10746694 (10Jelto) I've triggered a backup on the GitLab replica, which has been switched to object storage. The new backup runtime... [09:31:14] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10746698 (10MatthewVernon) >>! In T391544#10745829, @Eevans wrote: > Cassandra's JBOD is pretty dumb in this r... [09:31:57] jouncebot: nowandnext [09:31:57] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [09:31:57] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000) [09:32:55] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage [09:35:09] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:35:13] T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 [09:36:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: host reimage [09:36:26] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [09:37:33] (03CR) 10Lucas Werkmeister (WMDE): Release campaignEvents extension to azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [09:39:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746721 (10Stevemunene) a:05Gehel→03Stevemunene [09:39:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) 
data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10746722 (10Gehel) a:05Stevemunene→03None [09:42:54] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136965 [09:43:07] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136964|Change default thumbnail size to 250px (T355914)]] (duration: 19m 35s) [09:43:22] T355914: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914 [09:54:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet with OS bookworm [09:57:29] (03PS1) 10Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 [09:58:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [09:59:40] (03CR) 10CI reject: [V:04-1] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 (owner: 10Michael Große) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000) [10:00:44] (03PS2) 10Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) [10:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:01:25] (03PS2) 10Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 [10:02:16] (03PS41) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 
10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:02:16] (03CR) 10Federico Ceratto: "Updated using new features from Spicerack" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:02:19] (03PS3) 10Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) [10:03:06] (03PS42) 10Federico Ceratto: sre.mysql.upgrade: Switch to Host, apt-get and mysql helpers [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:04:45] jouncebot: nowandnext [10:04:45] For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000) [10:04:45] In 0 hour(s) and 55 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100) [10:06:06] (03CR) 10Hnowlan: [C:03+2] switchdc: clarify inputs for moving active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:06:38] I'm looking into getting an urgent puppet script for mentorship reenabled. To that end, I would like to run the script as a test against testwiki. Is there an issue with that? [10:06:52] As is, with doing that now? 
[10:07:01] (03PS3) 10FNegri: openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) [10:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:09:04] MichaelG_WMF: no issue with the timing - is your patch restoring the jobs for T391695? [10:09:04] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [10:09:16] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056 (10cmooney) 03NEW p:05Triage→03Medium [10:09:39] hnowlan: yes, that is the goal [10:10:18] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746764 (10cmooney) [10:10:41] MichaelG_WMF: that's probably fine (ccing Amir1 for awareness) [10:11:07] (03Abandoned) 10Hnowlan: httpbb: use k8s jobrunners for healthchecking [puppet] - 10https://gerrit.wikimedia.org/r/1112728 (https://phabricator.wikimedia.org/T383317) (owner: 10Hnowlan) [10:11:18] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:12:06] (03Merged) 10jenkins-bot: switchdc: clarify inputs for moving active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/1128895 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:12:28] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746767 (10cmooney) [10:13:39] FIRING: [3x] SystemdUnitFailed: 
curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:14:29] (03Abandoned) 10Hnowlan: deployment: switch deploy servers to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1127074 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [10:15:39] hnowlan: Amir1: running the script worked without error, we should be able to reenable it I hope. Who exactly should I talk to for this? The Wiki says "talk to SRE" [10:16:09] (03CR) 10Jgiannelos: [C:04-1] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [10:17:03] !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.25$ mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose #T391695 [10:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:07] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [10:17:12] I will enable it soon [10:17:23] Amir1 <3 [10:18:39] volans@cumin2002 downtime (PID 4108428) is awaiting input [10:18:47] elukey ^^^ yay [10:19:09] MichaelG_WMF: while I get to a pc, can you try running it on frwiki and ruwiki and record how long it took? 
[10:19:35] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nupwiki (T390714) [10:19:38] T390714: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714 [10:19:40] Amir1 can do [10:19:41] (03CR) 10Hashar: "> for the submit checks on Gerrit I think we do actually want the diff against master, as that's what you're merging into." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [10:19:46] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database nupwiki (T390714) [10:20:01] Thanks! [10:20:12] (though not sure where to find the actual slow queries log and how to read it) [10:20:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:21:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:21:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75096 and previous config saved to /var/cache/conftool/dbconfig/20250416-102110-fceratto.json [10:21:14] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:23:21] !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki frwiki --verbose #T391695 [10:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:24] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [10:24:33] volans: ah nice! 
[10:25:25] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10746807 (10MatthewVernon) That fits with what I see from bucket stats: gitlab-packages has 3,938 objects and 195GB, gitlab-artifac... [10:26:29] * MichaelG_WMF makes note to self: add some (--verbose) output while running to updateMenteeData.php -- looking at a shell that shows ~nothing is not great [10:29:13] Amir1: frwiki took 4m21s or 261 seconds. now running on ruwiki [10:29:50] !log migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki ruwiki --verbose #T391695 [10:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:54] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [10:30:40] (03PS1) 10Volans: spicerack: enable IRC notification on user input [puppet] - 10https://gerrit.wikimedia.org/r/1136973 [10:32:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75097 and previous config saved to /var/cache/conftool/dbconfig/20250416-103236-fceratto.json [10:32:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:32:54] (03PS3) 10Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) [10:33:41] (03CR) 10Volans: "Tested on cumin2002, it notified me correctly:" [puppet] - 10https://gerrit.wikimedia.org/r/1136973 (owner: 10Volans) [10:33:57] Amir1: ruwiki was 173 seconds [10:34:15] (03CR) 10Volans: [C:04-1] "Forgot they are not yet merged, for reference are in 
Ia3ff62de353a2f2d2a48498b6d6ed96743fb3ffd" [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [10:34:42] though from our metrics, I expect enwiki to be one of those that runs a really looong time [10:35:23] (03CR) 10Hnowlan: "Thanks for the review! My list of changes was based on the initial list in the ticket, good catches." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [10:37:24] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746861 (10cmooney) [10:37:48] yo MichaelG_WMF, could you try and run it using mwscript-k8s? That'd give us confidence for when we migrate it to mw-cron [10:38:07] (03CR) 10Kamila Součková: [C:04-1] "You're correct, but my worry is about a chain of commits and master diverging. (Sorry, I should have mentioned that explicitly.) In most c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [10:38:10] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746862 (10cmooney) [10:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:38:48] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746864 (10cmooney) [10:40:16] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP 
system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10746877 (10cmooney) [10:40:22] claime: can I do that now? I know there used to be an issue around that because I only have restricted access and not full deployment access [10:40:38] ah, idk, unsure [10:40:49] I can run it if you give me an invoc and an ok [10:41:30] https://phabricator.wikimedia.org/T378429 guess not [10:43:14] claime: This should finish in about 11 seconds and be generally low-risk: `/srv/mediawiki/php-1.44.0-wmf.25$ mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose` [10:43:21] (03CR) 10Kamila Součková: [C:04-1] "I would feel a lot more comfortable with this change if we also added a new `check_master` task that replicates the old behaviour. It's pr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [10:43:22] cool [10:46:33] Done. Took 9 seconds. [10:46:39] Ran on php 8.1 inside k8s [10:47:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P75098 and previous config saved to /var/cache/conftool/dbconfig/20250416-104744-fceratto.json [10:48:27] Nice! 
[10:50:31] (03PS1) 10Abijeet Patro: Add channel for ContentTranslation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) [10:52:23] !log cgoubert@deploy1003 Started scap build-images: (no justification provided) [10:52:43] PROBLEM - Hadoop NodeManager on an-worker1189 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:52:58] I need to check the load on db1180 and some other things and then let you know [10:54:35] (03CR) 10Jgiannelos: [C:03+1] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [10:56:44] jouncebot: nowandnext [10:56:44] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1000) [10:56:44] In 0 hour(s) and 3 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100) [10:57:51] (03PS1) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [10:57:53] (03PS1) 10Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) [10:58:00] !log cgoubert@deploy1003 Finished scap build-images: (no justification provided) (duration: 05m 36s) [10:58:28] (03CR) 10Elukey: "Please check that my assumptions are correct :)" [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [11:00:04] mvolz: That opportune time for a Services – Citoid / Zotero 
deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1100). [11:02:33] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) (owner: 10FNegri) [11:02:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P75099 and previous config saved to /var/cache/conftool/dbconfig/20250416-110252-fceratto.json [11:03:39] (03CR) 10Elukey: "I have zero experience in this template, it looks good but I'd rely on Filippo's input to be honest :(" [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [11:04:43] RECOVERY - Hadoop NodeManager on an-worker1189 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:04:48] (03PS2) 10Clément Goubert: php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) [11:05:19] (03CR) 10Clément Goubert: [V:03+2 C:03+2] php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [11:05:21] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5308/co" [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [11:05:23] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:05:38] (03PS1) 10Cathal Mooney: Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - 10https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) [11:06:04] !log Rebuilding php base images to pick up 1135922 - T391665 [11:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:08] T391665: Move mwscript wrapper from base image to copy on build - https://phabricator.wikimedia.org/T391665 [11:06:12] (03PS2) 10Abijeet Patro: Add channel for ContentTranslation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) [11:09:21] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:09:43] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:10:21] !log cgoubert@deploy1003 Started scap sync-world: Move mwscript wrapper from base image to copy on build - T391665 [11:11:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:15:23] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:18:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T391056)', diff saved to https://phabricator.wikimedia.org/P75100 and previous config saved to /var/cache/conftool/dbconfig/20250416-111759-fceratto.json [11:18:03] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:18:15] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:18:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75101 and previous config saved to /var/cache/conftool/dbconfig/20250416-111822-fceratto.json [11:19:37] (03CR) 10Hnowlan: [C:03+2] rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [11:20:23] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747041 (10Ladsgroup) 05Open→03Resolved [11:21:09] (03Merged) 10jenkins-bot: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [11:21:23] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/922c734ba2d3515515e7e0c69be9fcf04f1bc210092cb07b58fc3729e51d4cd6/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:26:03] (03CR) 10Nikerabbit: [C:03+1] Add channel for ContentTranslation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) (owner: 10Abijeet Patro) [11:26:27] PROBLEM - Hadoop NodeManager on an-worker1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:27:27] RECOVERY - Hadoop NodeManager on an-worker1069 is OK: PROCS OK: 1 
process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:29:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75102 and previous config saved to /var/cache/conftool/dbconfig/20250416-112948-fceratto.json [11:29:52] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:30:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10747081 (10cmooney) >>! In T392007#10745165, @Jclark-ctr wrote: > @RobH we have 1 free cross connect circuit id 21996480. but have plenty of room for additional p... [11:32:43] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:36:57] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [11:37:02] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [11:37:37] !log temporarily disable query sites on miscweb vms - T350793 [11:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:40] T350793: move query.wikidata.org to kubernetes - https://phabricator.wikimedia.org/T350793 [11:37:45] 26 minutes for a full image push but IT WENT THROUGH. 
[11:40:35] (03PS1) 10Volans: doc: expand logging documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136984 [11:41:15] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136985 [11:41:23] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:41:30] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:41:37] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:42:41] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] "I like this idea! Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [11:43:39] FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P75103 and previous config saved to /var/cache/conftool/dbconfig/20250416-114455-fceratto.json [11:45:12] FIRING: ProbeDown: Service miscweb2003:443 has failed probes (http_query_scholarly_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:26] ^ miscweb alert is expected, I'll silence this [11:46:31] (03CR) 10Cathal Mooney: [C:03+2] Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - 10https://gerrit.wikimedia.org/r/1136980 
(https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [11:47:03] (03Merged) 10jenkins-bot: Cloud network: update policy to support /17 IPv4 aggregates [homer/public] - 10https://gerrit.wikimedia.org/r/1136980 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [11:48:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136965 (owner: 10PipelineBot) [11:48:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135706 (owner: 10PipelineBot) [11:48:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136366 (owner: 10PipelineBot) [11:48:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136791 (owner: 10PipelineBot) [11:48:22] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136789 (owner: 10PipelineBot) [11:48:43] restbase1045-b is actually down - but also not in the cluster? [11:48:46] (03CR) 10Tacsipacsi: [C:03+1] "In T391297#10737100, it was highlighted that this is a regression (caused by I39d1d1f45c017e6522f71979c8ad70ae2b00c333). 
Given this, I’m f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [11:48:46] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136985 (owner: 10PipelineBot) [11:49:34] oh, restbase1045-b is possibly yet to be bootstrapped cc urandom [11:50:15] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136985 (owner: 10PipelineBot) [11:50:38] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127760 (owner: 10PipelineBot) [11:50:39] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133976 (owner: 10PipelineBot) [11:50:39] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132619 (owner: 10PipelineBot) [11:50:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:50:39] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132653 (owner: 10PipelineBot) [11:50:40] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131799 (owner: 10PipelineBot) [11:50:41] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123332 (owner: 10PipelineBot) [11:50:45] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126707 (owner: 10PipelineBot) [11:50:49] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125137 (owner: 10PipelineBot) [11:50:53] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124441 (owner: 10PipelineBot) [11:50:57] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114365 (owner: 10PipelineBot) [11:51:01] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112740 (owner: 10PipelineBot) [11:51:05] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092223 (owner: 10PipelineBot) [11:51:09] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100203 (owner: 10PipelineBot) [11:51:13] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105689 (owner: 10PipelineBot) [11:51:17] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111688 (owner: 10PipelineBot) [11:51:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088280 (owner: 10PipelineBot) [11:51:24] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:51:25] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1083795 (owner: 10PipelineBot) [11:51:29] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1082760 (owner: 10PipelineBot) [11:51:33] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1077698 (owner: 10PipelineBot) [11:51:37] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079997 (owner: 10PipelineBot) [11:51:39] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:51:41] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066744 (owner: 10PipelineBot) [11:51:42] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aphlict2001.codfw.wmnet with reason: Bookworm Re-image [11:51:45] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068755 (owner: 10PipelineBot) [11:51:49] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070272 (owner: 10PipelineBot) [11:51:52] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:51:53] (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [11:52:01] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:52:01] (03PS1) 10Cyndywikime: Growth: Configure higher Impact Module edit limits for testwiki pilot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) [11:52:31] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host aphlict2001.codfw.wmnet with OS bookworm [11:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:12] (03PS1) 10Slyngshede: Netbox: simplify query [alerts] - 10https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) [11:57:31] (03PS1) 10Cathal Mooney: WMCS: fix typo in updated cloud-in policy [homer/public] - 10https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725) [11:57:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:57:47] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:58:07] (03PS2) 10Cyndywikime: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) [11:59:13] (03CR) 10Cathal Mooney: [C:03+2] WMCS: fix typo in updated cloud-in policy [homer/public] - 10https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [11:59:50] (03Merged) 10jenkins-bot: WMCS: fix typo in updated cloud-in policy [homer/public] - 10https://gerrit.wikimedia.org/r/1136991 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [12:00:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P75104 and previous config saved to /var/cache/conftool/dbconfig/20250416-120002-fceratto.json [12:00:33] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:00:54] !log cgoubert@deploy1003 Finished scap sync-world: Move mwscript wrapper from base image to copy on build - T391665 (duration: 50m 43s) [12:00:57] T391665: Move mwscript wrapper from base image to copy on build - https://phabricator.wikimedia.org/T391665 
[12:04:30] (03PS1) 10Jgiannelos: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) [12:04:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:05:11] (03PS4) 10Michael Große: Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) [12:05:17] (03CR) 10Ladsgroup: [C:03+2] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) (owner: 10Michael Große) [12:05:21] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Revert "mediawiki: Absent updatementeedata jobs" [puppet] - 10https://gerrit.wikimedia.org/r/1136970 (https://phabricator.wikimedia.org/T390510) (owner: 10Michael Große) [12:05:33] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:05:41] (03PS2) 10Jgiannelos: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) [12:05:42] (03CR) 10CI reject: [V:04-1] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: 10Jgiannelos) [12:05:45] (03PS2) 10Slyngshede: Netbox: simplify query [alerts] - 10https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) [12:05:49] (03PS1) 10Cathal Mooney: WMCS: fix typo in updated cloud-in policy #2 [homer/public] - 10https://gerrit.wikimedia.org/r/1136993 
(https://phabricator.wikimedia.org/T364725) [12:06:15] (03PS2) 10Hnowlan: trafficserver: route various miscellaneous pcs services to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1136676 (https://phabricator.wikimedia.org/T385033) [12:06:32] (03CR) 10Cathal Mooney: [C:03+2] WMCS: fix typo in updated cloud-in policy #2 [homer/public] - 10https://gerrit.wikimedia.org/r/1136993 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [12:07:10] (03Merged) 10jenkins-bot: WMCS: fix typo in updated cloud-in policy #2 [homer/public] - 10https://gerrit.wikimedia.org/r/1136993 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [12:07:10] (03PS3) 10Cyndywikime: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) [12:08:02] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage [12:09:25] (03PS1) 10Clément Goubert: php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) [12:09:40] (03CR) 10Hnowlan: [C:03+1] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: 10Jgiannelos) [12:10:13] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:11:11] (03Abandoned) 10Clément Goubert: growthexperiments: Disable updatementeedata on s6 [puppet] - 10https://gerrit.wikimedia.org/r/1135988 (owner: 10Clément Goubert) [12:11:32] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
aphlict2001.codfw.wmnet with reason: host reimage [12:11:33] (03CR) 10Jgiannelos: [C:03+2] pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: 10Jgiannelos) [12:11:35] (03CR) 10Clément Goubert: [C:03+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [12:12:45] (03PS4) 10Cyndywikime: Growth: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) [12:13:05] (03Merged) 10jenkins-bot: pcs: Add HTTP request template for wikitext to html rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136992 (https://phabricator.wikimedia.org/T385033) (owner: 10Jgiannelos) [12:13:33] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:13:42] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:14:15] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:14:26] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:14:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:15:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75106 and previous config saved to /var/cache/conftool/dbconfig/20250416-121509-fceratto.json [12:15:14] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:15:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1188.eqiad.wmnet with reason: Maintenance [12:15:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75107 and previous config saved to /var/cache/conftool/dbconfig/20250416-121532-fceratto.json [12:17:29] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:17:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75108 and previous config saved to /var/cache/conftool/dbconfig/20250416-121742-fceratto.json [12:17:50] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:18:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:36] (03PS1) 10Cathal Mooney: WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - 10https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) [12:19:16] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:54] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:20:02] PROBLEM - Exim SMTP on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Exim [12:20:06] PROBLEM - SSH on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:20:24] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:20:43] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - 10https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:21:14] (03Merged) 10jenkins-bot: WMCS: Remove static routes for cloudsw2-d5-eqiad loopbacks [homer/public] - 10https://gerrit.wikimedia.org/r/1136995 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:23:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:23:16] PROBLEM - Hadoop NodeManager on an-worker1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:23:39] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:23:50] (03PS1) 10Cathal Mooney: WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - 10https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) [12:24:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:25:25] FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-updateMenteeData-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:30] (03PS1) 10Hnowlan: deployment_server: ignore overlayfs when checking disk space [puppet] - 10https://gerrit.wikimedia.org/r/1136997 [12:26:31] (03CR) 10Jelto: [C:03+1] "looks good to me, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans) [12:26:34] PROBLEM - Hadoop NodeManager on an-worker1097 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:27:17] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - 10https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:27:27] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - 10https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:27:54] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:59] (03Merged) 10jenkins-bot: WMCS: Change ASN for cloudsw1-e4/f4 [homer/public] - 10https://gerrit.wikimedia.org/r/1136996 (https://phabricator.wikimedia.org/T379283) (owner: 10Cathal Mooney) [12:32:46] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [12:32:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P75109 and previous config saved to /var/cache/conftool/dbconfig/20250416-123248-fceratto.json [12:34:14] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:36:51] (03PS1) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 
(https://phabricator.wikimedia.org/T391411) [12:37:56] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aphlict2001.codfw.wmnet with OS bookworm [12:38:16] (03CR) 10CI reject: [V:04-1] varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [12:38:51] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10747323 (10Jdforrester-WMF) [12:39:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [12:41:34] RECOVERY - Hadoop NodeManager on an-worker1097 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:43:36] (03CR) 10Clément Goubert: [C:03+1] deployment_server: ignore overlayfs when checking disk space [puppet] - 10https://gerrit.wikimedia.org/r/1136997 (owner: 10Hnowlan) [12:43:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:47:44] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:47:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P75111 and previous config saved to /var/cache/conftool/dbconfig/20250416-124755-fceratto.json [12:47:56] RECOVERY - SSH on lists1004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:47:58] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [12:48:06] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:48:22] (03CR) 10Volans: [C:03+2] CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans) [12:48:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:48:26] (03PS2) 10Volans: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 [12:48:39] FIRING: [4x] ProbeDown: Service restbase1045-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:16] RECOVERY - Hadoop NodeManager on an-worker1139 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:51:08] (03CR) 10Filippo Giunchedi: [C:03+2] etcd: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129177 
(https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [12:52:35] (03PS1) 10Cathal Mooney: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - 10https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) [12:55:27] (03PS2) 10Cathal Mooney: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - 10https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) [12:57:36] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [12:57:59] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - 10https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [12:58:05] (03CR) 10Alexandros Kosiaris: [C:03+2] webperf: Move `php_admin_flag engine on` from subdir to docroot [puppet] - 10https://gerrit.wikimedia.org/r/1130211 (owner: 10Krinkle) [12:58:32] (03Merged) 10jenkins-bot: WMCS: Remove cloud-instances2-b specific ranges from BGP policy [homer/public] - 10https://gerrit.wikimedia.org/r/1137002 (https://phabricator.wikimedia.org/T364725) (owner: 10Cathal Mooney) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1300). [13:00:04] HouseOfM and tto: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:33] Greetings! 
[13:00:38] (03CR) 10Hnowlan: [C:03+2] deployment_server: ignore overlayfs when checking disk space [puppet] - 10https://gerrit.wikimedia.org/r/1136997 (owner: 10Hnowlan) [13:00:43] o/ [13:01:05] (03CR) 10Filippo Giunchedi: [C:03+1] Netbox: simplify query [alerts] - 10https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:01:26] I can deploy, but I wouldn’t mind if someone else does it [13:01:50] o/ greetings [13:03:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T391056)', diff saved to https://phabricator.wikimedia.org/P75112 and previous config saved to /var/cache/conftool/dbconfig/20250416-130303-fceratto.json [13:03:07] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:03:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:03:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75113 and previous config saved to /var/cache/conftool/dbconfig/20250416-130326-fceratto.json [13:03:37] (03CR) 10Filippo Giunchedi: alertmanager: update irc template for pyrra slo alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [13:03:42] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "+1 as this restores a `strtolower()` that was already present prior to I39d1d1f45c." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [13:04:12] alright, I can deploy [13:04:21] :o thanks Lucas_WMDE! 
[13:04:22] and I’ll start with tto’s change since HouseOfM’s still has an open comment [13:04:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [13:04:44] it does? I hadn't seen that! thx [13:04:47] (03PS1) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) [13:05:01] yeah, I looked at it earlier today [13:05:19] haven’t had the time yet to fully confirm but I think all the core-Permissions.php changes are unnecessary [13:05:27] since CampaignEvents configures a group by default now [13:05:31] (03Merged) 10jenkins-bot: search-redirect: fix case-sensitivity of project name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [13:05:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75114 and previous config saved to /var/cache/conftool/dbconfig/20250416-130536-fceratto.json [13:05:47] IIUC the only core-Permissions.php entries that are left related to CampaignEvents are for nonstandard situations, like test wikis where all users should have those permissions [13:05:58] (03PS2) 10Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) [13:06:00] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]] [13:06:04] T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297 [13:06:10] (03CR) 10Jelto: [C:03+2] make helm3 alternative entry dependent on helm [debs/helm3] 
(helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [13:10:04] (03PS3) 10Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) [13:10:31] You are correct @Lucas_WMDE I've made the relevant change [13:10:39] nice [13:10:52] we can see what the userrights API reports on mwdebug :) [13:10:53] (03PS1) 10Ssingh: utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007 [13:11:26] (03CR) 10CI reject: [V:04-1] utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh) [13:12:03] (03CR) 10Giuseppe Lavagetto: [C:04-2] "The current behaviour is deliberately chosen for a reason: we want to know the full diff compared to what is in production right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [13:15:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:16:55] (03PS2) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [13:16:59] !log lucaswerkmeister-wmde@deploy1003 wargo, lucaswerkmeister-wmde: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:03] T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297 [13:17:27] (03PS2) 10Ssingh: utils/type65: handle base64 encoded ECHConfigList [dns] 
- 10https://gerrit.wikimedia.org/r/1137007 [13:17:42] tto: please test with WikimediaDebug :) [13:18:09] (I assume it should still work for docroot/wwwportal stuff) [13:18:24] (03CR) 10Tiziano Fogli: profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:18:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:18:49] !log bounce thanos on titan100* - overload [13:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:56] OK, will do... [13:19:38] https://www.wikipedia.org/search-redirect.php?language=de&search=Test&family=Wiktionary seems to work for me (redirects to Wikipedia currently but Wiktionary with -H 'X-Wikimedia-Debug: backend=k8s-mwdebug') [13:20:16] Can confirm working on k8s-mwdebug [13:20:23] !log lucaswerkmeister-wmde@deploy1003 wargo, lucaswerkmeister-wmde: Continuing with sync [13:20:27] nice, thanks! 
[13:20:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:20:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P75115 and previous config saved to /var/cache/conftool/dbconfig/20250416-132043-fceratto.json [13:21:54] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:22:20] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:23:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:24:40] !log finish rollout of thanos 0.38 to prometheus* - T383966 [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:44] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [13:26:52] RECOVERY - OpenSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: [13:26:52] er_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 
https://wikitech.wikimedia.org/wiki/Search%23Administration [13:26:54] (03PS2) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [13:27:01] (03CR) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [13:28:05] (03CR) 10Mhorsey: Release campaignEvents extension to azwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [13:28:56] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134984|search-redirect: fix case-sensitivity of project name (T391297)]] (duration: 22m 55s) [13:29:00] T391297: www.wiktionary.org and other portals are redirecting searches to wikipedia - https://phabricator.wikimedia.org/T391297 [13:29:28] (03PS3) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [13:29:46] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:29:56] Lucas_WMDE I can confirm this is now working in production! 
[13:29:58] (03CR) 10CI reject: [V:04-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:30:00] Thanks for your assistance as ever [13:31:03] (03PS2) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [13:31:04] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:31:22] Goodnight all [13:32:05] Lucas_WMDE: o/ all good with the deployments so far right? [13:32:40] (03PS4) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [13:32:42] elukey: yup [13:33:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [13:33:46] I’m also in a meeting now, so might be a bit slow to respond to messages [13:33:47] FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:33:50] hopefully the deployment will go smoothly [13:34:15] (03Merged) 10jenkins-bot: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [13:34:38] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]] [13:34:42] T390805: Enable CampaignEvents Extension on azwiki - 
https://phabricator.wikimedia.org/T390805 [13:34:53] yep yep, ping me if needed [13:35:02] it seems that the 5 minutes delay is working [13:35:29] (03CR) 10Hashar: "> The current behaviour is deliberately chosen for a reason: we want to know the full diff compared to what is in production right now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [13:35:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P75116 and previous config saved to /var/cache/conftool/dbconfig/20250416-133552-fceratto.json [13:38:59] ah, the sleep is hidden in build-and-push-container-images ^^ [13:39:07] (03CR) 10Elukey: [C:03+1] __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans) [13:39:52] (03CR) 10Elukey: [C:03+1] Data Platform cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans) [13:40:00] (03CR) 10Elukey: [C:03+1] I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans) [13:41:25] yay, sleep done [13:43:36] it should tell you something in the scap log though [13:43:46] there is also a "Sorry" :D [13:43:47] FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:43:54] elukey: that’s only in the output file [13:44:00] 13:34:59 Started build-and-push-container-images [13:44:00] 13:34:59 K8s images build/push output redirected to /home/lucaswerkmeister-wmde/scap-image-build-and-push-log [13:44:00] 13:41:07 Finished build-and-push-container-images (duration: 06m 08s) [13:44:08] and once I looked at that file I saw the “sorry” [13:44:19] ahhh right right [13:44:36] (03CR) 10Vgutierrez: [C:04-1] "we need to return `X-Cache: 
hostname int` and `X-Cache-Status: int-tls` here as well" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [13:44:58] !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:45:02] T390805: Enable CampaignEvents Extension on azwiki - https://phabricator.wikimedia.org/T390805 [13:45:04] HouseOfM: please test :) [13:45:32] user rights look promising to me fwiw [13:45:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:46:58] LGTM :) [13:47:01] !log lucaswerkmeister-wmde@deploy1003 mhorsey, lucaswerkmeister-wmde: Continuing with sync [13:47:02] yay [13:47:45] (03CR) 10Tiziano Fogli: [C:03+1] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [13:48:23] (03CR) 10Elukey: [C:03+1] I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans) [13:49:11] (03CR) 10Herron: [C:03+1] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [13:50:36] 07sre-alert-triage, 06SRE Observability, 06Traffic: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T392091 (10LSobanski) 03NEW [13:50:37] 06SRE, 
10fundraising-tech-ops: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10747701 (10Jgreen) [13:50:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T391056)', diff saved to https://phabricator.wikimedia.org/P75117 and previous config saved to /var/cache/conftool/dbconfig/20250416-135059-fceratto.json [13:51:03] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:51:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:51:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75118 and previous config saved to /var/cache/conftool/dbconfig/20250416-135121-fceratto.json [13:51:58] !log "Imported helm311 3.11.3-4 to bullseye-wikimedia and bookworm-wikimedia - T387548" [13:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:02] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [13:52:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:52:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:53:17] (03PS2) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) [13:53:27] (03CR) 10Fabfur: "Do you mean on every error request? 
In this case it's better to provide a separate configuration that will apply to every error generated " [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [13:53:29] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094 (10cmooney) 03NEW p:05Triage→03Low [13:53:41] (03CR) 10CI reject: [V:04-1] cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [13:53:48] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136754|Release campaignEvents extension to azwiki (T390805)]] (duration: 19m 09s) [13:53:51] T390805: Enable CampaignEvents Extension on azwiki - https://phabricator.wikimedia.org/T390805 [13:54:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:54:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:54:51] (03PS1) 10Bking: cirrussearch: Enable row B for OpenSearch migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) [13:55:04] (03CR) 10David Caro: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [13:55:07] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1045.eqiad.wmnet with reason: Bootstrapping — T389423 [13:55:09] !log UTC afternoon backport+config window done [13:55:10] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:32] tysm Lucas_WMDE. [13:56:36] (03PS3) 10Fabfur: cache: copy allowed methods check to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) [13:57:35] Lucas_WMDE: yeah, I didn't have time to patch scap to add the logging in there, only to the build script, sorry [13:57:43] (03CR) 10Vgutierrez: "every response generated by haproxy needs to be flagged as `int-tls`" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [13:57:48] np ^^ [13:58:27] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#10747776 (10Jgreen) [13:58:33] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#10747777 (10Jgreen) [13:58:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10747774 (10Jgreen) [13:58:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [13:58:47] FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400) [14:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:01:39] (03PS3) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [14:01:44] (03CR) 10Volans: [C:03+2] __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans) [14:01:48] (03PS2) 10Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) [14:02:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75119 and previous config saved to /var/cache/conftool/dbconfig/20250416-140228-fceratto.json [14:02:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:03:34] (03PS1) 10Jelto: make helm3 alternative entry dependent on helm [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) [14:04:15] (03PS2) 10Volans: I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 [14:04:23] (03CR) 10Volans: [C:03+2] I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans) 
[14:04:31] (03CR) 10Jelto: "similar change for `helm317`" [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [14:04:47] 07Puppet: Add PATCH method to Wmflib::HTTP::Method - https://phabricator.wikimedia.org/T392096 (10Fabfur) 03NEW [14:04:59] (03PS3) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) [14:05:15] (03CR) 10Elukey: [C:03+1] ServiceOps cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136839 (owner: 10Volans) [14:05:17] (03CR) 10CI reject: [V:04-1] alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [14:05:41] (03CR) 10Elukey: [C:03+1] Traffic cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136840 (owner: 10Volans) [14:06:02] (03PS4) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) [14:06:19] (03CR) 10Elukey: [C:03+1] DataPlatf. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136842 (owner: 10Volans) [14:06:51] (03CR) 10Elukey: [C:03+1] DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans) [14:07:03] (03CR) 10Herron: alertmanager: update irc template for pyrra slo alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [14:07:30] (03CR) 10Brouberol: [C:03+1] cirrussearch: Enable row B for OpenSearch migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:08:00] (03CR) 10DCausse: [C:03+1] cirrussearch: Enable row B for OpenSearch migration. [puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:08:03] (03PS3) 10Elukey: profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) [14:08:03] (03PS4) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [14:08:10] (03Merged) 10jenkins-bot: __title__: remove when it's just the __doc__ [cookbooks] - 10https://gerrit.wikimedia.org/r/1136835 (owner: 10Volans) [14:08:25] (03CR) 10Bking: [C:03+2] cirrussearch: Enable row B for OpenSearch migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/1137008 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:11:05] (03Merged) 10jenkins-bot: I/F cookbooks: use base argument parser [cookbooks] - 10https://gerrit.wikimedia.org/r/1136837 (owner: 10Volans) [14:11:48] (03CR) 10Elukey: [C:03+2] profile::pyrra: enable/disable Istio Pyrra alerts programmatically [puppet] - 10https://gerrit.wikimedia.org/r/1136979 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [14:13:39] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:17] (03PS1) 10Arturo Borrero Gonzalez: prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012 [14:14:50] (03PS1) 10Krinkle: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 [14:14:56] (03PS2) 10Arturo Borrero Gonzalez: prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012 (https://phabricator.wikimedia.org/T391408) [14:15:30] (03CR) 10Herron: "would there be a downside to pushing this even further to say 30+ days essentially to run forcemerge only on the hdd nodes?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [14:15:48] (03CR) 10Volans: [C:03+2] I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans) [14:16:05] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - brouberol@cumin2002 - T388610 [14:16:08] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:16:43] (03CR) 10Krinkle: "This was committed anonymously in Thu 14 Mar 2024 without a change-id." [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [14:17:08] (03CR) 10CI reject: [V:04-1] [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [14:17:25] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2071.codfw.wmnet on all recursors [14:17:29] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2071.codfw.wmnet on all recursors [14:17:29] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2099.codfw.wmnet on all recursors [14:17:32] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2099.codfw.wmnet on all recursors [14:17:33] !log brouberol@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2101.codfw.wmnet on all recursors [14:17:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P75120 and previous config saved to /var/cache/conftool/dbconfig/20250416-141735-fceratto.json [14:17:36] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2101.codfw.wmnet on all recursors [14:18:08] !log bking@cumin2002 START - 
Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [14:18:30] (03CR) 10Krinkle: "I don't know volatile means in this context but https://gerrit.wikimedia.org/r/q/project:operations/puppet+message:%22puppetserver::volati" [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [14:20:43] (03PS5) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [14:20:44] (03PS1) 10Elukey: profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) [14:21:17] (03CR) 10CI reject: [V:04-1] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [14:21:47] (03Merged) 10jenkins-bot: I/F cookbooks: use the parser tuning attributes [cookbooks] - 10https://gerrit.wikimedia.org/r/1136838 (owner: 10Volans) [14:22:05] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: kernel-messages-ignore-regex.txt: ignore another message [puppet] - 10https://gerrit.wikimedia.org/r/1137012 (https://phabricator.wikimedia.org/T391408) (owner: 10Arturo Borrero Gonzalez) [14:22:06] !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech3_amd64.changes: T205378 [14:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:10] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 [14:23:28] (03PS5) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [14:24:53] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [14:26:20] (03CR) 10Kamila Součková: [C:03+2] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:26:26] (03CR) 10BBlack: [C:03+1] "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh) [14:26:42] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [14:26:46] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:26:50] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [14:26:55] (03CR) 10Ssingh: [C:03+2] utils/type65: handle base64 encoded ECHConfigList [dns] - 10https://gerrit.wikimedia.org/r/1137007 (owner: 10Ssingh) [14:27:27] !log sukhe@dns1004 START - running authdns-update [14:27:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:27:58] (03PS2) 10Elukey: profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) [14:27:58] (03PS6) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [14:28:34] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused 
https://wikitech.wikimedia.org/wiki/RESTBase [14:29:29] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5310/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [14:29:55] !log sukhe@dns1004 END - running authdns-update [14:31:28] (03CR) 10Filippo Giunchedi: [C:03+1] alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) (owner: 10Herron) [14:32:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P75121 and previous config saved to /var/cache/conftool/dbconfig/20250416-143242-fceratto.json [14:33:09] (03CR) 10Kamila Součková: [C:03+1] php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [14:33:51] jouncebot: nowandnext [14:33:51] For the next 0 hour(s) and 26 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400) [14:33:52] In 2 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1700) [14:37:36] (03CR) 10FNegri: [C:03+2] openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) (owner: 10FNegri) [14:37:59] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748105 (10cmooney) [14:38:18] (03CR) 10Federico Ceratto: [C:03+1] "Confirmed with @claime on IRC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 
(https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:38:21] (03CR) 10Federico Ceratto: [C:03+2] Add zarcillo k8s service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:38:57] (03CR) 10Tiziano Fogli: [C:03+1] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [14:39:05] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10748126 (10cmooney) These are the two for codfw: ` ip route add vrf vrf-cloudgw blackhole 172.16.128.0/17 metric 9999 ip route add vrf vrf-cloudgw blackhole 2a02:... 
[14:39:12] (03CR) 10Elukey: [V:03+1 C:03+2] profile::pyrra: fix istio burnrates toggle [puppet] - 10https://gerrit.wikimedia.org/r/1137014 (https://phabricator.wikimedia.org/T391925) (owner: 10Elukey) [14:40:17] jouncebot: now and next [14:40:17] For the next 0 hour(s) and 19 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400) [14:40:32] (03CR) 10Filippo Giunchedi: [C:03+2] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [14:40:45] (03PS2) 10Filippo Giunchedi: deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) [14:40:58] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [14:41:09] (03PS2) 10Majavah: Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 [14:41:47] taavi: \o/ o/ [14:41:54] (03CR) 10Cathal Mooney: [C:03+1] Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah) [14:41:58] (03CR) 10Cathal Mooney: [C:03+2] Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah) [14:42:31] (03Merged) 10jenkins-bot: Restore access for taavi [homer/public] - 10https://gerrit.wikimedia.org/r/1127542 (owner: 10Majavah) [14:44:57] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:45:17] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:47:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T391056)', diff saved to https://phabricator.wikimedia.org/P75122 and previous config saved to /var/cache/conftool/dbconfig/20250416-144750-fceratto.json [14:47:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:48:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:48:18] (03CR) 10Kamila Součková: [C:04-2] "> That was deemed a problem in T387781" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [14:49:38] jouncebot: now and next [14:49:38] For the next 0 hour(s) and 10 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1400) [14:50:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:23] (03CR) 10Filippo Giunchedi: "tbh I don't know, though off the top of my head I don't see why not, except maybe forcemerge performance on hdd might be costly? 
we'll nee" [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [14:51:40] (03CR) 10Alexandros Kosiaris: Add zarcillo k8s service (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135414 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:52:39] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:52:47] (03PS1) 10FNegri: wikireplicas: maintain-views should not create _p db [puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) [14:53:06] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:53:14] (03PS6) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:53:19] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:53:34] (03CR) 10CI reject: [V:04-1] [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:53:37] (03CR) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:53:40] (03PS7) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:53:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:54:11] (03CR) 
10CI reject: [V:04-1] [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:54:14] (03PS8) 10Brouberol: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:55:01] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [14:56:26] (03CR) 10Ottomata: [C:03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [14:57:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1229.eqiad.wmnet with reason: Maintenance [14:57:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75123 and previous config saved to /var/cache/conftool/dbconfig/20250416-145718-fceratto.json [14:57:22] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:59:01] (03PS3) 10Scott French: PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) [14:59:01] (03CR) 10Scott French: "Thanks in advance for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [14:59:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75124 and previous config saved to /var/cache/conftool/dbconfig/20250416-145928-fceratto.json [15:00:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:55] (03PS9) 10Brouberol: Configure a scap deployment of mediwiki-dumps-legacy [puppet] - 10https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [15:01:11] (03PS7) 10Elukey: profile::prometheus::k8s: drop two more labels in Istio metrics [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) [15:01:57] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10748293 (10fnegri) Thanks @Jclark-ctr, do you think there is a way to disable the sensor so that it will not trigger the alert? We could also sile... 
[15:02:10] (03PS1) 10Kamila Součková: Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020 [15:02:29] (03PS2) 10Kamila Součková: Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020 [15:03:41] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10748298 (10fgiunchedi) [15:03:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:20] (03PS5) 10JHathaway: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:04:22] (03PS1) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) [15:05:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10748300 (10RobH) 05Open→03Stalled Please note this is stalled while the evaluation of D6 is performed. , please see T392007 and... 
[15:05:19] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:06:38] (03CR) 10Dwisehaupt: [C:03+2] Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:09] (03CR) 10Hnowlan: [C:03+2] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:10:34] (03PS1) 10Ssingh: utils/type65: fix typo s/bas64/base64 [dns] - 10https://gerrit.wikimedia.org/r/1137022 [15:13:38] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [15:14:09] (03CR) 10Ssingh: [V:03+2 C:03+2] "Fixing typo, no code change." 
[dns] - 10https://gerrit.wikimedia.org/r/1137022 (owner: 10Ssingh) [15:14:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P75125 and previous config saved to /var/cache/conftool/dbconfig/20250416-151438-fceratto.json [15:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10748338 (10phaultfinder) [15:14:41] !log sukhe@dns1004 START - running authdns-update [15:15:09] (03CR) 10Kamila Součková: [C:03+2] Revert "CampaignEvents: Migrate aggregateparticipantanswers-test2wiki" [puppet] - 10https://gerrit.wikimedia.org/r/1137020 (owner: 10Kamila Součková) [15:16:31] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans) [15:17:14] !log sukhe@dns1004 END - running authdns-update [15:20:25] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:25] FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-updateMenteeData-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:26] (03Merged) 10jenkins-bot: CollabSvcs cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136841 (owner: 10Volans) [15:26:40] (03CR) 10Tiziano Fogli: [C:03+1] profile::prometheus::k8s: drop two more labels in Istio metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136978 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [15:27:48] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review, Andrew!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [15:27:54] (03CR) 10Ssingh: [C:04-2] "DO NOT MERGE until April 24, week of deploy." [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:29:16] (03Merged) 10jenkins-bot: eventstreams: expose RRLA event stream publicly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [15:29:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P75126 and previous config saved to /var/cache/conftool/dbconfig/20250416-152945-fceratto.json [15:30:42] (03PS1) 10Herron: Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 [15:32:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [15:32:56] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [15:32:56] (03CR) 10CI reject: [V:04-1] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron) [15:33:24] (03PS1) 10Dwisehaupt: hiera: acme_chief: add community-crm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) [15:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10748444 (10phaultfinder) [15:34:50] (03CR) 10Dwisehaupt: "Here is the acmechief stanza I believe we need. 
It is using community-crm instead of the crm role since that is the public name of the ser" [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:35:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:20] (03CR) 10JHathaway: [C:03+2] hiera: acme_chief: add community-crm.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1137028 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [15:37:29] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: connect to address 10.192.0.18 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [15:42:46] MichaelG_WMF: looks like there might be some failures for mediawiki_job_growthexperiments-updateMenteeData-s1.service [15:43:17] hnowlan: meh. Where do you see them? [15:44:08] MichaelG_WMF: there's been one or two SystemdUnitFailed messages in here, at 15:20 and 12:25. 
haven't looked more [15:44:38] * MichaelG_WMF scrolls up [15:44:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T391056)', diff saved to https://phabricator.wikimedia.org/P75127 and previous config saved to /var/cache/conftool/dbconfig/20250416-154452-fceratto.json [15:44:56] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:45:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1233.eqiad.wmnet with reason: Maintenance [15:45:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [15:45:14] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [15:45:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75128 and previous config saved to /var/cache/conftool/dbconfig/20250416-154515-fceratto.json [15:46:34] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2070 to cirrussearch2070 [15:46:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:47:27] hnowlan: I'm seeing it now, thanks. 
Though haven't found them yet in logstash [15:48:19] (03PS1) 10Hnowlan: mw:periodic_job:kubernetes: fail when job name in kubernetes is too long [puppet] - 10https://gerrit.wikimedia.org/r/1137029 [15:49:42] (03PS5) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [15:50:11] (03CR) 10BCornwall: [C:03+1] Revert^2 "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772 (owner: 10Ssingh) [15:51:00] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2070 to cirrussearch2070 - bking@cumin2002" [15:52:27] (03CR) 10Ssingh: [C:03+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133563 (owner: 10Ncmonitor) [15:52:54] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1133563 (owner: 10Ncmonitor) [15:53:27] (03PS2) 10Hnowlan: mw:periodic_job:kubernetes: fail when job name in kubernetes is too long [puppet] - 10https://gerrit.wikimedia.org/r/1137029 [15:53:32] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:55] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [15:54:04] hnowlan: or Amir1: any ideas for how to debug this? `systemctl list-units --state=failed` is not listing the unit [15:54:37] MichaelG_WMF: lemme take a look [15:54:48] claime: thanks! 
[15:54:50] I think you can find logs in /var/log/maint-name [15:55:05] (03PS2) 10Ssingh: Revert^2 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772 [15:55:06] yes, looked at that, contains nothing helpful [15:55:17] (03CR) 10Ssingh: "Updated to durum3003 so the DEfO folks in IE can test." [puppet] - 10https://gerrit.wikimedia.org/r/1136772 (owner: 10Ssingh) [15:55:29] only that the job started for enwiki, but no error message anything of the sort [15:56:09] there is this [15:56:12] https://www.irccloud.com/pastebin/EcbdjS5B/ [15:56:19] Main PID: 31703 (code=exited, status=0/SUCCESS) [15:56:27] it worked correctly [15:56:34] sudo systemctl status mediawiki_job_growthexperiments-updateMenteeData-s1.service [15:56:38] [...] [15:56:38] > Apr 16 15:18:28 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-updateMenteeData-s1.service: Current command vanished from the unit file, execution of the command list won't be resumed. [15:56:43] Apr 16 15:55:16 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: enwiki: Done. Took 2416 seconds. [15:56:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75129 and previous config saved to /var/cache/conftool/dbconfig/20250416-155655-fceratto.json [15:56:59] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:58:18] claime: ok, when I looked minutes ago, the success message wasn't there yet XD [15:58:36] but then why the error messages here about the systemd unit having failed? 
[15:58:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2070 to cirrussearch2070 - bking@cumin2002" [15:58:45] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:58:46] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070 [15:58:55] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2070 [15:59:06] Also, the thing posted by Amir1 sounds strange [15:59:08] > Apr 16 15:18:28 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-updateMenteeData-s1.service: Current command vanished from the unit file, execution of the command list won't be resumed. [15:59:19] (03PS1) 10Ahmon Dancy: spiderpig: Set global_cert_name on deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945) [15:59:22] I have not seen this before [15:59:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2070 to cirrussearch2070 [15:59:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet on all recursors [15:59:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2070.codfw.wmnet on all recursors [15:59:41] Me neither [16:00:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2070.codfw.wmnet with OS bullseye [16:00:14] (03PS2) 10Ahmon Dancy: spiderpig: Set global_cert_name on deployment-deploy04 [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945) [16:00:25] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2070 [16:00:27] (03CR) 10Elukey: [C:03+1] spicerack: enable IRC notification on user input [puppet] - 10https://gerrit.wikimedia.org/r/1136973 (owner: 10Volans) 
[16:00:53] (03CR) 10Elukey: [C:03+1] doc: expand logging documentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136984 (owner: 10Volans) [16:00:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10748625 (10VRiley-WMF) I have sent an email to them requesting an update on this. Awaiting response. [16:01:07] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:01:30] (03CR) 10Ahmon Dancy: "This finalizes a change that was lurking on deployment-puppetserver-1.deployment-prep." [puppet] - 10https://gerrit.wikimedia.org/r/1137030 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [16:01:43] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron) [16:02:04] (03PS1) 10Dwisehaupt: hiera: acme_chief: move community-crm to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) [16:03:21] (03CR) 10JHathaway: [C:03+2] hiera: acme_chief: move community-crm to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:03:29] (03CR) 10JHathaway: [C:03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137032 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:04:26] (03PS3) 10Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) [16:04:44] (03PS3) 10Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 [16:04:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/1136604/5311/" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [16:05:12] (03CR) 10CI 
reject: [V:04-1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [16:05:37] (03CR) 10Filippo Giunchedi: "As discussed at the meeting, pushed forcemerge to 30d" [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [16:05:57] (03PS4) 10Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 [16:06:22] (03CR) 10CI reject: [V:04-1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [16:06:59] (03PS1) 10Clément Goubert: mediawiki: Shorten CronJob names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137033 [16:07:08] !log kevinbazira@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:07:28] !log kevinbazira@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:07:40] (03PS5) 10Hnowlan: mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 [16:07:45] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2070 - bking@cumin2002" [16:07:50] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2070 - bking@cumin2002" [16:07:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:51] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:07:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:07:55] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070 [16:08:23] !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1136772"' [16:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:10] (03CR) 10Ssingh: [C:03+2] Revert^2 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772 (owner: 10Ssingh) [16:10:09] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [16:10:12] !log stopping bird on durum3003 to temporarily disable advertising of anycast IPs [16:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:29] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: testing ECH [16:12:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P75132 and previous config saved to /var/cache/conftool/dbconfig/20250416-161202-fceratto.json [16:13:14] (03CR) 10Hnowlan: [C:03+1] mediawiki: Shorten CronJob names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137033 (owner: 10Clément Goubert) [16:13:50] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Shorten CronJob names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137033 (owner: 10Clément Goubert) [16:15:33] (03CR) 10Scott French: [C:03+1] mediawiki: Shorten CronJob names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137033 (owner: 10Clément Goubert) [16:15:45] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:53] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:16:04] ^ expected, host is depooled [16:16:20] (03Merged) 10jenkins-bot: mediawiki: Shorten CronJob names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137033 (owner: 10Clément Goubert) [16:16:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:17:15] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:21] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:27] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:27] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:33] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:35] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:17:35] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Ncredir [16:18:01] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [16:18:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cirrussearch2070 [16:18:25] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070 [16:18:34] !log bking@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cirrussearch2070 [16:18:34] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.10 [16:18:40] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:20:48] !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.10 (duration: 03m 20s) [16:21:50] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:21:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=93) for host cirrussearch2070 [16:21:55] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2070.codfw.wmnet with OS bullseye [16:22:01] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2095.codfw.wmnet on all recursors [16:22:04] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2095.codfw.wmnet on all recursors [16:22:13] (03CR) 10Clément Goubert: [C:03+1] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [16:22:22] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2110.codfw.wmnet on all recursors [16:22:25] (03PS1) 10Fabfur: wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - 10https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) [16:22:26] !log bking@cumin2002 END (PASS) - 
Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2110.codfw.wmnet on all recursors [16:22:45] (03CR) 10Hnowlan: [C:03+2] mw:periodic_job:kubernetes: shorten job name, check name length [puppet] - 10https://gerrit.wikimedia.org/r/1137029 (owner: 10Hnowlan) [16:23:01] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [16:23:04] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [16:24:15] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 379664 seconds left:Certificate *.wikimania.com valid until 2025-05-20 06:53:14 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:24:21] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 211598 seconds left:Certificate wikipedia.com valid until 2025-05-29 22:00:27 +0000 (expires in 43 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:24:27] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 204332 seconds left:Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 71 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:24:27] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 323492 seconds left:Certificate *.wikispecies.net valid until 2025-05-20 04:52:46 +0000 (expires in 33 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:24:33] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 326306 seconds left:Certificate wikimedia.is valid until 2025-06-05 06:20:49 +0000 (expires in 49 days) https://wikitech.wikimedia.org/wiki/Ncredir 
[16:24:35] RECOVERY - HTTPS non-canonical-redirect-8 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikimediacommons.uk has 190344 seconds left:Certificate wikimediacommons.uk valid until 2025-07-01 19:46:04 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:24:35] RECOVERY - HTTPS non-canonical-redirect-7 on ncredir4001 is OK: SSL OK - OCSP staple validity for wikipedia.ro has 235764 seconds left:Certificate wikipedia.ro valid until 2025-07-01 19:44:46 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:25:01] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir4001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 318178 seconds left:Certificate *.wikipedia.bg valid until 2025-06-07 02:21:46 +0000 (expires in 51 days) https://wikitech.wikimedia.org/wiki/Ncredir [16:25:54] (03CR) 10Ssingh: [C:03+1] wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - 10https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) (owner: 10Fabfur) [16:26:47] (03CR) 10BCornwall: [C:03+1] wmflib: add PATCH method to the Wmflib::HTTP::Method list [puppet] - 10https://gerrit.wikimedia.org/r/1137035 (https://phabricator.wikimedia.org/T392096) (owner: 10Fabfur) [16:27:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P75133 and previous config saved to /var/cache/conftool/dbconfig/20250416-162709-fceratto.json [16:28:00] (03CR) 10BCornwall: [C:03+1] wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [16:31:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:08] !log kevinbazira@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:32:17] PROBLEM - mailman archives on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:31] (03PS1) 10Dwisehaupt: Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1137037 [16:32:42] (03PS4) 10Fabfur: cache,haproxy: allowed methods check and set response headers [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) [16:32:55] !log kevinbazira@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:32:56] (03CR) 10CI reject: [V:04-1] Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1137037 (owner: 10Dwisehaupt) [16:33:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2070.codfw.wmnet with OS bullseye [16:33:23] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2070 [16:33:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:33:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:34:13] (03PS2) 10Dwisehaupt: Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715) [16:34:29] (03CR) 10Dwisehaupt: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:34:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.919 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:34:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:07] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:20] (03CR) 10JHathaway: [C:03+2] Revert "hiera: acme_chief: add community-crm.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1137037 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:35:45] 06SRE, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10748841 (10Dzahn) How about notifications for next time? [16:36:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:07] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:36:10] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2070.codfw.wmnet 110.16.192.10.in-addr.arpa 0.1.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:36:11] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2070 [16:36:12] (03PS6) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [16:36:23] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2070 [16:36:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2070 [16:36:35] PROBLEM - Host kafka-logging2005 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:47] !log kevinbazira@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [16:37:34] !log kevinbazira@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/eventstreams: apply [16:38:16] (03CR) 10Raymond Ndibe: "tested by execing into `toolforge-control-plane` on lima-kilo and everything works as expected. the index is tracking things properly and " [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [16:42:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T391056)', diff saved to https://phabricator.wikimedia.org/P75135 and previous config saved to /var/cache/conftool/dbconfig/20250416-164216-fceratto.json [16:42:20] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:42:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1239.eqiad.wmnet with reason: Maintenance [16:46:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:46:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:48:39] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:39] FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:16] (03PS1) 10Clément Goubert: mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137038 [16:51:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1254.eqiad.wmnet with reason: Maintenance [16:51:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75136 and previous config saved to /var/cache/conftool/dbconfig/20250416-165118-fceratto.json [16:51:22] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:53:59] (03CR) 10Scott French: [C:03+1] mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137038 (owner: 10Clément Goubert) [16:54:05] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137038 (owner: 10Clément Goubert) [16:55:22] (03CR) 10Majavah: "Does this mean we should remove the `--drop` option from the script too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: 10FNegri) [16:56:25] (03Merged) 10jenkins-bot: mediawiki: Don't deploy php-fpm-exporter on non-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137038 (owner: 10Clément Goubert) [16:58:07] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.11 [16:58:39] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:14] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol1005.eqiad.wmnet - https://phabricator.wikimedia.org/T391413#10748937 (10VRiley-WMF) [16:59:28] (03CR) 10Clément Goubert: [V:03+2 C:03+2] php-fpm-multiversion-base: Cleanup unused scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [16:59:44] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol1005.eqiad.wmnet - https://phabricator.wikimedia.org/T391413#10748943 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This is completed [16:59:56] (03CR) 10Clément Goubert: [V:03+2 C:03+2] "This can be auto-picked up by the weekly rebuild, or we can do a full build tomorrow." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1700) [17:00:32] !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.11 (duration: 03m 02s) [17:00:39] (03CR) 10Clément Goubert: [V:03+2 C:03+2] "Actually I don't think that's even needed since the image itself isn't changing, just the repo files." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1136994 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [17:03:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75137 and previous config saved to /var/cache/conftool/dbconfig/20250416-170305-fceratto.json [17:03:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:05:06] (03CR) 10Kevin Bazira: [C:03+2] "archiving the deployment steps here: https://phabricator.wikimedia.org/P75134" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136933 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [17:07:31] (03CR) 10Herron: [C:03+1] logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [17:07:36] (03CR) 10FNegri: "I think that can still be useful, if we have to drop an entire wiki from clouddbs. 
I'm not sure if that ever happened, and what is the cur" [puppet] - 10https://gerrit.wikimedia.org/r/1137019 (https://phabricator.wikimedia.org/T392105) (owner: 10FNegri) [17:09:11] (03PS6) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [17:09:26] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:09:31] (03CR) 10Clément Goubert: [C:03+1] PageTriage: migrate updatePageTriageQueue-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136038 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [17:09:45] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:09:55] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:39] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:15] (03PS7) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [17:16:13] (03CR) 10Vgutierrez: [C:03+1] cache,haproxy: allowed methods check and set response headers [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [17:16:46] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:17:09] (03CR) 10David Caro: [toolforge] persist 
target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [17:18:07] (03CR) 10Fabfur: cache,haproxy: allowed methods check and set response headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [17:18:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P75138 and previous config saved to /var/cache/conftool/dbconfig/20250416-171813-fceratto.json [17:18:36] (03CR) 10Fabfur: "Do you want to split this part into a separate MR?" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [17:21:50] (03CR) 10Clément Goubert: [C:03+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:25:20] (03CR) 10Hnowlan: [C:03+2] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:28:19] (03PS1) 10Bking: cirrussearch: fix row B regex [puppet] - 10https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610) [17:29:19] (03CR) 10Bking: [C:03+2] cirrussearch: fix row B regex [puppet] - 10https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:29:27] (03CR) 10Bking: [C:03+2] "self-merging to prevent failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1137043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:33:06] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:33:11] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:33:21] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P75139 and previous config saved to /var/cache/conftool/dbconfig/20250416-173320-fceratto.json [17:33:38] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage [17:37:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2070.codfw.wmnet with reason: host reimage [17:40:50] (03CR) 10Vgutierrez: [C:03+1] "not really needed given it's the first time we start actively responding in the `tls` frontend" [puppet] - 10https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: 10Fabfur) [17:42:10] (03PS1) 10Bking: WIP: run puppet/restart ferm across DC after reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) [17:44:48] (03PS1) 10Ssingh: Revert^3 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1137046 [17:45:08] (03CR) 10Ssingh: "Context: this worked but since it's a long weekend, we are reverting and will deploy again next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/1137046 (owner: 10Ssingh) [17:48:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T391056)', diff saved to https://phabricator.wikimedia.org/P75140 and previous config saved to /var/cache/conftool/dbconfig/20250416-174828-fceratto.json [17:48:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:48:39] FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:50:22] (03CR) 10Ssingh: [C:03+2] Revert^3 "P:durum: add conditional to enable ECH (durum3003)" [puppet] - 10https://gerrit.wikimedia.org/r/1137046 (owner: 10Ssingh) [17:51:34] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 36431240 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:52:34] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5862000 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:53:39] FIRING: [3x] ProbeDown: Service restbase1045-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:10] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum3003.esams.wmnet with OS bookworm [17:58:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2148.codfw.wmnet with reason: Maintenance 
[17:58:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:58:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75142 and previous config saved to /var/cache/conftool/dbconfig/20250416-175842-fceratto.json [17:58:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:58:46] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:54] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:59:47] ^ exepcted [17:59:49] reimaging [17:59:51] James_F or Reedy: is there a fix in the works for https://phabricator.wikimedia.org/T392086 ? [18:00:05] dduvall and brennen: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T1800) [18:00:53] o/ [18:01:29] brennen: howdy o/ [18:01:58] brennen: currently unsure if we can roll due to https://phabricator.wikimedia.org/T392086 [18:02:40] * brennen nods [18:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:05:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2070.codfw.wmnet with OS bullseye [18:08:58] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [18:09:02] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [18:09:38] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 4.619 second response time https://wikitech.wikimedia.org/wiki/RESTBase [18:09:40] RECOVERY - Restbase root url on restbase1029 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/RESTBase [18:10:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:11:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75144 and previous config saved to /var/cache/conftool/dbconfig/20250416-181105-fceratto.json [18:11:11] 
T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:19:03] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [18:22:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum3003.esams.wmnet with reason: host reimage [18:23:59] brennen: k. that task is no longer a blocker/UBN. rolling [18:24:11] ack, godspeed [18:25:15] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220) [18:25:17] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:26:06] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137048 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:26:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P75145 and previous config saved to /var/cache/conftool/dbconfig/20250416-182613-fceratto.json [18:27:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:29:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10749402 (10VRiley-WMF) [18:30:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10749408 (10VRiley-WMF) ms-fe1015 Rack E8 U 21 Port 17 CableID 240707900054 ms-fe1016 Rack F8 U 22 Port 17 CableID 
240707900052 [18:37:26] (03CR) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [18:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:40:56] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:41:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P75146 and previous config saved to /var/cache/conftool/dbconfig/20250416-184121-fceratto.json [18:41:41] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10749423 (10Eevans) >>! In T391544#10746698, @MatthewVernon wrote: >>>! In T391544#10745829, @Eevans wrote: >>... 
[18:41:46] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:41:57] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum3003.esams.wmnet with OS bookworm [18:42:52] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.25 refs T386220 [18:42:56] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [18:44:10] !log re-enable puppet on A:durum [18:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:51] (03CR) 10Eevans: [C:03+1] DataPers. cookbooks: use the parser tuning attrs [cookbooks] - 10https://gerrit.wikimedia.org/r/1136843 (owner: 10Volans) [18:53:02] (03PS1) 10Ssingh: secret: rename ech-durum.pem [labs/private] - 10https://gerrit.wikimedia.org/r/1137051 [18:54:36] (03CR) 10Ssingh: [V:03+2 C:03+2] secret: rename ech-durum.pem [labs/private] - 10https://gerrit.wikimedia.org/r/1137051 (owner: 10Ssingh) [18:56:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T391056)', diff saved to https://phabricator.wikimedia.org/P75147 and previous config saved to /var/cache/conftool/dbconfig/20250416-185628-fceratto.json [18:56:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:56:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2175.codfw.wmnet with reason: Maintenance [18:56:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75148 and previous config saved to /var/cache/conftool/dbconfig/20250416-185651-fceratto.json [19:06:44] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2063 to cirrussearch2063 [19:06:56] !log bking@cumin2002 START - Cookbook sre.dns.netbox 
[19:07:34] PROBLEM - Disk space on an-worker1116 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 202800 MB (5% inode=99%): /var/lib/hadoop/data/m 222976 MB (5% inode=99%): /var/lib/hadoop/data/b 245413 MB (6% inode=99%): /var/lib/hadoop/data/c 223851 MB (5% inode=99%): /var/lib/hadoop/data/k 156034 MB (4% inode=99%): /var/lib/hadoop/data/i 184038 MB (4% inode=99%): /var/lib/hadoop/data/h 125055 MB (3% inode=99%): /var/lib/hadoop/data [19:07:34] 4 MB (5% inode=99%): /var/lib/hadoop/data/j 152553 MB (4% inode=99%): /var/lib/hadoop/data/d 156819 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops [19:08:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75149 and previous config saved to /var/cache/conftool/dbconfig/20250416-190823-fceratto.json [19:08:27] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:09:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10749567 (10VRiley-WMF) I received this as a response today "After reviewing the debug logs and thermal data, we did not uncover any new information. It appears that the issue is self-correcting until it... 
[19:10:25] (03PS1) 10Vgutierrez: wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) [19:14:03] (03CR) 10CI reject: [V:04-1] wmflib: Add list_secrets() function [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [19:18:59] (03PS1) 10Cwhite: logstash: drop out_request field [puppet] - 10https://gerrit.wikimedia.org/r/1137057 (https://phabricator.wikimedia.org/T390215) [19:20:25] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:25] FIRING: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:34] ^ parsoidtest1001 one shouldn't be there anymore - I'll take a look [19:23:09] (03CR) 10Cwhite: [C:03+2] logstash: drop out_request field [puppet] - 10https://gerrit.wikimedia.org/r/1137057 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [19:23:13] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2063 to cirrussearch2063 - bking@cumin2002" [19:23:30] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2063 to cirrussearch2063 - bking@cumin2002" [19:23:30] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:23:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to 
https://phabricator.wikimedia.org/P75150 and previous config saved to /var/cache/conftool/dbconfig/20250416-192330-fceratto.json [19:23:31] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2063 [19:23:35] dduvall: brennen: any objections if I sneak in a non-deploy (--stop-before-sync) scap run to pick up a make-container-image change? [19:23:41] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2063 [19:24:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2063 to cirrussearch2063 [19:24:22] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2063.codfw.wmnet on all recursors [19:24:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2063.codfw.wmnet on all recursors [19:24:33] swfrench-wmf: no objections here [19:25:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2063.codfw.wmnet with OS bullseye [19:25:42] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2063 [19:25:58] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:26:51] brennen: great, thank you! 
[19:30:08] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2063 - bking@cumin2002" [19:30:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2063 - bking@cumin2002" [19:30:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:14] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2063.codfw.wmnet 108.16.192.10.in-addr.arpa 8.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:30:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2063.codfw.wmnet 108.16.192.10.in-addr.arpa 8.0.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:30:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2063 [19:30:23] !log swfrench@deploy1003 Started scap sync-world: Test stop-before-sync scap run to pick up make-container-image changes - T390251 [19:30:28] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [19:30:31] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2063 [19:30:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2063 [19:30:58] !log swfrench@deploy1003 Stopping before sync operations [19:33:45] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [19:34:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [19:38:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2175', diff saved to https://phabricator.wikimedia.org/P75151 and previous config saved to /var/cache/conftool/dbconfig/20250416-193838-fceratto.json [19:40:01] (03PS1) 10Dzahn: aptrepo: add jenkins to bookworm section in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127) [19:44:53] PROBLEM - Disk space on an-worker1163 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 256348 MB (6% inode=99%): /var/lib/hadoop/data/b 237762 MB (6% inode=99%): /var/lib/hadoop/data/j 146035 MB (3% inode=99%): /var/lib/hadoop/data/l 165764 MB (4% inode=99%): /var/lib/hadoop/data/h 171807 MB (4% inode=99%): /var/lib/hadoop/data/i 122642 MB (3% inode=99%): /var/lib/hadoop/data/k 117285 MB (3% inode=99%): https://wikitech.wik [19:44:53] rg/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1163&var-datasource=eqiad+prometheus/ops [19:45:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage [19:48:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2063.codfw.wmnet with reason: host reimage [19:48:39] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:50:13] (03CR) 10Ebernhardson: [C:03+1] "Reading the docs, this seems like a reasonable change and should do as the commit message says." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853) (owner: 10DCausse) [19:50:20] (03PS2) 10Hashar: Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron) [19:50:59] (03CR) 10Hashar: [C:03+1] "Thank you for the cc: and I feel sorry it did not improve the current situation 😢" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron) [19:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T391056)', diff saved to https://phabricator.wikimedia.org/P75152 and previous config saved to /var/cache/conftool/dbconfig/20250416-195345-fceratto.json [19:54:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: Maintenance [19:54:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75153 and previous config saved to /var/cache/conftool/dbconfig/20250416-195408-fceratto.json [19:54:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:57:23] (03CR) 10Arturo Borrero Gonzalez: [toolforge] persist target logs in /var/log/pods in journald (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [19:57:44] (03PS2) 10Scott French: P:parsoid::mediawiki: use installed PHP versions for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) [19:59:34] (03PS1) 10Cwhite: logstash: also remove 
outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2000). Please do the needful. [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:01:31] (03CR) 10Herron: [C:03+1] logstash: also remove outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [20:01:47] PROBLEM - Disk space on an-worker1139 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 138654 MB (3% inode=99%): /var/lib/hadoop/data/f 209556 MB (5% inode=99%): /var/lib/hadoop/data/j 119935 MB (3% inode=99%): /var/lib/hadoop/data/m 121225 MB (3% inode=99%): /var/lib/hadoop/data/h 200345 MB (5% inode=99%): /var/lib/hadoop/data/k 110428 MB (2% inode=99%): /var/lib/hadoop/data/e 166921 MB (4% inode=99%): /var/lib/hadoop/data [20:01:47] 0 MB (6% inode=99%): /var/lib/hadoop/data/b 209089 MB (5% inode=99%): /var/lib/hadoop/data/d 144571 MB (3% inode=99%): /var/lib/hadoop/data/i 141450 MB (3% inode=99%): /var/lib/hadoop/data/l 159229 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [20:01:55] (03CR) 10AOkoth: [C:03+1] aptrepo: add jenkins to bookworm section in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137060 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [20:02:28] (03CR) 10Cwhite: [C:03+2] logstash: also remove outRequest field [puppet] - 10https://gerrit.wikimedia.org/r/1137064 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [20:03:39] FIRING: [5x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75154 and previous config saved to /var/cache/conftool/dbconfig/20250416-200437-fceratto.json [20:04:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:06:07] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:09:11] (03PS1) 10Cwhite: logstash: expand conditional [puppet] - 10https://gerrit.wikimedia.org/r/1137065 (https://phabricator.wikimedia.org/T390215) [20:12:41] (03CR) 10Scott French: "Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:13:25] (03CR) 10Cwhite: [C:03+2] logstash: expand conditional [puppet] - 10https://gerrit.wikimedia.org/r/1137065 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [20:15:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2063.codfw.wmnet with OS bullseye [20:19:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P75155 and previous config saved to /var/cache/conftool/dbconfig/20250416-201943-fceratto.json [20:20:24] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2077 to cirrussearch2077 [20:20:47] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:22:02] (03CR) 10Dzahn: "This all makes sense to me and looks good just the PHP versions it looks up in Hiera are still 7.4 (installed) and 7.2 (absented). 
Looking" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:25:25] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2077 to cirrussearch2077 - bking@cumin2002" [20:26:07] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2077 to cirrussearch2077 - bking@cumin2002" [20:26:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:26:09] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2077 [20:26:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2077 [20:27:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10749908 (10Eevans) >>! In T391903#10743696, @Jclark-ctr wrote: > @Eevans This server is out of Warranty We have used drives from recently Decom servers please advise when and if you would like to replace.... 
[20:27:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2077 to cirrussearch2077 [20:27:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2077.codfw.wmnet on all recursors [20:27:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2077.codfw.wmnet on all recursors [20:28:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2077.codfw.wmnet with OS bullseye [20:28:15] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2077 [20:28:30] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:33:37] PROBLEM - Disk space on an-worker1088 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/i 143716 MB (3% inode=99%): /var/lib/hadoop/data/k 113100 MB (3% inode=99%): /var/lib/hadoop/data/h 192074 MB (5% inode=99%): /var/lib/hadoop/data/l 183653 MB (4% inode=99%): /var/lib/hadoop/data/e 217515 MB (5% inode=99%): /var/lib/hadoop/data/j 138981 MB (3% inode=99%): /var/lib/hadoop/data/c 130013 MB (3% inode=99%): https://wikitech.wik [20:33:37] rg/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1088&var-datasource=eqiad+prometheus/ops [20:33:39] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2077 - bking@cumin2002" [20:33:45] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2077 - bking@cumin2002" [20:33:45] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:33:45] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2077.codfw.wmnet 125.16.192.10.in-addr.arpa 5.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[20:33:49] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2077.codfw.wmnet 125.16.192.10.in-addr.arpa 5.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:33:50] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2077 [20:34:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2077 [20:34:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2077 [20:34:24] (03PS1) 10Aleksandar Mastilovic: Turn off Gobblin test jobs (all at once). [puppet] - 10https://gerrit.wikimedia.org/r/1137067 (https://phabricator.wikimedia.org/T390249) [20:34:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P75156 and previous config saved to /var/cache/conftool/dbconfig/20250416-203450-fceratto.json [20:42:34] (03CR) 10Scott French: "Excellent question!" [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:43:35] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:44:33] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:46:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [20:48:53] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage [20:49:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10750017 (10Eevans) > sessionstore1006: > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN? It //does// need to stay in row D, yes. If the IP/V... 
[20:49:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T391056)', diff saved to https://phabricator.wikimedia.org/P75157 and previous config saved to /var/cache/conftool/dbconfig/20250416-204957-fceratto.json [20:50:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:50:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance [20:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:52:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2077.codfw.wmnet with reason: host reimage [20:53:31] (03CR) 10Dzahn: [C:03+1] "after you pointed out to me that you are overriding the versions at the host name level in Hiera.. NEVERMIND :) lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:55:49] (03CR) 10Scott French: [C:03+2] P:parsoid::mediawiki: use installed PHP versions for auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/1137062 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [20:56:24] (03Abandoned) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1097535 (https://phabricator.wikimedia.org/T379333) (owner: 10Ryan Kemper) [20:56:46] (03Abandoned) 10Ryan Kemper: cirrus: (WIP) support rename elastic->cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [20:57:28] (03PS1) 10Bking: cirrussearch: prepare for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [20:57:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:57:56] (03CR) 10CI reject: [V:04-1] cirrussearch: prepare for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:58:17] (03CR) 10Ryan Kemper: [C:03+2] wdqs-update-lag: don't count wdqs-categories lag [puppet] - 10https://gerrit.wikimedia.org/r/1133554 (owner: 10Ryan Kemper) [20:58:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2204.codfw.wmnet with reason: Maintenance [20:59:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2204 
(T391056)', diff saved to https://phabricator.wikimedia.org/P75158 and previous config saved to /var/cache/conftool/dbconfig/20250416-205907-fceratto.json [20:59:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:59:16] (03PS2) 10Bking: cirrussearch: prepare for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [20:59:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2100) [21:01:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T391056)', diff saved to https://phabricator.wikimedia.org/P75159 and previous config saved to /var/cache/conftool/dbconfig/20250416-210128-fceratto.json [21:01:32] (03PS1) 10Reedy: specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086) [21:03:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10750114 (10Eevans) > aqs1022 > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN? It //can// go anywhere in row D —or— anywhere in... 
[21:05:21] (03CR) 10Ryan Kemper: [C:03+2] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [21:05:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_php7.4-fpm.service on parsoidtest1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:38] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10750121 (10Eevans) > restbase1045 > [] (service owner) Does the host need to stay in row D and keep its IP/VLAN? Yes. > [] (service owner) What hosts can t... [21:06:23] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137075 [21:06:23] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080) [21:07:17] Reedy: If you want to deploy that ^^ please go ahead, we're in services land only today. 
[21:07:28] (03CR) 10Ecarg: [C:03+2] wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137075 (owner: 10Jforrester) [21:07:41] (03CR) 10Reedy: [C:03+2] specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086) (owner: 10Reedy) [21:07:45] Cheers [21:09:14] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-04-08-183717 to 2025-04-09-214434 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137075 (owner: 10Jforrester) [21:09:43] (03CR) 10Ecarg: [C:03+2] wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080) (owner: 10Jforrester) [21:11:10] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-04-08-183631 to 2025-04-16-192052 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137076 (https://phabricator.wikimedia.org/T367080) (owner: 10Jforrester) [21:11:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2077.codfw.wmnet with OS bullseye [21:13:07] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2079 to cirrussearch2079 [21:13:25] !log ecarg@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:13:29] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:14:00] !log ecarg@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:15:36] !log ecarg@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:16:32] !log ecarg@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:16:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P75160 and 
previous config saved to /var/cache/conftool/dbconfig/20250416-211634-fceratto.json [21:16:48] !log ecarg@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:17:47] !log ecarg@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:18:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2079 to cirrussearch2079 - bking@cumin2002" [21:20:34] (03Merged) 10jenkins-bot: specials: Fix PHP Warning on Special:PasswordReset for crafted input [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137073 (https://phabricator.wikimedia.org/T392086) (owner: 10Reedy) [21:21:43] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]] [21:21:47] T392086: PHP Warning: Array to string conversion / RuntimeException: PCRE failure on Special:PasswordReset - https://phabricator.wikimedia.org/T392086 [21:25:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2079 to cirrussearch2079 - bking@cumin2002" [21:25:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:09] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2079 [21:26:37] !log reedy@deploy1003 reedy: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:42] !log reedy@deploy1003 reedy: Continuing with sync [21:26:58] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2079 [21:27:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2079 to 
cirrussearch2079 [21:27:39] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2079.codfw.wmnet on all recursors [21:27:42] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2079.codfw.wmnet on all recursors [21:27:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2079.codfw.wmnet with OS bullseye [21:28:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2079 [21:30:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:30:36] (03PS3) 10Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:31:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P75161 and previous config saved to /var/cache/conftool/dbconfig/20250416-213141-fceratto.json [21:33:30] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137073|specials: Fix PHP Warning on Special:PasswordReset for crafted input (T392086)]] (duration: 11m 47s) [21:33:33] T392086: PHP Warning: Array to string conversion / RuntimeException: PCRE failure on Special:PasswordReset - https://phabricator.wikimedia.org/T392086 [21:34:06] (03CR) 10Cwhite: [C:03+2] logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [21:34:47] (03CR) 10Cwhite: [C:03+2] Revert "logstash: increase refresh_interval to 10s in index templates" [puppet] - 10https://gerrit.wikimedia.org/r/1137027 (owner: 10Herron) [21:37:05] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:39:03] (03PS3) 10Bking: cirrussearch: 
prepare for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) [21:39:15] (03PS4) 10Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:39:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:41:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2079 - bking@cumin2002" [21:41:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2079 - bking@cumin2002" [21:41:24] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:41:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2079.codfw.wmnet 128.16.192.10.in-addr.arpa 8.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:41:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2079.codfw.wmnet 128.16.192.10.in-addr.arpa 8.2.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:41:29] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2079 [21:41:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/745496 (owner: 10JMeybohm) [21:41:47] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2079 [21:41:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2079 [21:46:28] (03CR) 10CI reject: [V:04-1] 
sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:46:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T391056)', diff saved to https://phabricator.wikimedia.org/P75162 and previous config saved to /var/cache/conftool/dbconfig/20250416-214648-fceratto.json [21:46:52] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:47:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2225.codfw.wmnet with reason: Maintenance [21:47:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75163 and previous config saved to /var/cache/conftool/dbconfig/20250416-214710-fceratto.json [21:49:29] (03PS5) 10Ryan Kemper: sre.elasticsearch.rolling-operation: restart ferm after host rename [cookbooks] - 10https://gerrit.wikimedia.org/r/1137045 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:53:39] FIRING: ProbeDown: Service restbase1045-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1045-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:55:33] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2.* [21:56:18] !log bking@cumin2002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage [21:58:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75164 and previous config saved to /var/cache/conftool/dbconfig/20250416-215804-fceratto.json [21:58:08] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:59:03] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250416T2200) [22:00:05] aude: A patch you scheduled for Web Team deployment window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[22:01:05] deploying updates to the chart renderer service in a few minutes [22:01:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2079.codfw.wmnet with reason: host reimage [22:02:40] (03PS5) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [22:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:09:38] (03PS1) 10Aude: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) [22:13:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P75165 and previous config saved to /var/cache/conftool/dbconfig/20250416-221311-fceratto.json [22:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10750370 (10phaultfinder) [22:16:09] (03CR) 10Seddon: "Deployment approved." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: 10Aude) [22:16:22] (03CR) 10Seddon: [C:03+1] Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: 10Aude) [22:16:58] (03CR) 10Aude: [C:03+2] Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: 10Aude) [22:18:31] (03Merged) 10jenkins-bot: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137082 (https://phabricator.wikimedia.org/T386027) (owner: 10Aude) [22:19:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10750374 (10RobH) 05Open→03Stalled a:05Kappakayala→03RobH Please note this needs to be stalled as it turns out we may not use D6 for frack. Please take no further... [22:19:42] (03CR) 10JHathaway: "thanks for giving it a try @ltoscano@wikimedia.org. 
Also, thanks for spotting the `Gemfile.lock` issue, the path was wrong, it should be `" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [22:20:47] !log aude@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [22:21:24] !log aude@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [22:26:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2079.codfw.wmnet with OS bullseye [22:27:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:28:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P75166 and previous config saved to /var/cache/conftool/dbconfig/20250416-222818-fceratto.json [22:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:36:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row B - bking@cumin2002 - T388610 [22:36:14] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:43:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T391056)', diff saved to https://phabricator.wikimedia.org/P75167 and previous config saved to /var/cache/conftool/dbconfig/20250416-224325-fceratto.json [22:43:29] T391056: Drop 
afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:43:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2226.codfw.wmnet with reason: Maintenance [22:43:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance [22:44:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75168 and previous config saved to /var/cache/conftool/dbconfig/20250416-224405-fceratto.json [22:46:12] !log aude@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [22:46:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75169 and previous config saved to /var/cache/conftool/dbconfig/20250416-224627-fceratto.json [22:46:45] !log aude@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [22:48:39] RESOLVED: ProbeDown: Service restbase1045-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1045-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:17] !log aude@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [22:49:49] !log aude@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [22:53:22] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10750415 (10bking) Thanks @Jhancock.wm ! Will try and reimage now. 
[22:54:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [22:54:20] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [22:54:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [22:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:59:07] (03PS1) 10Bking: cirrussearch: Add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1137086 (https://phabricator.wikimedia.org/T388610) [23:00:47] (03PS2) 10Bking: cirrussearch: Add newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1137086 (https://phabricator.wikimedia.org/T388610) [23:01:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P75170 and previous config saved to /var/cache/conftool/dbconfig/20250416-230134-fceratto.json [23:09:08] (03PS1) 10BryanDavis: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) [23:10:56] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1043.eqiad.wmnet [23:10:56] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1043.eqiad.wmnet [23:11:03] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for restbase1044.eqiad.wmnet [23:11:03] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1044.eqiad.wmnet [23:11:11] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for 
restbase1045.eqiad.wmnet [23:11:11] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1045.eqiad.wmnet [23:14:07] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye [23:15:04] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1028.eqiad.wmnet with reason: Decommissioning — T389423 [23:15:07] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [23:15:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [23:15:59] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [23:15:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [23:16:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P75171 and previous config saved to /var/cache/conftool/dbconfig/20250416-231641-fceratto.json [23:16:51] !log decommissioning restbase1028/Cassandra — T389423 [23:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:25] FIRING: SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:41] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch2091.codfw.wmnet with OS bullseye [23:31:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T391056)', diff saved to https://phabricator.wikimedia.org/P75172 and previous config saved to /var/cache/conftool/dbconfig/20250416-233148-fceratto.json [23:31:53] T391056: Drop 
afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:31:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2238.codfw.wmnet with reason: Maintenance [23:32:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75173 and previous config saved to /var/cache/conftool/dbconfig/20250416-233200-fceratto.json [23:33:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [23:33:35] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [23:33:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [23:33:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2091.codfw.wmnet with OS bullseye [23:34:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2091.codfw.wmnet with OS bullseye [23:34:45] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2091 [23:34:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2091 [23:40:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137089 [23:40:33] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137089 (owner: 10TrainBranchBot) [23:42:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T391056)', diff saved to https://phabricator.wikimedia.org/P75174 and previous config saved to /var/cache/conftool/dbconfig/20250416-234221-fceratto.json [23:42:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 
[23:45:47] PROBLEM - Disk space on an-worker1114 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 156818 MB (4% inode=99%): /var/lib/hadoop/data/k 245919 MB (6% inode=99%): /var/lib/hadoop/data/h 247407 MB (6% inode=99%): /var/lib/hadoop/data/b 146976 MB (3% inode=99%): /var/lib/hadoop/data/d 202739 MB (5% inode=99%): /var/lib/hadoop/data/f 233204 MB (6% inode=99%): /var/lib/hadoop/data/i 215232 MB (5% inode=99%): /var/lib/hadoop/data [23:45:47] 6 MB (4% inode=99%): /var/lib/hadoop/data/l 164780 MB (4% inode=99%): /var/lib/hadoop/data/c 242294 MB (6% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1114&var-datasource=eqiad+prometheus/ops [23:52:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137089 (owner: 10TrainBranchBot) [23:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:57:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P75175 and previous config saved to /var/cache/conftool/dbconfig/20250416-235728-fceratto.json