[00:03:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1132:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1132 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:03:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P74770 and previous config saved to /var/cache/conftool/dbconfig/20250409-000358-fceratto.json [00:09:33] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135142 [00:09:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135142 (owner: TrainBranchBot) [00:10:29] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:14:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [00:14:27] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [00:16:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3190 MB (3% inode=98%): /tmp 3190 MB (3% inode=98%): /var/tmp 3190 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [00:17:04] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2062 to cirrussearch2062 [00:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:17:16] !log bking@cumin2002 START - Cookbook sre.dns.netbox [00:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [00:19:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P74771 and previous config saved to /var/cache/conftool/dbconfig/20250409-001905-fceratto.json [00:23:06] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2062 to cirrussearch2062 - bking@cumin2002" [00:23:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2062 to cirrussearch2062 - bking@cumin2002" [00:23:29] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:23:29] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2062 [00:23:42] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2062 [00:24:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2062 to cirrussearch2062 [00:24:23] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2062.codfw.wmnet on all recursors [00:24:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2062.codfw.wmnet on all recursors [00:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:24:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2062.codfw.wmnet with OS bullseye [00:25:03] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2062 [00:25:14] !log bking@cumin2002 START - Cookbook sre.dns.netbox [00:28:55] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1135142 (owner: TrainBranchBot) [00:29:30] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2062 - bking@cumin2002" [00:29:35] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2062 - bking@cumin2002" [00:29:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:29:36] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2062.codfw.wmnet 144.0.192.10.in-addr.arpa 4.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [00:29:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2062.codfw.wmnet 144.0.192.10.in-addr.arpa 4.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [00:29:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2062 [00:29:54] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2062 [00:29:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2062 [00:34:13] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2162 (T391056)', diff saved to https://phabricator.wikimedia.org/P74772 and previous config saved to /var/cache/conftool/dbconfig/20250409-003412-fceratto.json [00:34:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:34:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2163.codfw.wmnet with reason: Maintenance [00:34:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T391056)', diff saved to https://phabricator.wikimedia.org/P74773 and previous config saved to /var/cache/conftool/dbconfig/20250409-003434-fceratto.json [00:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [00:44:02] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:14] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2062.codfw.wmnet with reason: host reimage [00:47:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T391056)', diff saved to https://phabricator.wikimedia.org/P74774 and previous config saved to /var/cache/conftool/dbconfig/20250409-004717-fceratto.json [00:47:20] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - 
https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [00:49:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2062.codfw.wmnet with reason: host reimage [01:02:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P74775 and previous config saved to /var/cache/conftool/dbconfig/20250409-010224-fceratto.json [01:07:05] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [01:14:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2062.codfw.wmnet with OS bullseye [01:15:34] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2088 to cirrussearch2088 [01:15:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [01:17:13] FIRING: [2x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:17:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P74776 and previous config saved to /var/cache/conftool/dbconfig/20250409-011731-fceratto.json [01:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [01:20:03] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2088 to cirrussearch2088 - bking@cumin2002" [01:21:55] !log bking@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2088 to cirrussearch2088 - bking@cumin2002" [01:21:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:21:56] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2088 [01:22:22] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2088 [01:23:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2088 to cirrussearch2088 [01:23:02] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2088.codfw.wmnet on all recursors [01:23:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2088.codfw.wmnet on all recursors [01:24:12] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2088.codfw.wmnet with OS bullseye [01:24:23] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2088 [01:24:38] !log bking@cumin2002 START - Cookbook sre.dns.netbox [01:24:44] PROBLEM - Ensure traffic_manager is running for instance backend on cp6011 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [01:25:44] RECOVERY - Ensure traffic_manager is running for instance backend on cp6011 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [01:27:05] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [01:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:32:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T391056)', diff saved to https://phabricator.wikimedia.org/P74777 and previous config saved to /var/cache/conftool/dbconfig/20250409-013238-fceratto.json [01:32:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:32:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2164.codfw.wmnet with reason: Maintenance [01:33:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [01:33:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T391056)', diff saved to https://phabricator.wikimedia.org/P74778 and previous config saved to /var/cache/conftool/dbconfig/20250409-013316-fceratto.json [01:33:36] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2088 - bking@cumin2002" [01:33:42] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2088 - bking@cumin2002" [01:33:42] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:33:43] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2088.codfw.wmnet 91.0.192.10.in-addr.arpa 1.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [01:33:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2088.codfw.wmnet 91.0.192.10.in-addr.arpa 1.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [01:33:47] !log bking@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host cirrussearch2088 [01:33:58] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2088 [01:33:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2088 [01:40:27] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:46:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T391056)', diff saved to https://phabricator.wikimedia.org/P74779 and previous config saved to /var/cache/conftool/dbconfig/20250409-014612-fceratto.json [01:46:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [01:50:27] RESOLVED: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:01:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P74780 and previous config saved to /var/cache/conftool/dbconfig/20250409-020119-fceratto.json [02:12:06] (Abandoned) Ryan Kemper: wdqs-internal: remove absented monitoring check [puppet] - https://gerrit.wikimedia.org/r/1100871 
(https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper) [02:12:40] (CR) Ryan Kemper: [C:+2] sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage [cookbooks] - https://gerrit.wikimedia.org/r/1135133 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper) [02:16:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P74781 and previous config saved to /var/cache/conftool/dbconfig/20250409-021626-fceratto.json [02:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [02:22:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:24:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [02:25:31] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:27:38] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:31:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T391056)', diff saved to https://phabricator.wikimedia.org/P74782 and previous config saved to /var/cache/conftool/dbconfig/20250409-023134-fceratto.json [02:31:37] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:31:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: Maintenance [02:31:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74783 and previous config saved to /var/cache/conftool/dbconfig/20250409-023156-fceratto.json [02:31:58] !log jhancock@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti servers to codfw - jhancock@cumin2002" [02:32:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti servers to codfw - jhancock@cumin2002" [02:32:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:33:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [02:33:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048 [02:33:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [02:33:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048 [02:33:42] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2049 [02:33:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2049 [02:33:51] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2050 [02:34:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2050 [02:37:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:44:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74784 and previous config saved to /var/cache/conftool/dbconfig/20250409-024439-fceratto.json [02:44:43] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:47:42] FIRING: 
AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [02:48:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:49:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:49:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:49:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:49:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:49:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:50:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:50:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:55:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:55:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:59:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [02:59:47] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74785 and previous config saved to /var/cache/conftool/dbconfig/20250409-025946-fceratto.json [03:00:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:00:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:00:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:00:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:01:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:05:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:05:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:08:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [03:08:29] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10724457 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [03:08:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2088.codfw.wmnet with OS bullseye [03:09:00] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2089 to cirrussearch2089 [03:09:22] !log bking@cumin2002 START - Cookbook sre.dns.netbox [03:14:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74786 and previous config saved to /var/cache/conftool/dbconfig/20250409-031453-fceratto.json [03:15:52] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2089 to cirrussearch2089 - bking@cumin2002" [03:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [03:17:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2089 to cirrussearch2089 - bking@cumin2002" [03:17:47] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:17:48] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2089 [03:18:15] !log bking@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2089 [03:18:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2089 to cirrussearch2089 [03:18:56] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2089.codfw.wmnet on all recursors [03:18:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2089.codfw.wmnet on all recursors [03:20:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2089.codfw.wmnet with OS bullseye [03:20:13] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2089 [03:20:53] !log bking@cumin2002 START - Cookbook sre.dns.netbox [03:24:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS7195/IPv6: Connect - EdgeUno, AS7195/IPv4: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:25:54] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2089 - bking@cumin2002" [03:25:59] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2089 - bking@cumin2002" [03:26:00] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [03:26:00] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2089.codfw.wmnet 92.0.192.10.in-addr.arpa 2.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [03:26:03] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2089.codfw.wmnet 92.0.192.10.in-addr.arpa 2.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [03:26:04] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host 
cirrussearch2089 [03:30:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74787 and previous config saved to /var/cache/conftool/dbconfig/20250409-033001-fceratto.json [03:30:04] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:30:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2166.codfw.wmnet with reason: Maintenance [03:30:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74788 and previous config saved to /var/cache/conftool/dbconfig/20250409-033025-fceratto.json [03:30:34] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2089 [03:30:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2089 [03:35:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:35:52] PROBLEM - Check unit status of sync-puppet-volatile on puppetmaster2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:37:27] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10724471 (Jhancock.wm) heads up. 
i got the new drives in and installed them. i redid the provisioning successfully. when i tried to image ganeti2045, I got an erro... [03:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:43:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74789 and previous config saved to /var/cache/conftool/dbconfig/20250409-034302-fceratto.json [03:43:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:45:52] RECOVERY - Check unit status of sync-puppet-volatile on puppetmaster2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [03:48:10] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2089.codfw.wmnet with reason: host reimage [03:50:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2089.codfw.wmnet with reason: host reimage [03:58:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P74790 and previous config saved to 
/var/cache/conftool/dbconfig/20250409-035810-fceratto.json [04:13:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P74791 and previous config saved to /var/cache/conftool/dbconfig/20250409-041317-fceratto.json [04:15:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (cloudcontrol1011), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:17:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2089.codfw.wmnet with OS bullseye [04:17:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [04:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:17:16] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [04:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [04:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:28:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74792 and previous config saved to 
/var/cache/conftool/dbconfig/20250409-042824-fceratto.json [04:28:28] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:28:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: Maintenance [04:28:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74793 and previous config saved to /var/cache/conftool/dbconfig/20250409-042846-fceratto.json [04:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [04:41:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74794 and previous config saved to /var/cache/conftool/dbconfig/20250409-044134-fceratto.json [04:41:38] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [04:56:07] (PS1) Kevin Bazira: changeprop: add liftwing RRLA source stream to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1135153 (https://phabricator.wikimedia.org/T326179) [04:56:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P74795 and previous config saved to /var/cache/conftool/dbconfig/20250409-045642-fceratto.json [05:01:38] !log bking@cumin2002 START - Cookbook
sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [05:01:41] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [05:05:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2090 to cirrussearch2090 [05:05:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [05:10:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P74796 and previous config saved to /var/cache/conftool/dbconfig/20250409-051149-fceratto.json [05:12:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:15:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:15:42] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:02] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:51] FIRING: CoreRouterInterfaceDown: Core router interface down - 
cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:17:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:18:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:21:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [05:23:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:26:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74797 and previous config saved to 
/var/cache/conftool/dbconfig/20250409-052656-fceratto.json [05:27:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:27:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: Maintenance [05:27:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T391056)', diff saved to https://phabricator.wikimedia.org/P74798 and previous config saved to /var/cache/conftool/dbconfig/20250409-052719-fceratto.json [05:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:32:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:39:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T391056)', diff saved to https://phabricator.wikimedia.org/P74799 and previous config saved to /var/cache/conftool/dbconfig/20250409-053957-fceratto.json [05:40:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [05:48:12] (PS1) Marostegui: installserver: Add db1246 [puppet] - https://gerrit.wikimedia.org/r/1135158 (https://phabricator.wikimedia.org/T391372) [05:49:46] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2090 to cirrussearch2090 - bking@cumin2002"
[05:50:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74800 and previous config saved to /var/cache/conftool/dbconfig/20250409-055028-marostegui.json [05:50:31] T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317 [05:50:42] (CR) Marostegui: [C:+2] installserver: Add db1246 [puppet] - https://gerrit.wikimedia.org/r/1135158 (https://phabricator.wikimedia.org/T391372) (owner: Marostegui) [05:50:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2142.codfw.wmnet,db1152.eqiad.wmnet with reason: Maintenance [05:52:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2089-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:53:06] (PS1) Marostegui: mariadb: Migrate db1152,db2142 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1135159 (https://phabricator.wikimedia.org/T391317) [05:53:56] (CR) Marostegui: [C:+2] mariadb: Migrate db1152,db2142 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1135159 (https://phabricator.wikimedia.org/T391317) (owner: Marostegui) [05:55:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P74801 and previous config saved to /var/cache/conftool/dbconfig/20250409-055504-fceratto.json [05:59:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74802 and previous config saved to /var/cache/conftool/dbconfig/20250409-055903-marostegui.json [05:59:07] T391317: Migrate msX sections to
MariaDB 10.11 - https://phabricator.wikimedia.org/T391317 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T0600) [06:00:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:02:38] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2089-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:10:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P74803 and previous config saved to /var/cache/conftool/dbconfig/20250409-061012-fceratto.json [06:11:38] RECOVERY - Ensure traffic_server is running for instance backend on cp4047 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [06:13:13] (CR) Ayounsi: [C:+1] "niiiiice!"
[software/homer] - https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [06:16:47] (CR) Ayounsi: [C:+1] homer: move NetboxData initialization [software/homer] - https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [06:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [06:20:33] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2090 to cirrussearch2090 - bking@cumin2002" [06:20:34] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:20:35] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2090 [06:24:25] (CR) Ayounsi: [C:+1] commit: refactor asking for approval [software/homer] - https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [06:25:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T391056)', diff saved to https://phabricator.wikimedia.org/P74804 and previous config saved to /var/cache/conftool/dbconfig/20250409-062519-fceratto.json [06:25:22] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [06:25:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2195.codfw.wmnet with reason: Maintenance [06:25:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T391056)', diff saved to https://phabricator.wikimedia.org/P74805 and previous config saved to /var/cache/conftool/dbconfig/20250409-062542-fceratto.json [06:36:12] (PS1) Phedenskog: perf/navtiming: migrate
alerts from grafana to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) [06:37:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T391056)', diff saved to https://phabricator.wikimedia.org/P74806 and previous config saved to /var/cache/conftool/dbconfig/20250409-063718-fceratto.json [06:37:22] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [06:46:48] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2090 [06:47:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2090 to cirrussearch2090 [06:47:29] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2090.codfw.wmnet on all recursors [06:47:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2090.codfw.wmnet on all recursors [06:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [06:52:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P74807 and previous config saved to /var/cache/conftool/dbconfig/20250409-065225-fceratto.json [06:57:39] (CR) Jelto: [V:+1 C:+2] trafficserver: switch all querybuilder backends to wikikube [puppet] - https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) (owner: Jelto) [07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:03] (PS4) Volans: commit: allow to approve/reject diffs globally [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) [07:01:03] (PS4) Volans: doc: update documentation configuration [software/homer] - https://gerrit.wikimedia.org/r/1134717 [07:04:31] (CR) Abijeet Patro: AX: Enable entry-points on Tswana and Venetian wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro) [07:04:33] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:04:41] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:17] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2090.codfw.wmnet with OS bullseye [07:05:23] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:05:29] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2090 [07:05:31] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:06:09] (CR) Volans: "addressed comments, changed to yes/no/all/none" [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [07:06:28] (PS5) Volans: commit: allow to approve/reject diffs globally [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) [07:06:29] (PS5) Volans: doc: update documentation configuration [software/homer] - https://gerrit.wikimedia.org/r/1134717 [07:07:33] !log fceratto@cumin1002 dbctl
commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P74808 and previous config saved to /var/cache/conftool/dbconfig/20250409-070733-fceratto.json [07:09:46] !log bking@cumin2002 START - Cookbook sre.dns.netbox [07:10:53] (CR) Volans: [C:+2] spicerack: add Spicerack interactive shell [puppet] - https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: Volans) [07:17:50] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2090 - bking@cumin2002" [07:17:55] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2090 - bking@cumin2002" [07:17:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:17:56] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2090.codfw.wmnet 97.0.192.10.in-addr.arpa 7.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:17:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2090.codfw.wmnet 97.0.192.10.in-addr.arpa 7.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [07:18:00] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2090 [07:19:41] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2090 [07:19:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2090 [07:19:52] (PS3) Abijeet Patro: AX: Enable Quick Surveys extension on Asturian and Lombard wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1135337 (https://phabricator.wikimedia.org/T390023) [07:20:00]
(PS3) Abijeet Patro: AX: Enable entry-points on Asturian and Lombard wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) [07:20:12] (PS1) Tiziano Fogli: doc/readme: fix test command [alerts] - https://gerrit.wikimedia.org/r/1135342 [07:21:19] (CR) Volans: [C:+2] cumin: Update insetup role report [puppet] - https://gerrit.wikimedia.org/r/1134632 (https://phabricator.wikimedia.org/T389825) (owner: Volans) [07:21:20] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10724690 (ayounsi) @BTullis following up from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134670 an-worker11... [07:22:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T391056)', diff saved to https://phabricator.wikimedia.org/P74809 and previous config saved to /var/cache/conftool/dbconfig/20250409-072240-fceratto.json [07:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [07:22:44] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [07:22:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: Maintenance [07:31:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2200.codfw.wmnet with reason: Maintenance [07:33:45] SRE, Prod-Kubernetes, Traffic, Wikidata, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10724715 (Ifrahkhanyaree_WMDE) [07:34:55] !log brouberol@deploy1003 helmfile
[dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:35:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:37:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2090.codfw.wmnet with reason: host reimage [07:40:17] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10724716 (elukey) Thanks Jenn! I see that the new disk is listed as "Good" this time (as opposed to "Bad"), but I think we'll still n... [07:40:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2241.codfw.wmnet with reason: Maintenance [07:40:46] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10724717 (elukey) @Jhancock.wm could you please restore the old disk? So I'll make the same test..
[07:41:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2090.codfw.wmnet with reason: host reimage [07:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [07:49:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2242.codfw.wmnet with reason: Maintenance [07:53:11] (CR) DCausse: [C:+1] Search update pipeline: 504 handling, weighted tags rename [deployment-charts] - https://gerrit.wikimedia.org/r/1135019 (https://phabricator.wikimedia.org/T389053) (owner: Peter Fischer) [07:53:40] (Abandoned) AOkoth: releases: add force puppet 7 hiera [puppet] - https://gerrit.wikimedia.org/r/1135089 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth) [07:55:01] (PS2) Phedenskog: perf/navtiming: Add FCP alert to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) [07:55:52] (CR) DCausse: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: Peter Fischer) [07:56:18] (CR) Filippo Giunchedi: [C:+1] doc/readme: fix test command [alerts] - https://gerrit.wikimedia.org/r/1135342 (owner: Tiziano Fogli) [07:58:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2243.codfw.wmnet with reason: Maintenance [07:58:15] !log fceratto@cumin1002
dbctl commit (dc=all): 'Depooling db2243 (T391056)', diff saved to https://phabricator.wikimedia.org/P74810 and previous config saved to /var/cache/conftool/dbconfig/20250409-075815-fceratto.json [07:58:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:05:13] (PS1) Elukey: services: move citoid to Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1135378 [08:08:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2243 (T391056)', diff saved to https://phabricator.wikimedia.org/P74811 and previous config saved to /var/cache/conftool/dbconfig/20250409-080826-fceratto.json [08:08:29] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:09:08] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135337 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro) [08:09:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro) [08:09:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2090.codfw.wmnet with OS bullseye [08:09:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [08:09:59] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [08:14:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the
[Thursday, April 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro) [08:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [08:23:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2243', diff saved to https://phabricator.wikimedia.org/P74812 and previous config saved to /var/cache/conftool/dbconfig/20250409-082333-fceratto.json [08:23:37] (PS1) Clément Goubert: php: mwscript bugfix [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135379 (https://phabricator.wikimedia.org/T387208) [08:24:30] (CR) Ayounsi: [V:+1] "Tested the none and all options. Works as expected." [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [08:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:33:17] (CR) Ayounsi: [V:+1 C:+1] "awesome work!!
thanks" [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: Volans) [08:33:35] (PS1) Volans: reports: catch wrong rows in accounting [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 [08:33:35] (PS1) Volans: reports: fix librenms error [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 [08:33:56] (CR) Volans: "tested on netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 (owner: Volans) [08:34:13] (CR) Volans: "tested on netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 (owner: Volans) [08:36:26] (CR) Elukey: [C:+1] reports: catch wrong rows in accounting [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 (owner: Volans) [08:36:54] (PS2) Clément Goubert: MWScript.php: exit code on mesh, longer timeout [mediawiki-config] - https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) [08:36:54] (CR) Ayounsi: [C:+1] reports: catch wrong rows in accounting (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 (owner: Volans) [08:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:37:14] (CR) Ayounsi: [C:+1] reports: fix librenms error [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 (owner: Volans) [08:37:17] (CR) Elukey: [C:+1] reports: fix librenms error [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 (owner: Volans) [08:37:19] (PS2) Brouberol: airflow: scrape additional metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1135001
(https://phabricator.wikimedia.org/T391332) [08:37:27] (CR) CI reject: [V:-1] airflow: scrape additional metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: Brouberol) [08:37:51] (PS3) Brouberol: airflow: scrape additional metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) [08:38:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2243', diff saved to https://phabricator.wikimedia.org/P74813 and previous config saved to /var/cache/conftool/dbconfig/20250409-083840-fceratto.json [08:39:20] (CR) Volans: "Changed in log_failure as suggestes, will self-merge given it was already reviewed" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 (owner: Volans) [08:40:16] (PS2) Volans: reports: catch wrong rows in accounting [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 [08:40:16] (PS2) Volans: reports: fix librenms error [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 [08:40:32] (CR) Alexandros Kosiaris: [C:+1] services: move citoid to Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1135378 (owner: Elukey) [08:44:31] (CR) Tiziano Fogli: [C:+2] doc/readme: fix test command [alerts] - https://gerrit.wikimedia.org/r/1135342 (owner: Tiziano Fogli) [08:46:05] (CR) Volans: [C:+2] reports: catch wrong rows in accounting [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135380 (owner: Volans) [08:46:12] (CR) Volans: [C:+2] reports: fix librenms error [software/netbox-extras] - https://gerrit.wikimedia.org/r/1135381 (owner: Volans) [08:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[08:48:10] (03Merged) 10jenkins-bot: reports: catch wrong rows in accounting [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135380 (owner: 10Volans) [08:48:11] (03Merged) 10jenkins-bot: reports: fix librenms error [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135381 (owner: 10Volans) [08:49:27] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [08:50:18] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [08:53:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2243 (T391056)', diff saved to https://phabricator.wikimedia.org/P74814 and previous config saved to /var/cache/conftool/dbconfig/20250409-085347-fceratto.json [08:53:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:54:02] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:54:32] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:01:56] (03PS1) 10Jelto: wikidata-query-gui: add query-legacy-full to existing gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135383 (https://phabricator.wikimedia.org/T350793) [09:05:53] !log rollout security upgrades for ghostscript [09:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:01] (03PS3) 10Clément Goubert: mwcron: Allow setting ttlSecondsAfterFinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) [09:10:40] (03PS3) 10Clément Goubert: MWScript.php: exit code on mesh, longer timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) [09:15:49] (03CR) 10Elukey: [C:03+2] services: move citoid to Ingress [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1135378 (owner: 10Elukey) [09:17:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:17:11] (03PS1) 10Alexandros Kosiaris: Remove x2, add ms{1,2,3} to profile::mariadb::section_ports: [puppet] - 10https://gerrit.wikimedia.org/r/1135385 (https://phabricator.wikimedia.org/T387332) [09:17:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:27] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/citoid: sync [09:18:40] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: sync [09:19:31] (03CR) 10Hnowlan: [C:03+1] "lgtm, one optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [09:21:06] (03PS4) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) [09:21:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:22:41] (03CR) 10Hnowlan: [C:03+2] service, conftool: remove videoscaler and jobrunner services [puppet] - 10https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [09:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [09:23:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:23:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:23:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:25:31] (03PS12) 10Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [09:25:31] (03PS44) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [09:25:31] (03PS4) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [09:26:17] (03PS5) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) [09:26:56] (03CR) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [09:27:41] (03CR) 10Hnowlan: [C:03+1] mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [09:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:30:14] (03CR) 10Vgutierrez: [C:03+2] "thanks @denisse && @sukhe for taking care of the CI issues" [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [09:30:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_jobrunner.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:31:03] (03PS1) 10Brouberol: airflow-platform-eng: grant task pods egress permissions to gitlab [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135388 (https://phabricator.wikimedia.org/T386675) [09:32:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:32:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [09:32:45] (03PS2) 10Federico Ceratto: hiera: Add zarcillo service to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135382 (https://phabricator.wikimedia.org/T384212) [09:32:45] (03CR) 10Federico Ceratto: "Initial CR as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1135382 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:32:48] (03Merged) 10jenkins-bot: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [09:33:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:33:11] (03PS1) 10Federico Ceratto: hiera: Add zarcillo k8s service on traffic server [puppet] - 10https://gerrit.wikimedia.org/r/1135387 
(https://phabricator.wikimedia.org/T384212) [09:33:11] (03CR) 10Federico Ceratto: "Initial CR as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:33:13] (03CR) 10Btullis: airflow: scrape additional metrics (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [09:34:12] (03CR) 10Brouberol: airflow: scrape additional metrics (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [09:34:14] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:34:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:43] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [09:35:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_jobrunner.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:36:03] (03PS1) 10Hnowlan: site: remove last jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1135389 (https://phabricator.wikimedia.org/T383226) [09:36:03] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker2142.codfw.wmnet [09:36:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10724965 (10ops-monitoring-bot) check host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: Hardware failure [09:36:24] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) check for host wikikube-worker2142.codfw.wmnet [09:36:42] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:36:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:37:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:37:11] !log cgoubert@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2142.codfw.wmnet with reason: Hardware failure [09:37:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10724973 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b0849406-4915-4da3-8220-1f360a73f331) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1... [09:38:18] (03CR) 10KCVelaga: [C:03+1] airflow-platform-eng: grant task pods egress permissions to gitlab [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135388 (https://phabricator.wikimedia.org/T386675) (owner: 10Brouberol) [09:38:30] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:38:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:38:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:39:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:33] I'm restarting all httpbb tests [09:39:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10724975 (10phaultfinder) [09:40:21] (03PS2) 10Brouberol: airflow-analytics-product: grant task pods egress permissions to gitlab [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135388 (https://phabricator.wikimedia.org/T386675) [09:40:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_jobrunner.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:40:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:40:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:41:22] PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:41:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:41:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [09:41:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:42:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:42:39] (03CR) 10KCVelaga: [C:03+1] airflow-analytics-product: grant task pods egress permissions to gitlab [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135388 (https://phabricator.wikimedia.org/T386675) (owner: 10Brouberol) [09:43:01] (03PS6) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 
(https://phabricator.wikimedia.org/T385709) [09:43:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:44:02] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [09:44:25] RESOLVED: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:44:26] (03CR) 10Alexandros Kosiaris: "A quick look at turnilo for the last 30 days returns just 4 results, some from public clouds. However, turnilo is sampled at 1/128, so the" [puppet] - 10https://gerrit.wikimedia.org/r/1130096 (https://phabricator.wikimedia.org/T307965) (owner: 10Aklapper) [09:45:08] (03CR) 10Ladsgroup: [C:03+1] Remove x2, add ms{1,2,3} to profile::mariadb::section_ports: [puppet] - 10https://gerrit.wikimedia.org/r/1135385 (https://phabricator.wikimedia.org/T387332) (owner: 10Alexandros Kosiaris) [09:45:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [09:46:21] (03CR) 10Clément Goubert: [C:03+1] hiera: Add zarcillo service to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135382 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [09:48:30] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 105 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 105, delayed_unassigned_shards: 0, number_of_pendin [09:48:30] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 0.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:48:30] PROBLEM - OpenSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 105 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 105, delayed_unassigned_shards: 0, number_of_pendin [09:48:30] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 0.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:49:06] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f82d268c1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [09:49:06] org/wiki/Search%23Administration [09:49:06] PROBLEM - OpenSearch health check for shards on 9200 on relforge1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while 
fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f599ddcb1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [09:49:06] org/wiki/Search%23Administration [09:49:30] (03CR) 10Clément Goubert: [C:03+1] "Needs a DNS patch for CNAME to `k8s-ingress-wikikube.svc.eqiad.wmnet.` as well as the `deployment-charts` namespace patch to be functional" [puppet] - 10https://gerrit.wikimedia.org/r/1135382 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:13] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:18] (03CR) 10Clément Goubert: [C:03+1] site: remove last jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1135389 (https://phabricator.wikimedia.org/T383226) (owner: 10Hnowlan) [09:54:43] (03CR) 10Brouberol: [C:03+2] airflow-analytics-product: grant task pods egress permissions to gitlab [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135388 (https://phabricator.wikimedia.org/T386675) (owner: 10Brouberol) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:03] (03PS1) 10Volans: reports: fix librenms with newer APIs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135391 [09:58:40] (03PS1) 10Ladsgroup: Bump thumbnail steps to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135392 (https://phabricator.wikimedia.org/T360589) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1000) [10:00:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:01:24] (03PS1) 10Phedenskog: perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) [10:01:47] jouncebot: nowandnext [10:01:48] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1000) [10:01:48] In 0 hour(s) and 58 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1100) [10:02:07] it doesn't look like people are deploying in the infra [10:02:13] FIRING: [5x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:19] (03CR) 10Volans: "Run on netbox-next (where some mismatch are due to stale data in netbox-next) 
available here:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135391 (owner: 10Volans) [10:02:30] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135392 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:02:46] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135392 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:03:19] (03CR) 10Volans: [C:03+2] capirca: optimization refactor [software/homer] - 10https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [10:03:21] (03Merged) 10jenkins-bot: Bump thumbnail steps to 80% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135392 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:03:28] (03CR) 10Volans: [C:03+2] homer: move NetboxData initialization [software/homer] - 10https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [10:04:19] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135392|Bump thumbnail steps to 80% (T360589)]] [10:04:23] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:05:36] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [10:05:43] (03CR) 10Ayounsi: [C:03+1] "thx!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135391 (owner: 10Volans) [10:05:46] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [10:06:00] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1008 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:06:49] (03PS3) 10Effie Mouzeli: logging: add support for php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) [10:07:28] (03CR) 10Effie Mouzeli: logging: add support for php 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [10:07:33] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [10:07:47] (03PS2) 10Effie Mouzeli: switch mwdebug2002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 [10:08:28] (03CR) 10Marostegui: [C:03+1] "Thank you Alex" [puppet] - 10https://gerrit.wikimedia.org/r/1135385 (https://phabricator.wikimedia.org/T387332) (owner: 10Alexandros Kosiaris) [10:08:33] (03PS1) 10Hnowlan: services_proxy: remove videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/1135395 (https://phabricator.wikimedia.org/T354791) [10:08:48] (03PS1) 10Elukey: services: fix citoid ingress config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135396 [10:10:56] (03CR) 10Ladsgroup: [C:03+1] "For when the policy change is approved and communicated, I think it's good to go." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [10:11:25] (03CR) 10Clément Goubert: [C:03+1] services_proxy: remove videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/1135395 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:11:52] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [10:12:00] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1135392|Bump thumbnail steps to 80% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:12:02] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:12:04] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:12:10] (03CR) 10Hnowlan: [C:03+2] services_proxy: remove videoscaler service [puppet] - 10https://gerrit.wikimedia.org/r/1135395 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:12:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [10:13:34] RECOVERY - BGP status on lsw1-c2-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:13:40] (03PS7) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) [10:13:42] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [10:13:46] (03CR) 10Elukey: [C:03+2] services: fix citoid ingress config for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135396 (owner: 10Elukey) [10:13:53] (03CR) 10Volans: [C:03+2] reports: fix librenms with newer APIs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135391 (owner: 10Volans) [10:14:02] 
FIRING: [6x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:20] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Allow to discover/test in more isolation spicerack features - https://phabricator.wikimedia.org/T389329#10725273 (10Volans) 05In progress→03Resolved spicerack-shell has been merged and deployed, related documentation is available at https://wiki... [10:15:14] (03Merged) 10jenkins-bot: capirca: optimization refactor [software/homer] - 10https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [10:15:16] (03Merged) 10jenkins-bot: homer: move NetboxData initialization [software/homer] - 10https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [10:15:45] (03Merged) 10jenkins-bot: reports: fix librenms with newer APIs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1135391 (owner: 10Volans) [10:17:13] FIRING: [7x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [10:18:19] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:18:31] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) 
rolling restart_daemons on A:netbox-canary [10:18:36] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:18:39] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135392|Bump thumbnail steps to 80% (T360589)]] (duration: 14m 19s) [10:18:42] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:19:05] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:20:04] (03PS8) 10Clément Goubert: mwcron: Allow setting ttlsecondsafterfinished [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) [10:20:07] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [10:22:57] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/citoid: sync [10:23:00] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: sync [10:24:24] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135399 [10:26:01] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10725338 (10ayounsi) For sure that's an odd one... Maybe we could try with a different port. For OSPF, +1 to do it for the troubleshooting window. D... [10:30:00] (03CR) 10Hnowlan: [C:03+2] site: remove last jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1135389 (https://phabricator.wikimedia.org/T383226) (owner: 10Hnowlan) [10:32:39] (03CR) 10Ayounsi: [C:03+1] "lgtm! One question inline." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [10:33:15] (03PS1) 10Elukey: modules: comment out gatewayHosts->domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) [10:37:47] !log hnowlan@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2278-2279].codfw.wmnet [10:41:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[1349-1351].eqiad.wmnet [10:41:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:42:31] (03CR) 10Cathal Mooney: "Thanks for the review, answered the question in line." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [10:42:38] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135399 (owner: 10PipelineBot) [10:42:53] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:43:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:43:32] ^ that's me, will clean up that alert [10:44:27] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135399 (owner: 10PipelineBot) [10:45:38] (03CR) 10Ayounsi: [C:03+1] Cloudsw: adjust routing-policies to reflect change to IBGP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [10:45:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [10:48:15] FIRING: [2x] AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:49:26] (03PS1) 10Hnowlan: mediawiki: remove alerting for metal mediawiki instances [alerts] - 
10https://gerrit.wikimedia.org/r/1135405 (https://phabricator.wikimedia.org/T354791) [10:49:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: 10Anzx) [10:49:45] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2278-2279].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [10:50:18] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2278-2279].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [10:50:18] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:19] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[2278-2279].codfw.wmnet [10:50:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:51:00] 06SRE, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review, 10Wikidata Integration in Wikimedia projects (Kanban Board): Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10725404 (10seanleong-WMDE) 05Open→03Resolved [10:51:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 (owner: 10Anzx) [10:53:33] (03CR) 10Effie Mouzeli: 
[C:03+1] mediawiki: remove alerting for metal mediawiki instances [alerts] - 10https://gerrit.wikimedia.org/r/1135405 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:53:58] !log hnowlan@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw2278.codfw.wmnet [10:54:51] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:59:03] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1349-1351].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [10:59:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1349-1351].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [10:59:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:59:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1349-1351].eqiad.wmnet [10:59:12] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:59:52] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [11:00:00] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [11:00:05] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1100). 
[11:01:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:47] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2278.codfw.wmnet [11:04:56] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug1002.eqiad.wmnet [11:05:03] (03PS2) 10Volans: log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 [11:05:03] (03CR) 10Volans: "Now that wmflib has been upgraded this is ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans) [11:05:10] jouncebot: next [11:05:10] In 0 hour(s) and 54 minute(s): Special: Cite Parsoid CSS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1200) [11:05:14] jouncebot: now [11:05:15] For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1100) [11:05:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [11:06:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [11:07:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [11:07:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [11:10:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw1407.eqiad.wmnet [11:10:43] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10725471 (10Ladsgroup) >>! In T355914#10723774, @Jdforrester-WMF wrote: >>>! In T355914#10717142, @Ladsgroup wrote: >> It'd be nice to add this to next we... 
[11:11:29] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135029 (owner: 10PipelineBot) [11:14:02] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [11:14:05] (03PS4) 10Effie Mouzeli: logging: add support for php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) [11:14:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [11:15:32] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [11:16:33] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [11:17:17] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [11:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [11:17:52] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [11:18:29] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [11:19:11] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [11:19:57] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [11:20:25] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [11:20:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:21:28] !log hnowlan@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw1407.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [11:22:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw1407.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [11:22:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:22:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1407.eqiad.wmnet [11:22:48] (03PS2) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135029 [11:22:59] (03CR) 10Mvolz: [V:03+2 C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135029 (owner: 10PipelineBot) [11:23:15] RESOLVED: [2x] AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:23:48] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 06serviceops, 13Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10725490 (10hnowlan) [11:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10725494 (10phaultfinder) [11:27:38] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw2278, mw2279 - https://phabricator.wikimedia.org/T391001#10725497 (10hnowlan) [11:28:50] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:29:25] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:30:03] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply 
[11:32:15] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:35:17] (03CR) 10Hnowlan: [C:03+2] mediawiki: remove alerting for metal mediawiki instances [alerts] - 10https://gerrit.wikimedia.org/r/1135405 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:37:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:37:51] (03Merged) 10jenkins-bot: mediawiki: remove alerting for metal mediawiki instances [alerts] - 10https://gerrit.wikimedia.org/r/1135405 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:38:20] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:41:08] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:41:35] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:38] PROBLEM - MariaDB memory on db2220 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1575) = 93.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:42:47] (03PS1) 10Phedenskog: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) [11:44:40] (03CR) 10Effie Mouzeli: [C:03+2] logging: add support for php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [11:45:11] (03PS3) 10Phedenskog: perf/navtiming: Add FCP alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) [11:47:20] (03PS2) 10Phedenskog: perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) [11:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [11:48:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [11:49:13] (03PS1) 10Slyngshede: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 [11:50:26] (03PS1) 10Jforrester: Switch out various old PHP aliases to the current class names [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135410 [11:53:38] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [11:57:44] (03PS1) 10Jelto: Support multiple helm versions [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135411 (https://phabricator.wikimedia.org/T341984) [11:57:48] (03PS1) 10Jelto: make helm3 alternative entry dependent on helm [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) [11:58:38] PROBLEM - MariaDB memory on db2220 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1575) = 93.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:59:00] (03PS2) 10Slyngshede: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 [11:59:45] (03PS2) 10Jelto: make helm3 alternative entry dependent on helm [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) [11:59:51] (03PS2) 10Peter Fischer: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) [12:00:04] (03PS1) 10Kamila Součková: alertmanager: add route for task-severity data-persistence alerts [puppet] - 10https://gerrit.wikimedia.org/r/1135413 (https://phabricator.wikimedia.org/T385709) [12:00:05] awight: OwO what's this, a deployment window?? Special: Cite Parsoid CSS. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1200). 
nyaa~ [12:00:44] (03PS4) 10Phedenskog: perf/navtiming: Add FCP alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) [12:00:53] (03CR) 10Peter Fischer: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [12:01:34] (03CR) 10CI reject: [V:04-1] Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [12:02:48] (03PS3) 10Jelto: make helm3 alternative entry dependent on helm [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) [12:03:17] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [12:03:25] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [12:03:28] (03Abandoned) 10Jelto: Support multiple helm versions [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135411 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [12:03:29] (03PS3) 10Slyngshede: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 [12:05:52] (03CR) 10Jelto: "Do you think it makes sense to use the `helm311` branch to update the old `helm311` package with the fixed alternatives? 
This change shoul" [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [12:06:37] !log prepping mwdebug1002 for reimage [12:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [12:09:34] (03CR) 10Filippo Giunchedi: "Mostly minor/nit, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:09:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [12:10:00] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) - https://phabricator.wikimedia.org/T391465 (10LSobanski) 03NEW [12:10:21] (03PS1) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [12:10:58] !log mwdebug1002 has been depooled and removed from scap dsh [12:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:14] (03PS2) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [12:11:26] (03CR) 10Filippo Giunchedi: "Please mention in the commit message that 'sentry3' will no longer be scraped" [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:11:27] (03PS3) 10Effie Mouzeli: switch mwdebug1002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (https://phabricator.wikimedia.org/T391452) [12:13:10] (03PS1) 10Brouberol: Don't fail a dump task when a job do not apply to a given wiki [dumps] - 10https://gerrit.wikimedia.org/r/1135417 
(https://phabricator.wikimedia.org/T391466) [12:15:52] (03PS1) 10Kamila Součková: alertmanager: Route 3 teams' task-severity alerts to Phab [puppet] - 10https://gerrit.wikimedia.org/r/1135418 (https://phabricator.wikimedia.org/T385709) [12:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [12:20:09] (03PS4) 10Effie Mouzeli: switch mwdebug1002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (https://phabricator.wikimedia.org/T391452) [12:21:41] (03CR) 10Kamila Součková: [C:03+1] switch mwdebug1002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [12:21:50] (03CR) 10Effie Mouzeli: switch mwdebug1002 to php8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [12:21:58] (03CR) 10Effie Mouzeli: [C:03+2] switch mwdebug1002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (https://phabricator.wikimedia.org/T391452) (owner: 10Effie Mouzeli) [12:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:25:33] (03PS4) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 
(https://phabricator.wikimedia.org/T391332) [12:25:33] (03PS1) 10Brouberol: airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) [12:25:41] (03CR) 10CI reject: [V:04-1] airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [12:25:44] (03CR) 10CI reject: [V:04-1] airflow: increase pool metrics computation frequency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135419 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [12:25:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 0.7353% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:26:33] !incidents [12:26:33] 6028 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [12:26:33] 6026 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [12:26:38] !ack 6028 [12:26:38] 6028 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [12:26:42] hello [12:26:56] thanks for ACKing tappof! 
[12:26:58] db-related I think [12:27:04] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-30m&to=now connections spiking quite a bit [12:27:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:27:19] currently getting a lot of DBUnexpectedErrors visiting pages on metawiki, e.g. [2c19495c-9f6c-4a79-bb3d-e5a31db132fd]. potentially related to the above, reporting in case unrelated [12:27:43] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mwdebug1002.eqiad.wmnet with OS bullseye [12:27:53] A_smart_kitten: probably related [12:27:58] yeah [12:28:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:29:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 19.47s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:29:24] (03CR) 10Cathal Mooney: commit: allow to approve/reject diffs globally (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [12:29:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:30:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:31:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:31:27] connections back down [12:31:42] yeah [12:32:07] also metawiki seems to be responding correctly now [12:32:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:32:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:32:31] yeah errors are down. 
what happened here though remains to be seen :) [12:32:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:33:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:34:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 4.517s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:35:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:36:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:37:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:42:13] FIRING: [7x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:13] (03CR) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [12:43:30] (03PS2) 10Kamila Součková: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 [12:43:33] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug1002.eqiad.wmnet with reason: host reimage [12:44:02] FIRING: [7x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:08] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [12:44:51] (03PS4) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [12:45:26] (03CR) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [12:46:00] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1008 is OK: OK: Status of the systemd unit 
push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:47:05] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug1002.eqiad.wmnet with reason: host reimage [12:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [12:51:12] (03PS3) 10Kamila Součková: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 [12:53:50] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [12:56:38] (03CR) 10Elukey: "Left a comment, can you add a bit more info in the commit msg about what changed between the versions?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [12:56:55] jouncebot: now [12:56:55] For the next 0 hour(s) and 3 minute(s): Special: Cite Parsoid CSS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1200) [12:57:00] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1008 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:57:03] (03PS6) 10Volans: commit: allow to approve/reject diffs globally [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) [12:57:03] (03PS6) 10Volans: doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 [12:57:13] FIRING: [7x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:20] (03CR) 10Volans: "addressed comment" [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [12:59:38] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [12:59:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1300). [13:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] o/ [13:00:41] !log special window completed [13:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:29] (PS1) Jforrester: wikifunctions: Update evaluators from 2025-04-02-130409 to 2025-04-08-183717 [deployment-charts] - https://gerrit.wikimedia.org/r/1135423 (https://phabricator.wikimedia.org/T387359) [13:01:33] (PS1) Jforrester: wikifunctions: Update orchestrator from 2025-04-02-124609 to 2025-04-08-183631 [deployment-charts] - https://gerrit.wikimedia.org/r/1135424 (https://phabricator.wikimedia.org/T386312) [13:01:33] Lucas_WMDE urbanecm TheresNoTime mwdebug1002 is being reimaged and is depooled, I do not expect scap to cause any trouble [13:03:17] ack [13:03:20] (PS1) Btullis: Add a dummy password for rsyncing mediawiki-dumps-legacy [labs/private] - https://gerrit.wikimedia.org/r/1135425 (https://phabricator.wikimedia.org/T390738) [13:03:30] anzx: I can deploy, just gimme a sec :) [13:03:35] ok [13:04:07] (CR) Btullis: [V:+2 C:+2] Add a dummy password for rsyncing mediawiki-dumps-legacy [labs/private] - https://gerrit.wikimedia.org/r/1135425 (https://phabricator.wikimedia.org/T390738) (owner: Btullis) [13:04:09] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: ProbeDown (instance ripe-atlas-codfw:0) - https://phabricator.wikimedia.org/T390676#10725943 (tappof) You're free to close this task, since all the checks have been migrated to the corresponding HTTP version, as per {T388419}, and no...
[13:04:17] o/ [13:04:38] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:04:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:04:41] (03PS3) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [13:04:41] (03CR) 10Arnaudb: "this is a first take at implementing what has been written in https://wikitech.wikimedia.org/wiki/Gerrit/Operations#Switch_over" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:05:55] anzx: okay, ready — will deploy them both together :) [13:06:11] ok [13:06:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: 10Anzx) [13:06:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 (owner: 10Anzx) [13:07:00] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1008 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:07:06] (03Merged) 10jenkins-bot: madwiktionary: add logo, icon, wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: 10Anzx) [13:07:10] (03Merged) 10jenkins-bot: arywiki: enable 
wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 (owner: 10Anzx) [13:07:13] FIRING: [7x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:25] (03PS1) 10Elukey: services: point rest-gateway in staging to the ingress citoid endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) [13:07:37] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1135035|madwiktionary: add logo, icon, wordmark and tagline (T391318)]], [[gerrit:1135036|arywiki: enable wgMinervaEnableSiteNotice]] [13:07:40] T391318: Change project logo in Wiktionary Madurese - https://phabricator.wikimedia.org/T391318 [13:10:39] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: ProbeDown (instance ripe-atlas-codfw:0) - https://phabricator.wikimedia.org/T390676#10725974 (10ayounsi) 05Open→03Resolved a:03ayounsi [13:11:47] (03PS2) 10Elukey: services: point rest-gateway to the ingress citoid endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) [13:12:13] FIRING: [6x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:31] !log samtar@deploy1003 samtar, anzx: Backport for [[gerrit:1135035|madwiktionary: add logo, icon, wordmark and tagline (T391318)]], [[gerrit:1135036|arywiki: enable wgMinervaEnableSiteNotice]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:33] TheresNoTime: checking [13:15:34] T391318: Change project logo in Wiktionary Madurese - https://phabricator.wikimedia.org/T391318 [13:15:39] ack [13:16:34] (03PS1) 10DCausse: opensearch: allow setting LD_LIBRARY_PATH [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) [13:16:46] TheresNoTime: both logo and minerva sitenotice change looks good [13:17:00] !log samtar@deploy1003 samtar, anzx: Continuing with sync [13:18:53] (03PS1) 10Giuseppe Lavagetto: haproxy: enable requestctl rules everywhere [puppet] - 10https://gerrit.wikimedia.org/r/1135431 [13:19:03] (03CR) 10Elukey: "Hugh I checked the diff in https://integration.wikimedia.org/ci/job/helm-lint/24289/console and somehow I expected to see citoid.k8s-ingre" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:19:12] (03PS2) 10DCausse: opensearch: allow setting LD_LIBRARY_PATH [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) [13:20:14] (03CR) 10Elukey: "I see that setting internal_host should solve, but I didn't get why the analytics services don't need it.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:20:46] (03CR) 10Alexandros Kosiaris: [C:04-1] "As the comment above says, those are pretty standard for our environment, I don't see why commenting them out makes sense. It would requir" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:20:48] (03CR) 10Fabfur: "[question] why not in the profile instead of splitting between text|upload?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135431 (owner: 10Giuseppe Lavagetto) [13:22:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [13:23:50] (03PS1) 10Elukey: Add citoid CNAMEs for the Istio ingress [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) [13:23:52] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135035|madwiktionary: add logo, icon, wordmark and tagline (T391318)]], [[gerrit:1135036|arywiki: enable wgMinervaEnableSiteNotice]] (duration: 16m 14s) [13:23:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwdebug1002.eqiad.wmnet with OS bullseye [13:23:55] T391318: Change project logo in Wiktionary Madurese - https://phabricator.wikimedia.org/T391318 [13:24:28] (03CR) 10CI reject: [V:04-1] Add citoid CNAMEs for the Istio ingress [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:24:30] anzx: done, and images purged :) [13:24:52] TheresNoTime: thank you for deploying [13:25:36] (03PS3) 10Elukey: services: point rest-gateway to the ingress citoid endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) [13:26:33] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391476 (10phaultfinder) 03NEW [13:28:00] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10726042 (10fnegri) 05In progress→03Stalled [13:28:39] (03CR) 10Ssingh: Add citoid CNAMEs for the 
Istio ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:29:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:29:56] (03PS3) 10DCausse: opensearch: allow setting LD_LIBRARY_PATH [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) [13:30:25] (03CR) 10Filippo Giunchedi: [C:03+1] Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:30:49] (03CR) 10Elukey: Add citoid CNAMEs for the Istio ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:31:18] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135430 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [13:32:25] RESOLVED: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:34:08] go tappof [13:34:10] lolz [13:34:31] go godog [13:34:47] (03Abandoned) 10Brouberol: Don't fail a dump task when a job do not apply to a given wiki [dumps] - 10https://gerrit.wikimedia.org/r/1135417 (https://phabricator.wikimedia.org/T391466) (owner: 10Brouberol) [13:35:02] (03PS1) 10Clément Goubert: python-webapp: Update modules [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) [13:35:02] (03CR) 10Clément Goubert: "Update for Ie8cdab5047f57258a1703a1cdb18ce495514c521 (zarcillo deployment) that will need `base.networkpolicy.egress.mariadb`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert) [13:35:05] haha! [13:35:19] (03PS14) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [13:35:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:36:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:36:36] (03PS1) 10DCausse: cirrussearch: enable knn native lib [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) [13:37:13] (03CR) 10Elukey: "Right, I am totally stupid, of course citoid.discovery.wmnet will be a CNAME." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [13:37:53] (03CR) 10Tiziano Fogli: [C:03+2] perf/navtiming: Add FCP alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [13:38:03] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2088.codfw.wmnet with OS bullseye [13:38:07] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2088 [13:38:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2088 [13:38:48] (03PS1) 10Btullis: Rename mediawiki-dumps-legacy rsync password [labs/private] - 10https://gerrit.wikimedia.org/r/1135442 (https://phabricator.wikimedia.org/T390738) [13:39:12] (03PS2) 10Elukey: Add citoid-ingress CNAMEs for the Istio ingress [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) [13:42:02] (03CR) 10Btullis: [V:03+2 C:03+2] Rename mediawiki-dumps-legacy rsync password [labs/private] - 10https://gerrit.wikimedia.org/r/1135442 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:42:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:42:33] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [13:42:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:42:39] (03CR) 10Btullis: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [13:43:24] (03Merged) 10jenkins-bot: perf/navtiming: Add FCP alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135326 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [13:45:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:45:37] (03PS3) 10Tiziano Fogli: perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [13:45:44] (03CR) 10CI reject: [V:04-1] perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [13:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [13:52:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:29] (03PS3) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [13:57:13] FIRING: [5x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:40] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1004 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:59:42] (03PS4) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1400) [14:00:08] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10726166 (10VRiley-WMF) 05Open→03Resolved This drive has been replaced [14:00:45] (03PS1) 10AOkoth: site: revert releases to production role [puppet] - 10https://gerrit.wikimedia.org/r/1135444 (https://phabricator.wikimedia.org/T384595) [14:01:09] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-04-02-130409 to 2025-04-08-183717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135423 (https://phabricator.wikimedia.org/T387359) (owner: 10Jforrester) [14:02:11] (03CR) 10Jforrester: [C:03+2] Move to new async Parsoid fragment provision [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135084 (https://phabricator.wikimedia.org/T373253) (owner: 10Jforrester) [14:02:13] (03CR) 10Jforrester: [C:03+2] Switch out various old PHP aliases to the current class names [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135410 (owner: 10Jforrester) [14:02:13] FIRING: [5x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:02:36] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-04-02-130409 to 2025-04-08-183717 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135423 (https://phabricator.wikimedia.org/T387359) (owner: 10Jforrester) [14:03:19] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:03:55] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:07] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:04:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10726204 (10Marostegui) Thank you, the disk is rebuilding ` 0 0 6 64:6 10 DRIVE Rbld Y 1.745 TB dflt N N dflt - N ` [14:04:49] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:04:51] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:31] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:05:38] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-04-02-124609 to 2025-04-08-183631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135424 (https://phabricator.wikimedia.org/T386312) (owner: 10Jforrester) [14:05:44] (03CR) 10Alexandros Kosiaris: [C:03+2] "Rereading the task, let's just be bold. 
There is apparently very little use of this, with the requests I reviewed having very dubious User" [puppet] - 10https://gerrit.wikimedia.org/r/1130096 (https://phabricator.wikimedia.org/T307965) (owner: 10Aklapper) [14:06:07] (03CR) 10Elukey: Add citoid-ingress CNAMEs for the Istio ingress (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:06:58] (03Merged) 10jenkins-bot: Move to new async Parsoid fragment provision [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135084 (https://phabricator.wikimedia.org/T373253) (owner: 10Jforrester) [14:07:01] (03Merged) 10jenkins-bot: Switch out various old PHP aliases to the current class names [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135410 (owner: 10Jforrester) [14:07:05] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-04-02-124609 to 2025-04-08-183631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135424 (https://phabricator.wikimedia.org/T386312) (owner: 10Jforrester) [14:07:33] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:07:34] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:07:40] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1004 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:07:40] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:08:08] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:08:29] (03PS2) 10Jelto: gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) [14:08:30] !log jforrester@deploy1003 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [14:08:59] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:09:00] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:09:37] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:10:23] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726250 (10VRiley-WMF) You're correct. When these were orginally set up, I had plugged 1 Gig cables into them. However, those have been removed. Let us know if this takes care of... [14:10:26] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726251 (10VRiley-WMF) You're correct. When these were orginally set up, I had plugged 1 Gig cables into them. However, those have been removed. Let us know if this takes care of... [14:10:33] (03CR) 10CI reject: [V:04-1] gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:10:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:10:47] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726252 (10VRiley-WMF) You're correct. When these were orginally set up, I had plugged 1 Gig cables into them. However, those have been removed. Let us know if this takes care of... 
[14:11:06] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] [14:11:11] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:11:11] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:11:30] (03CR) 10Elukey: "@akosiaris@wikimedia.org they are automatically generated by the template as fallback option, see:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:11:50] (03PS1) 10Majavah: Add .tox to gitignore [software/bitu] - 10https://gerrit.wikimedia.org/r/1135448 [14:12:09] (03CR) 10Federico Ceratto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:12:13] FIRING: [5x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:11] (03CR) 10Elukey: [C:03+1] python-webapp: Update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert) [14:14:53] (03PS5) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [14:14:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:03] hello [14:15:07] !incidents [14:15:07] 6029 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:15:07] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:15:08] 6026 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [14:15:08] !incidents [14:15:08] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726284 (10ayounsi) 05Open→03Resolved yep all good, thx! [14:15:08] 6029 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:15:08] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:15:09] 6026 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [14:15:10] !ack 6029 [14:15:11] 6029 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [14:15:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 7.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:15:16] it's the same thing again I am guessing [14:15:28] apparently, yes [14:15:38] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance 
cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:16:19] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726291 (10VRiley-WMF) You're correct. When these were orginally set up, I had plugged 1 Gig cables into them. However, those have been removed. Let us know if this takes car... [14:16:27] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10726292 (10VRiley-WMF) You're correct. When these were orginally set up, I had plugged 1 Gig cables into them. However, those have been removed. Let us know if this takes car... [14:17:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 13.77s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [14:18:21] (03PS4) 10Elukey: services: point rest-gateway to the ingress citoid endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) [14:18:21] (03PS1) 10Elukey: services: add extra fqdn to the citoid's ingress config [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1135449 (https://phabricator.wikimedia.org/T391457) [14:18:40] (03CR) 10Elukey: "Ok it should work now :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:18:54] !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] (duration: 07m 47s) [14:18:58] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:18:59] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:19:32] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] [14:19:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:20:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:20:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.96% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:21:33] (03PS1) 10Andrew Bogott: heat and magnum: scale up DB connections, increase timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1135451 [14:21:33] 
(03PS1) 10Andrew Bogott: magnum.conf: remove a bunch of marked-out config options [puppet] - 10https://gerrit.wikimedia.org/r/1135452 [14:22:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 3.313s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:22:31] (03PS2) 10Andrew Bogott: heat and magnum: scale up DB connections, increase timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1135451 [14:22:31] (03PS2) 10Andrew Bogott: magnum.conf: remove a bunch of marked-out config options [puppet] - 10https://gerrit.wikimedia.org/r/1135452 [14:22:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135451 (owner: 10Andrew Bogott) [14:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10726333 (10phaultfinder) [14:25:45] (03CR) 10Andrew Bogott: [C:03+2] heat and magnum: scale up DB connections, increase timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1135451 (owner: 10Andrew Bogott) [14:26:34] (03PS5) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [14:26:39] (03CR) 10Xcollazo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [14:26:49] (03CR) 10Ssingh: [C:03+1] Add citoid-ingress CNAMEs for the Istio ingress [dns] - 10https://gerrit.wikimedia.org/r/1135433 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:28:20] !log jforrester@deploy1003 sync-world aborted: 
Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] (duration: 08m 48s) [14:28:24] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:28:25] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:28:52] (03PS1) 10Ladsgroup: Increase max db connection count before circuit breaking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135454 (https://phabricator.wikimedia.org/T390510) [14:28:58] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] [14:29:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2088.codfw.wmnet with OS bullseye [14:30:17] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2088.codfw.wmnet'] [14:30:39] (03PS6) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [14:31:19] Hmm, scap seems stuck at the `docker-pusher` step; I restarted and it got stuck again. 
[14:32:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [14:32:13] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:32:18] James_F: o/ might be another occurrence of https://phabricator.wikimedia.org/T390251#10716525 [14:32:27] lemme check [14:32:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10726404 (10Vgutierrez) >>! In T387145#10720903, @cmooney wrote: > (IMPORTANT) The obvious complication there is that lvs1016 has insufficient 10G ports to connect to everything that lvs1... [14:32:40] elukey: Aha, yes, looks likely. [14:33:17] let me know once you're done [14:33:27] mediawiki-publish-81 pushed fine, but mediawiki-publish seems to be stuck. [14:33:27] yeah the deploy1003's local daemon is stating the same errors sigh [14:33:37] elukey: Should I abort and retry? [14:33:44] Or give up? [14:33:44] yes please [14:33:50] !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] (duration: 04m 52s) [14:33:54] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:33:55] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:33:58] sorry, too many options while I was writing :) [14:34:02] Ha. [14:34:05] retry if it is possible yes [14:34:15] Sure. [14:34:18] Amir1: Sorry for slowness. 
[14:34:33] (03PS7) 10Btullis: Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) [14:34:35] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] [14:34:44] don't worry. I don't think another spike would happen right now [14:34:54] Jinx. [14:35:00] xD [14:35:31] elukey: Back on mediawiki-publish being stuck, sadly. [14:35:44] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5240/co" [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [14:35:53] If I merge another patch maybe it'd clear out the local blobs? [14:35:54] (03CR) 10Hnowlan: [C:03+1] php: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135379 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [14:36:13] (03PS3) 10Jelto: gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) [14:36:20] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cirrussearch2088.codfw.wmnet'] [14:36:23] (03CR) 10Hnowlan: [C:03+1] MWScript.php: exit code on mesh, longer timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) (owner: 10Clément Goubert) [14:36:44] my impression is that this is an issue with scap, docker returns an error and gives up but scap remains hanging [14:36:56] * dancy eyes [14:36:57] I could merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1126659 as well (no-op, just 
getting config ready). [14:37:02] hey dancy :) [14:37:09] heh dancy I was about to ping you for eyes on this :D [14:37:49] Scap calls docker-pusher and waits for it to finish. [14:38:03] docker-pusher has not finished. Scap can't do anything about that. [14:38:09] And docker-pusher isn't erroring it just gets stuck. [14:38:14] okok, what is docker-pusher btw? [14:38:15] (03CR) 10CI reject: [V:04-1] gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:38:23] (03PS4) 10Jforrester: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 [14:38:26] because docker itself gave up, at least from the deploy1003 logs [14:38:37] docker-pusher is a shell script which calls `docker -c push` [14:39:05] `/usr/bin/docker --config /etc/docker-pusher push -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-04-09-143456-publish-81` is the hanging operation. [14:39:41] this started happening after https://gerrit.wikimedia.org/r/1133740 [14:39:53] so I am wondering if the serialization makes docker-pusher confused/hanging [14:40:32] probably at this point we should the hanging docker-pusher process on deploy1003, and let James retry [14:40:38] what do you think? [14:40:53] nod.. control-c and try again. [14:41:01] !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]] (duration: 06m 26s) [14:41:05] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:41:05] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:41:10] Ctrl-C again. 
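The hang described above (scap waits on docker-pusher, and docker-pusher never exits) could be bounded with a deadline around the child process. What follows is a minimal sketch, not scap's actual behaviour: `run_with_deadline` is a hypothetical helper, and the deadline and grace values are assumptions. It mirrors the manual procedure from the discussion by sending SIGINT first, as Ctrl-C does, and only killing the child if it still does not exit. It assumes a POSIX host.

```python
import signal
import subprocess
import sys


def run_with_deadline(cmd, deadline_s, grace_s=5.0):
    """Run cmd and return its exit code, or None if it had to be interrupted.

    Hypothetical helper for illustration; not part of scap or docker-pusher.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the child finishes before the deadline.
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Same signal a Ctrl-C on the terminal would deliver.
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # The child ignored SIGINT; force it down and reap it.
            proc.kill()
            proc.wait()
        return None


if __name__ == "__main__":
    # Stand-in command; a real caller would pass the push command instead.
    code = run_with_deadline([sys.executable, "-c", "print('pushed')"], 30)
    print("exit code:", code)
```

Returning None rather than an exit code lets the caller tell "interrupted after the deadline" apart from "failed on its own", which is the distinction that was hard to see from the scap output here.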
[14:41:21] (03CR) 10Scott French: [C:03+1] php: mwscript bugfix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135379 (https://phabricator.wikimedia.org/T387208) (owner: 10Clément Goubert) [14:41:22] all right no more processes hanging around [14:41:23] The version of docker used on the deploy server is quite old [14:41:33] I'm going to add in a third patch to see if that magically works. [14:41:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 (owner: 10Jforrester) [14:41:38] ack [14:41:47] (Different local image etc.) [14:41:48] (03PS4) 10Tiziano Fogli: perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [14:42:09] But yes, Ctrl-C on scap successfully passes SIGINT down to docker-pusher [14:42:25] (03Merged) 10jenkins-bot: Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126659 (owner: 10Jforrester) [14:42:50] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]], [[gerrit:1126659|Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls]] [14:42:53] (03CR) 10Scott French: [C:03+1] MWScript.php: exit code on mesh, longer timeout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133935 (https://phabricator.wikimedia.org/T390972) (owner: 10Clément Goubert) [14:43:40] Aha, docker-pusher worked this time. [14:44:06] checking what went wrong on the registry hosts [14:44:18] (03CR) 10Hnowlan: [C:03+1] "Looks good to me! Please test in staging first. 
Might be worth giving mvolz a heads-up that we're doing this also" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [14:44:21] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489 (10ayounsi) 03NEW [14:44:28] Hmm, now it seems paused? Last message "14:43:23 [webserver-webserver] Image build finished" [14:44:34] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom eqiad row B <-> cloudsw links - https://phabricator.wikimedia.org/T391489#10726454 (10ayounsi) [14:44:47] `dockerd` is still taking up CPU so it's doing something [14:45:46] No log outputs though? [14:46:26] in the logs I see this for the two times that James tried to deploy [14:46:29] ="Not continuing with push after error: context canceled" [14:46:39] Is that after I did Ctrl-C? [14:47:28] no I see it multiple times starting when you first deployed [14:47:36] Huh. 
[14:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [14:47:45] !log jforrester@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.23,1.44.0-wmf.24 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.disco [14:47:46] very.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.152.0 --label vnd.wikimedia.mediawiki.versions=1.44.0-wmf.23,1.44.0-wmf.24 --label vnd.w [14:47:46] ikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.152.0) (duration: 04m 55s) [14:47:52] !log restart docker on deploy1003 [14:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:01] (03PS1) 10Majavah: bird: Only specify interface for link-local peerings [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) [14:48:02] Hence the scap failure now. [14:48:34] so, I had assumed the high dockerd CPI was preparing the compressed layer blobs for upload [14:48:36] let's retry to see if it is better [14:48:42] Wilco. [14:48:44] since it was *CPU [14:48:57] swfrench-wmf: That is usually the case. 
[14:49:06] probably yes, I wanted to start from a clean state, got carried out :) [14:49:10] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]], [[gerrit:1126659|Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls]] [14:49:14] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [14:49:14] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [14:49:14] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2068 to cirrussearch2068 [14:49:26] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:49:52] weird, this time I don't see any traces of 500s returned by the docker registyr [14:50:32] I just mean that in the cases I've looked at previously, the "hanging for a while" during what is ostensibly the wait for docker-pusher seems to be precisely that [14:50:35] Seems stuck on docker-pusher again. [14:51:12] I _think_ the wait is expected, if indeed it never got done pushing the large layer (i.e., it needs to prepare it for upload again) [14:51:15] [mediawiki-publish] Running sudo /usr/local/bin/docker-pusher… took <1s per the log. [14:51:35] 14:49:39 [mediawiki-publish-81] Running sudo /usr/local/bin/docker-pusher … is the same log second (unless these are in parallel?) [14:51:37] There are two active pushes [14:51:47] Ah, OK. I'll shut up then. [14:51:49] Image building does run in parallel. [14:52:00] Pushes are currently serialized at the daemon level. 
[14:52:34] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:52:58] * swfrench-wmf was naively hoping that dockerd only serialized the upload, but is increasing believing prep is also serialized [14:53:08] I wouldn't expect such as long push unless something was merged that resulted in l10n rebuild. [14:53:20] (03CR) 10Majavah: [V:03+1] "PCC diffs except cephosd1004.eqiad.wmnet and cloudlb2002-dev.codfw.wmnet seem unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:53:24] (03PS4) 10Filippo Giunchedi: Netbox: Update alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [14:54:10] it would be hilarious if all these issues were due to an old docker version on deploy1003 [14:54:25] "hilarious" [14:54:28] dancy: The main patch I merged has new i18n, sadly, yes. [14:54:47] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2068 to cirrussearch2068 - bking@cumin2002" [14:54:51] (03CR) 10Filippo Giunchedi: "Please see PS4" [alerts] - 10https://gerrit.wikimedia.org/r/1135409 (owner: 10Slyngshede) [14:54:51] So maybe this is actually all just Working As Expected™? [14:54:55] Ah, in that case this all makes perfect sense.. Patience is needed. [14:55:00] swfrench-wmf, dancy - do we have an idea why dockerd emits "Not continuing with push after error: context canceled" ? 
[14:55:04] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2068 to cirrussearch2068 - bking@cumin2002" [14:55:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:55:05] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2068 [14:55:16] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2068 [14:55:17] LocalisationCache build took < 40s which used to be the slow bit? [14:55:35] elukey: I'd not seen that one before during previous instances of this [14:55:54] James_F: Do you have the transcript that shows how many langs were rebuilt? [14:55:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2068 to cirrussearch2068 [14:55:57] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2068.codfw.wmnet on all recursors [14:56:01] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2068.codfw.wmnet on all recursors [14:56:31] (03CR) 10Tiziano Fogli: [C:03+2] perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [14:56:51] elukey: Dunno. I would expect to see other surrounding error messages. [14:56:55] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2068.codfw.wmnet with OS bullseye [14:56:57] swfrench-wmf: I've seen it the last time, after we serialized pushes [14:57:01] I love that dockerd logs spell cancelled two different ways ... 
[14:57:06] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2068 [14:57:14] (cancelled and canceled) [14:57:24] but the other time I saw errors from docker distribution, namely 500 [14:57:27] this time nothing [14:57:29] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:57:31] at least from a quick search [14:57:42] so I am very confused [14:58:32] dancy: This time around, 0. I don't have the old log I think. [14:58:44] ok. [14:58:47] top [14:59:51] It's unclear if `dockerd` is making any progress. I'm running periodic `df /srv` and not seeing changes in storage (which I expected to see increasing if it is constructing a compressed image). [15:00:11] P&T meeting! [15:01:20] Should I give up and revert everything and try later? [15:01:43] jouncebot: nowandnext [15:01:43] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [15:01:43] In 1 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1700) [15:02:29] James_F: lemme check one thing [15:02:35] Sure. [15:04:21] I think it finished [15:04:28] no more dockerd taking up cpu [15:04:31] oh, it's back.. aha [15:04:37] No signs of life in the script output. [15:04:48] `Upload failed, retrying: blob upload unknown` [15:04:56] So bad blob again? [15:05:00] at 14:56 and 15:04 [15:05:20] This push started at 14:49 so that'd be the two it's trying, I suppose? [15:05:34] nod [15:05:41] exactly, yeah - it also retries internally [15:06:10] OK, let's aboirt. 
[15:06:15] (03Merged) 10jenkins-bot: perf/navtiming: Add LoadEventEnd alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135393 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [15:06:18] !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]], [[gerrit:1126659|Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls]] (duration: 17m 08s) [15:06:22] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [15:06:23] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [15:06:23] Major DB errors in logspam-watch. [15:06:29] (major counts of...) [15:06:38] (03PS1) 10Jforrester: Revert "Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135458 [15:06:48] Overload.. circuit breaking happening. 
[15:06:58] (03PS1) 10Jforrester: Revert "Switch out various old PHP aliases to the current class names" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135459 [15:07:06] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2068 - bking@cumin2002" [15:07:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2068 - bking@cumin2002" [15:07:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:07:12] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2068.codfw.wmnet 102.48.192.10.in-addr.arpa 2.0.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:13] Eurgh. Should I try to land and scap the reverts? [15:07:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 15.95% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:07:16] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2068.codfw.wmnet 102.48.192.10.in-addr.arpa 2.0.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[15:07:16] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2068 [15:07:30] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2068 [15:07:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2068 [15:07:32] Amir1: You maybe spoke too soon. :-( [15:07:44] siiiiiiigh [15:07:57] swfrench-wmf: very interesting, some new errors about blobs, and I don't see explicit HTTP 500s [15:08:06] let me know when I can deploy that change [15:08:21] swfrench-wmf: shall we revert teh serialization patch? It seems making it worse [15:09:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 4.754s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:09:37] I'm not going to try to deploy anything further, even reverts, without an OK. 
[15:10:24] elukey: yeah, it might be worth reverting the dockerd-based push serialization and trying to it up in the tool [15:10:38] *trying to do it [15:10:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:56] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:11:54] (03PS1) 10Elukey: Revert "role::deployment_server::kubernetes: limit Docker concurrent uploads" [puppet] - 10https://gerrit.wikimedia.org/r/1135460 [15:11:58] !log upgrading to varnish 7.1.1-1.1~bpo11+wmf3 in cp3073 (text) and cp3081 (upload) - T391334 [15:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [15:12:01] swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135460 :( [15:12:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 5.343% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:12:35] (03CR) 10Scott French: [C:03+1] Revert "role::deployment_server::kubernetes: limit Docker concurrent uploads" [puppet] - 10https://gerrit.wikimedia.org/r/1135460 (owner: 10Elukey) [15:12:51] (03CR) 10Elukey: [C:03+2] 
Revert "role::deployment_server::kubernetes: limit Docker concurrent uploads" [puppet] - 10https://gerrit.wikimedia.org/r/1135460 (owner: 10Elukey) [15:14:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 4.666s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:16:04] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2088.codfw.wmnet'] [15:17:15] looking at the logs on registry2004, there's a `time="2025-04-09T15:04:10.135612749Z" level=error msg="response completed with error" err.code="blob upload invalid" err.detail="blob invalid length"` on an upload for one of the large layer blobs (d36db83bc7abae322d017166048686aaa580be4f5f3531b371faa224f79d47ab, push UUID 5931f9d9-88e1-4781-8ff3-dcbe436c3921) [15:17:31] yep saw it as well [15:17:33] nothing new / novel, unfortunately [15:17:51] I am still puzzled why it happens only with these images [15:17:57] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [15:18:03] other than the fact that they're huge [15:18:12] elukey: also I searched and can't find an actual trigger [15:18:24] yeah but the ML team have a lot of big images as well [15:18:28] even worse than this one [15:18:39] Do their layers change often? [15:18:51] the big ones no [15:19:03] (nitpick: Layers never change) [15:19:08] Yes yes. [15:19:14] Layers are like war [15:19:30] Do MW's used layers change a fair bit? 
[15:19:39] I am going to restart docker on deploy1003 [15:20:07] ack [15:20:12] !log restart docker on deploy1003 to revert the push serialization change - T390251 [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:15] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [15:21:08] James_F: Any time there's a change to /srv/mediawiki-staging, a new layer is added on top of the prior image layers. [15:21:10] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:21:29] (unless the changes are too big, in which case it calls back to the base image and a full rsync) [15:21:34] dancy: Yeah, but I meant that the new multi-version each week doesn't have much in common, I think? [15:21:54] dancy: Which means MW might be more likely to run into an issue with big-image-layer-corruption than ML. [15:21:55] Right. During the weekly presync, it's always a fresh, big image. [15:22:07] Ack. [15:22:13] also when the base production image changes, which is also once per week [15:22:25] Do those bumps align on Monday night? [15:22:26] or when there's a big l10n rebuild. [15:22:40] If only we had static JSON files for i18n. [15:22:44] RECOVERY - Dell PowerEdge RAID Controller on db1185 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:22:48] Etc. [15:23:19] swfrench-wmf: you mentioned that the layers are huge, how much compressed size are we talking about? Because from docker history I didn't find a lot of things [15:23:50] and we haven't really got into an occurrence of nginx saturating tmpfs, unless we were pushing multiple images in parallel [15:23:50] I'd guess ~2GB for a compressed two-version image layer. 
[15:23:52] (03PS1) 10Gergő Tisza: mariadb catalog: Fix list formatting in README [puppet] - 10https://gerrit.wikimedia.org/r/1135461 [15:23:55] in the cases I've looked at previously and manually pulled the compressed blob down with curl, it was 2+ GiB [15:24:33] pretty sure ML has way more, we hit the 4G mark recently, it must be 3GB+ now (for some of the big pytorch layers) [15:24:56] and the docker version on build2001 is the same [15:24:58] elukey: are we running the same docker version on build2002 and deploy1003? [15:25:03] Interesting. [15:25:09] ah, you're using 2001 [15:25:53] yeah in the past I've used 2001, but now I have some doubts [15:26:30] (03CR) 10Ladsgroup: [C:03+2] mariadb catalog: Fix list formatting in README [puppet] - 10https://gerrit.wikimedia.org/r/1135461 (owner: 10Gergő Tisza) [15:26:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2088.codfw.wmnet'] [15:27:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10726607 (10cmooney) Hi @Jhancock.wm @papaul as discussed in our call if you could get an old Juniper QFX5100 switch racked in A... [15:27:05] yeah confirmed, the weekly rebuilds are still on 2001 [15:27:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:28:02] having said that, I'd be curious to see deploy1003 on bookworm [15:28:10] to rule out the old dockerd theory [15:28:39] also there must be something that we crossed from the first occurrence of the bad blobs issue [15:28:52] nod. 
[15:29:07] (03CR) 10Ssingh: bird: Only specify interface for link-local peerings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:29:16] maybe number of layers, for some reason? [15:29:23] Unlikely. [15:29:51] definitely I am reasoning out loud, but we need to start getting creative I am afraid :D [15:29:58] nod. Understood. [15:30:10] Max layers is 253. We're in the low teens. [15:30:35] okok, I am more wondering on how the docker registry takes even tens of layers [15:30:50] The registry itself doesn't care too much. It's just a list of hashes. [15:30:58] the swift driver is old and they removed it in 3.x, in favor of the S3 one [15:31:03] (03CR) 10Majavah: [V:03+1] bird: Only specify interface for link-local peerings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:31:39] The manifest lists the layer hashes and that information is very small. 
[15:32:08] Also, the layers have to be uploaded first before the manifest [15:32:15] and it's the layer upload that is the problem [15:32:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:32:49] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:32:50] so, I think the main thing we know is that (at least as far as I'm aware) it's _always_ the push for the large top-most layer on the mediawiki-multiversion image (i.e., only happens when we need to rebuild that - either due to the similarity threshold being crossed, or the base image changing, or something else) [15:33:19] I feel bad about leaving prod in an unclean deployment state. None of the code is live on any wiki yet, but still… And now I'm in meetings (though can of course still pay attention if needed). [15:33:47] James_F: Suggestion, revert the changes, merge, then run `scap prep auto` on the deploy server. [15:33:53] Ack. [15:34:06] James_F: when you have a moment we can retry to deploy, with the new docker config on deploy1003 [15:34:11] or revert [15:34:13] … or I can do that. [15:34:19] Retry is my preference. [15:34:24] let's try it [15:34:36] On deploy1003? [15:34:42] yep [15:34:52] Full scap or just scap prep auto?
[15:35:10] full scap [15:35:17] Going [15:35:43] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:45] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1135084|Move to new async Parsoid fragment provision (T373253 T388546)]], [[gerrit:1135410|Switch out various old PHP aliases to the current class names]], [[gerrit:1126659|Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls]] [15:35:49] T373253: Develop semantic / distinct representation for wikifunctions output in Parsoid DOM - https://phabricator.wikimedia.org/T373253 [15:35:49] T388546: Once the ACF system exists, migrate the wikitext integration to it - https://phabricator.wikimedia.org/T388546 [15:36:09] (03CR) 10Clément Goubert: alertmanager: add task receivers for 4 teams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [15:36:49] I see double CPU usage in `dockerd` now. 
[15:37:39] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:38:32] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2088.codfw.wmnet'] [15:38:40] !log reprepro -C component/nginx-ech include bookworm-wikimedia openssl_3.4.1-1+ech1_amd64.changes: T205378 [15:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:42] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 [15:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:26] (03CR) 10Brouberol: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [15:42:41] (03CR) 10Xcollazo: "Hmm.. not sure why this change was a no-op on `snapshot1016.eqiad.wmnet` as per PPC run." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [15:43:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:43:11] (03CR) 10Nikerabbit: [C:03+1] AX: Enable entry-points on Asturian and Lombard wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135340 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [15:44:14] (03PS2) 10Tiziano Fogli: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [15:45:12] (03PS1) 10Hnowlan: jobrunner: clean up remaining cruft [puppet] - 10https://gerrit.wikimedia.org/r/1135465 [15:45:24] (03PS1) 10Bking: cirrussearch: add row D host as cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) [15:45:49] (03CR) 10CI reject: [V:04-1] cirrussearch: add row D host as cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:46:20] (03PS3) 10Tiziano Fogli: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [15:46:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/f9a53324a43f5d3bbad7f28a00afd0f69c4a8b569a5b32314e5655cd118ff3ac/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [15:46:37] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:47:04] lovely error on deploy1003 [15:47:07] 
(03PS4) 10Tiziano Fogli: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: 10Phedenskog) [15:47:24] Likely fallout from the scap issue? [15:47:31] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1135448 (owner: 10Majavah) [15:47:32] (03PS2) 10Bking: cirrussearch: add row D host as cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) [15:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [15:47:42] It's because of the in-progress push. [15:47:54] FIRING: [2x] CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:47:55] (03CR) 10CI reject: [V:04-1] cirrussearch: add row D host as cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:48:08] RESOLVED: [2x] CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:48:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware 
for hosts ['cirrussearch2088.codfw.wmnet'] [15:48:12] Should I abort? [15:48:23] No. Let's let it try for a while longer. [15:48:29] Ack. [15:48:50] Apr 09 15:42:38 deploy1003 dockerd[3956212]: time="2025-04-09T15:42:38.522888715Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error" [15:48:52] (03PS5) 10Phedenskog: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) [15:49:00] that is from the registry [15:49:02] FIRING: [6x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:05] A 500 from the registry? [15:49:16] yes I've seen it before [15:49:32] (03PS3) 10Bking: cirrussearch: add row D host as cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) [15:49:32] We also run an old version of the registry software, right? [15:49:37] it is part of the joy of the blob issue [15:49:38] (03CR) 10Nikerabbit: [C:03+1] AX: Enable Quick Surveys extension on Asturian and Lombard wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135337 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [15:50:27] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2088.codfw.wmnet with OS bullseye [15:50:31] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2088 [15:50:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2088 [15:50:34] dancy: nope, we run 2.8.2, that was the almost last one up to last week [15:50:40] I just noticed that they released 3.0! [15:50:42] oh good to know! 
[15:50:46] https://github.com/distribution/distribution/releases/tag/v3.0.0 [15:51:13] "oss and swift storage drivers are no longer supported" [15:51:15] (03CR) 10Bking: [C:03+2] "self-merging to fix an active problem" [puppet] - 10https://gerrit.wikimedia.org/r/1135466 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:51:41] yeah we need to move to s3 :( [15:51:46] I think that means that only the S3 driver is supported (which in theory should be compatible w/ swift) [15:52:01] problem is we cannot easily enable it for the mw-swift [15:52:23] we'll have to use another datastore like APUS with proper support (ceph + s3) [15:52:26] but it is a major move [15:53:07] ok no the 500 is not from the registry [15:53:09] /var/lib/nginx/body/0000000779" failed (28: No space left on device) [15:53:26] No space on the registry? [15:53:29] that is the nginx tmpfs (max 4GB) being saturated [15:53:53] Ah, if max is 4GB and we push two 2+GB images in parallel… [15:53:53] we have a tmpfs area for uploads, that is controlled by nginx [15:54:30] yeah [15:54:35] we removed the serialization [15:54:52] So have we switched one issue for another? [15:55:06] in this particular case, yes it seems so [15:55:55] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2068.codfw.wmnet with reason: host reimage [15:56:03] (03PS1) 10FNegri: openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [15:56:08] swfrench-wmf: ==^ [15:56:31] (03CR) 10CI reject: [V:04-1] openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: 10FNegri) [15:56:38] Aha. Movement. [15:56:55] Looks like a push finished?
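Note on the tmpfs saturation discussed above: a rough sketch of the arithmetic, assuming nginx buffers each in-flight upload body fully in the tmpfs until the request completes. The 4 GB cap and the "2+ GB" layer size come from the conversation; the helper name `uploads_fit` and the exact byte figures are illustrative assumptions, not taken from the registry configuration.

```python
GIB = 1024 ** 3


def uploads_fit(tmpfs_bytes, upload_sizes):
    """Return True if all concurrently buffered upload bodies fit in the tmpfs.

    Assumption: every in-flight body occupies its full size in the tmpfs
    at the same time (worst case for parallel pushes).
    """
    return sum(upload_sizes) <= tmpfs_bytes


tmpfs = 4 * GIB          # "max 4GB" per the log
layer = int(2.1 * GIB)   # hypothetical size for a "2+ GB" compressed layer

print(uploads_fit(tmpfs, [layer]))         # one push at a time
print(uploads_fit(tmpfs, [layer, layer]))  # two pushes in parallel
```

Under these assumptions a single large layer fits, while two in parallel exceed the cap, which matches the "No space left on device" error seen once push serialization was removed.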
[15:57:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:57:16] Yes. [15:57:49] Now it's pushing mediawiki-multiversion-cli [16:00:45] Looks like pushes are done [16:00:54] oh nope.. still one [16:01:00] feels close! [16:01:00] Yeah. [16:01:06] Fingers very crossed. [16:01:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2068.codfw.wmnet with reason: host reimage [16:02:52] (going into a meeting, but I'll read) [16:02:57] (03PS2) 10Scott French: Profile::Mediawiki_deployment: add 'dir' field [puppet] - 10https://gerrit.wikimedia.org/r/1135464 (https://phabricator.wikimedia.org/T388761) [16:04:29] Progress again. [16:05:01] Now on sync-masters. [16:05:04] (03CR) 10Btullis: [V:03+1 C:03+2] Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135416 (https://phabricator.wikimedia.org/T389784) (owner: 10Btullis) [16:05:06] Success! (Maybe.) [16:05:16] (03CR) 10Dzahn: [C:03+1] site: revert releases to production role [puppet] - 10https://gerrit.wikimedia.org/r/1135444 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [16:05:34] now we see if the image can successfully be pulled [16:05:52] Let's not borrow pain from the future. [16:06:11] :) [16:06:23] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [16:06:52] And with the layer compression done, disk space fixes itself? [16:07:16] There wasn't an actual space issue.. 
[16:07:31] it was a permission denied issue while the space checker was running [16:07:36] `Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-04-09-153611-publish-81": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-04-09-153611-publish-81": failed commit on ref "layer-sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37": unexpected commit [16:07:36] digest sha256:0a63db0b23f34177711db7533fa52d0cd1091fec4c4e943b4b0910b97df156dd, expected sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37: failed precondition` [16:07:39] Oh, how odd. [16:07:52] swfrench-wmf: Oh dear. [16:08:36] James_F: so, the testservers update is going to fail (if you leave it, it will hit the 10m timeout) [16:09:54] however, what we've seen before is that the issue is transient, or at least cannot be reproduced later on [16:10:06] Hmm, so if I abort and re-scap it might work? 
[16:10:38] I believe so, yeah - let me pull the "bad" blob down manually and confirm whether it is indeed corrupt [16:10:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:11:11] (03PS2) 10FNegri: openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [16:11:14] (03PS1) 10Btullis: Revert "Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1135468 [16:11:40] (03CR) 10CI reject: [V:04-1] openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: 10FNegri) [16:13:45] (03CR) 10Btullis: [C:03+2] Revert "Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps" [puppet] - 10https://gerrit.wikimedia.org/r/1135468 (owner: 10Btullis) [16:14:29] (03PS3) 10FNegri: openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [16:14:30] (03PS1) 10CDobbins: geo-maps: add mapping for Peru [dns] - 10https://gerrit.wikimedia.org/r/1135469 [16:14:57] (03CR) 10CI reject: [V:04-1] openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: 10FNegri) [16:15:34] (03Abandoned) 10CDobbins: geo-maps: update South America DCs (part 2) [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [16:15:45] Coming up to the 10 min limit. [16:16:02] James_F: ack, thanks! 
[16:16:06] (03PS4) 10FNegri: openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) [16:16:33] (03CR) 10CI reject: [V:04-1] openstack: remove references to noauth-project [puppet] - 10https://gerrit.wikimedia.org/r/1135467 (https://phabricator.wikimedia.org/T391486) (owner: 10FNegri) [16:16:35] Lots of noise from helm, but scap seems to be continuing anyway? [16:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:17:16] dancy: elukey: James_F: I just pulled `/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37` from both active registry hosts, and the digest matches [16:17:21] Neat. [16:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [16:17:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2088.codfw.wmnet with reason: host reimage [16:18:07] which is to say, this is consistent with what we've seen before, in that the effect is either (1) transient or (2) (maybe) happening on the registry <- dragonfly -> worker side [16:18:16] scap test-rollout now at 5/12. [16:19:03] Wow this is slow. [16:20:05] Ah, and now it rolled back. [16:20:28] Should I try to scap again?
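Note on the manual blob check mentioned above: a minimal sketch of how a downloaded layer blob could be compared against its expected digest, assuming the blob has already been saved to a local file (for example with curl, as described earlier in the log). The function names `blob_digest` and `blob_matches` are hypothetical helpers for illustration, not part of any registry tooling.

```python
import hashlib


def blob_digest(path, chunk_size=1024 * 1024):
    """Compute the sha256 digest of a file in registry notation ("sha256:<hex>").

    The file is read in chunks so multi-GB layer blobs do not need to fit in memory.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def blob_matches(path, expected_digest):
    """Return True if the file's content hashes to the expected digest."""
    return blob_digest(path) == expected_digest
```

A call such as `blob_matches("layer.blob", "sha256:52c8ffea...")` (file name hypothetical, digest elided here) would return False for a blob that is truncated or otherwise corrupt, which is the condition the "unexpected commit digest" pull error reports on the worker side.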
[16:21:10] Go for it [16:21:11] yeah, basically what happens is update times out -> helm reverts to prior state -> scap reverts /etc/helmfile-defaults/mediawiki/release to prior state -> scap applies that [16:21:20] yeah, +1 to trying again [16:21:43] FYI, I need to step away for a few minutes [16:22:31] Doing so, but logmsgbot died. [16:22:41] Possibly with the huge rollback error log from scap? [16:23:13] I would like to know what logmsgbot's limits are so that we can adjust scap to not exceed them. [16:23:49] Line length is I think ~400 bytes, so maybe that? [16:24:30] `16:23:15 Finished build-and-push-container-images (duration: 00m 54s)` Yay for no-op-ish work. [16:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:29:09] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove x2, add ms{1,2,3} to profile::mariadb::section_ports: [puppet] - 10https://gerrit.wikimedia.org/r/1135385 (https://phabricator.wikimedia.org/T387332) (owner: 10Alexandros Kosiaris) [16:30:55] (03CR) 10Mvolz: "👀" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135428 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [16:32:43] (03CR) 10Alexandros Kosiaris: [C:03+2] scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [16:32:46] PROBLEM - MariaDB read only pc4 #page on pc1014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:33:01] PROBLEM - MariaDB read only es7 on es2039 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:33:17] PROBLEM - MariaDB Event Scheduler pc4 on pc1014 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:33:21] What? [16:33:49] Two different hosts? [16:33:55] PROBLEM - MariaDB read only s5 on db1200 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:33:55] hi [16:33:58] !incidents [16:33:58] 6030 (UNACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:33:58] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:33:58] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:34:00] PROBLEM - MariaDB read only m1 #page on db2232 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:00] PROBLEM - MariaDB read only s7 on db2222 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:00] !ack 6030 [16:34:01] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:34:09] !incidents [16:34:09] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:34:09] 6031 (UNACKED) db2232/MariaDB read only m1 (paged) [16:34:10] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:34:10] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:34:12] !ack 6031 [16:34:12] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:34:15] akosiaris: page related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135385 ? 
[16:34:16] Another host [16:34:18] PROBLEM - MariaDB read only ms1 #page on db1152 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:18] PROBLEM - MariaDB read only s1 on db1184 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:19] Network issues? [16:34:28] marostegui: see that puppet change ^ [16:34:29] !incidents [16:34:29] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:34:29] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:34:29] 6032 (UNACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:34:29] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:34:30] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:34:36] !ack 6032 [16:34:36] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:34:44] They are up [16:34:47] PROBLEM - MariaDB read only s2 on db1233 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:47] PROBLEM - MariaDB read only s2 on db1229 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:34:47] PROBLEM - MariaDB read only s8 on db2152 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:17] PROBLEM - MariaDB read only s5 on db2192 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:18] PROBLEM - MariaDB read only s6 #page on db2229 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:18] PROBLEM - MariaDB Event Scheduler pc7 on pc2017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:35:21] PROBLEM - MariaDB Event Scheduler pc2 on pc2012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:35:35] !incidents [16:35:36] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:35:36] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:35:36] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:35:36] 6033 (UNACKED) db2229 (paged)/MariaDB read only s6 (paged) [16:35:36] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:35:37] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:35:40] !ack 6033 [16:35:41] 6033 (ACKED) db2229 (paged)/MariaDB read only s6 (paged) [16:35:47] PROBLEM - MariaDB read only s4 on db1252 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:48] PROBLEM - MariaDB read only s3 #page on db1223 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:49] PROBLEM - MariaDB read only pc2 on pc2012 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:49] PROBLEM - MariaDB read only s8 on db2163 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:35:58] !incidents [16:35:58] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:35:59] 6031 (ACKED) db2232/MariaDB read only m1 
(paged) [16:35:59] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:35:59] 6033 (ACKED) db2229 (paged)/MariaDB read only s6 (paged) [16:35:59] 6034 (UNACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:35:59] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:36:00] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:36:01] PROBLEM - MariaDB read only x1 on db2196 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:02] PROBLEM - MariaDB read only m5 #page on db2235 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:02] PROBLEM - MariaDB read only s1 on db2212 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:02] PROBLEM - MariaDB read only pc7 on pc2017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:09] so what is happening really [16:36:17] PROBLEM - MariaDB read only s4 on db2219 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:17] PROBLEM - MariaDB Event Scheduler pc6 on pc2016 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:36:31] some of these hosts look fine [16:36:32] this should be network I think [16:36:39] they all are random hosts [16:36:43] <_joe_> network of what? 
[16:36:43] PROBLEM - MariaDB read only s6 on db1180 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:45] PROBLEM - MariaDB read only s5 on db1161 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:55] PROBLEM - MariaDB read only es5 on es1045 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:36:59] PROBLEM - MariaDB read only es3 on es2027 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:00] all the hosts look up [16:37:01] PROBLEM - MariaDB read only pc6 on pc2016 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:37:17] PROBLEM - MariaDB read only s7 on db1170 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:17] PROBLEM - MariaDB read only s1 on db1206 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:18] !incidents [16:37:19] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:37:19] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:37:19] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:37:19] 6033 (ACKED) db2229 (paged)/MariaDB read only s6 
(paged) [16:37:20] 6034 (UNACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:37:20] 6035 (UNACKED) db2235/MariaDB read only m5 (paged) [16:37:20] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:37:20] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:37:27] PROBLEM - MariaDB read only s8 on db1209 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:27] !ack 6034 6035 [16:37:27] Could not ack the alert. Please check the parameters. [16:37:31] !ack 6034 [16:37:31] 6034 (ACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:37:32] !ack 6035 [16:37:32] 6035 (ACKED) db2235/MariaDB read only m5 (paged) [16:37:40] (PS1) Ssingh: Revert "Remove x2, add ms{1,2,3} to profile::mariadb::section_ports:" [puppet] - https://gerrit.wikimedia.org/r/1135471 [16:37:43] PROBLEM - MariaDB read only s3 on db1157 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:43] PROBLEM - MariaDB read only s6 on db1201 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:48] PROBLEM - MariaDB read only s4 #page on db1244 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:48] PROBLEM - MariaDB read only s8 on db1226 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:55] (CR) Herron: [C:+1] Revert "Remove x2, add ms{1,2,3} to profile::mariadb::section_ports:" [puppet] - https://gerrit.wikimedia.org/r/1135471 (owner: Ssingh) [16:37:55] misc host is not even related [16:37:59] PROBLEM - 
MariaDB read only s8 on db2243 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:37:59] (CR) Marostegui: [C:+1] Revert "Remove x2, add ms{1,2,3} to profile::mariadb::section_ports:" [puppet] - https://gerrit.wikimedia.org/r/1135471 (owner: Ssingh) [16:38:01] PROBLEM - MariaDB read only s5 on db2228 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:01] PROBLEM - MariaDB read only s6 on db2193 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:04] (CR) Ssingh: [C:+2] Revert "Remove x2, add ms{1,2,3} to profile::mariadb::section_ports:" [puppet] - https://gerrit.wikimedia.org/r/1135471 (owner: Ssingh) [16:38:17] PROBLEM - MariaDB read only backup1-eqiad on db1205 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:17] PROBLEM - MariaDB read only s7 on db1253 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:30] !log merging above change: CR 1135471 [16:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:34] !incidents [16:38:34] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:38:34] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:38:34] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:38:35] 6033 (ACKED) db2229 (paged)/MariaDB read only s6 (paged) [16:38:35] 6034 (ACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:38:35] 6035 (ACKED) db2235/MariaDB read only m5 (paged) [16:38:35] 6036 (UNACKED) db1244 (paged)/MariaDB read only s4 (paged) [16:38:36] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 
mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:38:36] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:38:42] !ack 6036 [16:38:43] 6036 (ACKED) db1244 (paged)/MariaDB read only s4 (paged) [16:38:44] PROBLEM - MariaDB read only ms2 #page on db1151 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:48] PROBLEM - MariaDB read only pc1 #page on pc1011 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:48] PROBLEM - MariaDB read only es1 on es1029 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:48] PROBLEM - MariaDB read only es1 on es1027 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:51] PROBLEM - MariaDB read only s4 on db2147 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:51] PROBLEM - MariaDB read only es2 on es2031 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:38:55] PROBLEM - MariaDB Event Scheduler pc1 on pc1011 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [16:38:57] !ack 6037 [16:38:57] 6037 (ACKED) db1151 (paged)/MariaDB read only ms2 (paged) [16:38:59] !ack 6038 [16:39:00] 6038 (ACKED) pc1011 (paged)/MariaDB read only pc1 (paged) [16:39:01] PROBLEM - MariaDB read only x1 on db2231 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:17] PROBLEM - MariaDB read only s3 on db1175 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:17] PROBLEM - MariaDB read only s4 on db1247 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:17] PROBLEM - MariaDB read only es1 on es2030 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:47] PROBLEM - MariaDB read only s2 on db1197 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:47] PROBLEM - MariaDB read only s7 on db1181 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:47] PROBLEM - MariaDB read only es6 on es1036 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:47] PROBLEM - MariaDB read only es4 on es1042 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:39:47] PROBLEM - MariaDB read only s5 on db1210 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:40:17] PROBLEM - MariaDB read only s1 on db1169 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:41:00] PROBLEM - MariaDB read only s2 #page on db2207 is CRITICAL: Could not connect to localhost:3306 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:41:17] PROBLEM - MariaDB read only s4 on db1190 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:41:17] PROBLEM - MariaDB read only s3 on db2194 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:41:49] PROBLEM - MariaDB read only s6 on db2169 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:42:01] (PS1) Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - https://gerrit.wikimedia.org/r/1135472 [16:42:09] (CR) Btullis: [V:+2 C:+2] Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - https://gerrit.wikimedia.org/r/1135472 (owner: Btullis) [16:42:45] RECOVERY - MariaDB read only s6 on db1180 is OK: Version 10.6.20-MariaDB-log, Uptime 3655943s, read_only: True, event_scheduler: True, 5125.54 QPS, connection latency: 0.032303s, query latency: 0.000575s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:43:18] (PS2) Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - https://gerrit.wikimedia.org/r/1135472 [16:43:28] (CR) Btullis: [V:+2] Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - https://gerrit.wikimedia.org/r/1135472 (owner: Btullis) [16:44:19] RECOVERY - MariaDB read only s1 on db1169 is OK: Version 10.6.21-MariaDB-log, Uptime 3738154s, read_only: True, event_scheduler: True, 9177.71 QPS, connection latency: 0.022872s, query latency: 0.000499s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:44:47] !log forcing puppet 
run on db2229 [16:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] RECOVERY - MariaDB read only s6 #page on db2229 is OK: Version 10.6.20-MariaDB-log, Uptime 3664405s, read_only: True, event_scheduler: True, 128.09 QPS, connection latency: 0.028276s, query latency: 0.000624s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:45:21] sigh [16:45:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10726873 (phaultfinder) [16:46:26] !log sudo cumin -b11 "O:mariadb::core" "run-puppet-agent" [16:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:52] (CR) Volans: "reply inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: Volans) [16:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [16:47:49] RECOVERY - MariaDB read only s8 on db2152 is OK: Version 10.6.21-MariaDB-log, Uptime 2975534s, read_only: True, event_scheduler: True, 635.26 QPS, connection latency: 0.026145s, query latency: 0.000549s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:47:51] RECOVERY - MariaDB read only s4 on db2147 is OK: Version 10.6.21-MariaDB-log, Uptime 3148231s, read_only: True, event_scheduler: True, 1419.79 QPS, connection latency: 0.024724s, query latency: 0.000507s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:48:02] RECOVERY - MariaDB read only s2 #page on db2207 is OK: Version 10.6.20-MariaDB-log, Uptime 6775105s, read_only: True, event_scheduler: True, 177.52 QPS, connection latency: 0.030829s, query 
latency: 0.000863s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:48:31] nice one sukhe [16:48:31] RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 53, active_shards: 91, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 14, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [16:48:31] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:48:31] RECOVERY - OpenSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 53, active_shards: 91, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 14, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [16:48:31] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:48:41] !incidents [16:48:41] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:48:42] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:48:42] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:48:42] 6034 (ACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:48:42] 6035 (ACKED) db2235/MariaDB read only m5 (paged) [16:48:42] 6036 (ACKED) db1244 (paged)/MariaDB read only s4 (paged) [16:48:43] 6037 (ACKED) db1151 (paged)/MariaDB read only ms2 (paged) [16:48:43] 6038 (ACKED) pc1011 (paged)/MariaDB read only pc1 (paged) [16:48:43] 6039 (RESOLVED) db2207 (paged)/MariaDB read only s2 
(paged) [16:48:44] 6033 (RESOLVED) db2229 (paged)/MariaDB read only s6 (paged) [16:48:44] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:48:45] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:48:51] RECOVERY - MariaDB read only s8 on db2163 is OK: Version 10.6.21-MariaDB-log, Uptime 3307198s, read_only: True, event_scheduler: True, 2056.06 QPS, connection latency: 0.024623s, query latency: 0.000451s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:48:51] RECOVERY - MariaDB read only s6 on db2169 is OK: Version 10.6.20-MariaDB-log, Uptime 1414544s, read_only: True, event_scheduler: True, 1322.16 QPS, connection latency: 0.024697s, query latency: 0.000578s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:49:15] RECOVERY - OpenSearch health check for shards on 9200 on relforge1008 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 53, active_shards: 106, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [16:49:15] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.06542056074767 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:50:03] James_F: Everything ok? [16:50:09] Err. [16:50:11] Yes? [16:50:19] RECOVERY - MariaDB read only s5 on db2192 is OK: Version 10.6.21-MariaDB-log, Uptime 1318784s, read_only: True, event_scheduler: True, 37.36 QPS, connection latency: 0.030633s, query latency: 0.001189s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:50:42] scap finished but also there was a connection error for the log step. 
[16:50:46] But yeah, all seems good. [16:50:57] Do you have a transcript? I'd like to see that. [16:51:01] RECOVERY - MariaDB read only x1 on db2196 is OK: Version 10.6.21-MariaDB-log, Uptime 1393283s, read_only: True, event_scheduler: True, 1026.54 QPS, connection latency: 0.030658s, query latency: 0.000984s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:51:03] RECOVERY - MariaDB read only s6 on db2193 is OK: Version 10.6.20-MariaDB-log, Uptime 3580736s, read_only: True, event_scheduler: True, 637.22 QPS, connection latency: 0.033425s, query latency: 0.000912s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:51:19] RECOVERY - MariaDB read only s3 on db2194 is OK: Version 10.6.20-MariaDB-log, Uptime 6223789s, read_only: True, event_scheduler: True, 2635.15 QPS, connection latency: 0.030859s, query latency: 0.000901s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:51:41] dancy: https://phabricator.wikimedia.org/P74819 [16:52:01] RECOVERY - MariaDB read only s7 on db2222 is OK: Version 10.6.20-MariaDB-log, Uptime 5545888s, read_only: True, event_scheduler: True, 4296.62 QPS, connection latency: 0.036407s, query latency: 0.000733s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:52:03] RECOVERY - MariaDB read only s1 on db2212 is OK: Version 10.6.21-MariaDB-log, Uptime 1395348s, read_only: True, event_scheduler: True, 6542.76 QPS, connection latency: 0.025495s, query latency: 0.000997s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:52:19] RECOVERY - MariaDB read only s4 on db2219 is OK: Version 10.6.21-MariaDB-log, Uptime 3463561s, read_only: True, event_scheduler: True, 2557.32 QPS, connection latency: 0.034564s, query latency: 0.000956s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:53:03] RECOVERY - MariaDB read only x1 on db2231 is OK: Version 10.6.20-MariaDB-log, Uptime 7804689s, read_only: True, event_scheduler: True, 1182.87 QPS, connection latency: 0.031033s, query latency: 0.000763s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:53:34] !incidents [16:53:35] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [16:53:35] 6031 (ACKED) db2232/MariaDB read only m1 (paged) [16:53:35] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [16:53:35] 6034 (ACKED) db1223 (paged)/MariaDB read only s3 (paged) [16:53:35] 6035 (ACKED) db2235/MariaDB read only m5 (paged) [16:53:36] 6036 (ACKED) db1244 (paged)/MariaDB read only s4 (paged) [16:53:36] 6037 (ACKED) db1151 (paged)/MariaDB read only ms2 (paged) [16:53:36] 6038 (ACKED) pc1011 (paged)/MariaDB read only pc1 (paged) [16:53:37] 6039 (RESOLVED) db2207 (paged)/MariaDB read only s2 (paged) [16:53:37] 6033 (RESOLVED) db2229 (paged)/MariaDB read only s6 (paged) [16:53:38] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:53:38] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [16:53:45] RECOVERY - MariaDB read only s3 on db1157 is OK: Version 10.6.20-MariaDB-log, Uptime 6232785s, read_only: True, event_scheduler: True, 8947.24 QPS, connection latency: 0.026545s, query latency: 0.000432s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:53:47] thanks herron. 
the other resolves should come in shortly [16:53:56] 40% done (76/193) [16:54:01] RECOVERY - MariaDB read only s8 on db2243 is OK: Version 10.6.21-MariaDB-log, Uptime 722082s, read_only: True, event_scheduler: True, 3970.16 QPS, connection latency: 0.028221s, query latency: 0.001047s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:54:12] sukhe: ahh ok [16:54:19] RECOVERY - MariaDB read only s7 on db1170 is OK: Version 10.6.20-MariaDB-log, Uptime 5277100s, read_only: True, event_scheduler: True, 12904.91 QPS, connection latency: 0.032859s, query latency: 0.000502s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:55:19] RECOVERY - MariaDB read only s3 on db1175 is OK: Version 10.6.20-MariaDB-log, Uptime 6172206s, read_only: True, event_scheduler: True, 8800.90 QPS, connection latency: 0.026943s, query latency: 0.000448s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:55:19] RECOVERY - MariaDB read only s1 on db1184 is OK: Version 10.6.21-MariaDB-log, Uptime 3147207s, read_only: True, event_scheduler: True, 13112.35 QPS, connection latency: 0.032849s, query latency: 0.000540s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:55:47] RECOVERY - MariaDB read only s7 on db1181 is OK: Version 10.6.20-MariaDB-log, Uptime 3726220s, read_only: True, event_scheduler: True, 15870.25 QPS, connection latency: 0.027167s, query latency: 0.000633s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:55:54] James_F, dancy going afk, thanks a lot for the long deployment! [16:56:02] +1, thanks all. [16:56:06] elukey: Thanks for the support! 
[16:56:19] RECOVERY - MariaDB read only s4 on db1190 is OK: Version 10.6.21-MariaDB-log, Uptime 3229248s, read_only: True, event_scheduler: True, 7015.20 QPS, connection latency: 0.024872s, query latency: 0.000488s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:56:47] RECOVERY - MariaDB read only s2 on db1197 is OK: Version 10.6.21-MariaDB-log, Uptime 2953785s, read_only: True, event_scheduler: True, 10624.34 QPS, connection latency: 0.025514s, query latency: 0.000556s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:57:19] RECOVERY - MariaDB read only s1 on db1206 is OK: Version 10.6.20-MariaDB-log, Uptime 3647337s, read_only: True, event_scheduler: True, 1065.71 QPS, connection latency: 0.034419s, query latency: 0.000896s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:57:29] RECOVERY - MariaDB read only s8 on db1209 is OK: Version 10.6.21-MariaDB-log, Uptime 2975542s, read_only: True, event_scheduler: True, 2497.37 QPS, connection latency: 0.026332s, query latency: 0.000990s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:57:45] RECOVERY - MariaDB read only s6 on db1201 is OK: Version 10.6.20-MariaDB-log, Uptime 3555429s, read_only: True, event_scheduler: True, 5350.28 QPS, connection latency: 0.026329s, query latency: 0.000497s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:57:49] RECOVERY - MariaDB read only s5 on db1210 is OK: Version 10.6.20-MariaDB-log, Uptime 6077206s, read_only: True, event_scheduler: True, 4263.19 QPS, connection latency: 0.033450s, query latency: 0.001106s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:57:57] RECOVERY - MariaDB read only s5 on db1200 is OK: Version 10.6.20-MariaDB-log, Uptime 6065672s, read_only: True, event_scheduler: True, 
2739.60 QPS, connection latency: 0.023790s, query latency: 0.000606s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:58:49] RECOVERY - MariaDB read only s2 on db1229 is OK: Version 10.6.20-MariaDB-log, Uptime 5551021s, read_only: True, event_scheduler: True, 10511.30 QPS, connection latency: 0.034337s, query latency: 0.000905s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:58:50] RECOVERY - MariaDB read only s3 #page on db1223 is OK: Version 10.6.20-MariaDB-log, Uptime 6249849s, read_only: False, event_scheduler: True, 1595.86 QPS, connection latency: 0.030974s, query latency: 0.000916s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:58:50] RECOVERY - MariaDB read only s8 on db1226 is OK: Version 10.6.21-MariaDB-log, Uptime 3061761s, read_only: True, event_scheduler: True, 9106.40 QPS, connection latency: 0.042008s, query latency: 0.000919s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:59:50] RECOVERY - MariaDB read only s4 #page on db1244 is OK: Version 10.6.20-MariaDB-log, Uptime 3571017s, read_only: False, event_scheduler: True, 2065.08 QPS, connection latency: 0.034087s, query latency: 0.000898s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:59:50] RECOVERY - MariaDB read only s2 on db1233 is OK: Version 10.6.20-MariaDB-log, Uptime 2963799s, read_only: True, event_scheduler: True, 10341.41 QPS, connection latency: 0.035097s, query latency: 0.001073s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:59:53] !log [END] sudo cumin -b11 "O:mariadb::core" "run-puppet-agent" [16:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:59] !incidents [17:00:00] 6030 (ACKED) pc1014 (paged)/MariaDB read only pc4 (paged) [17:00:00] 6031 (ACKED) 
db2232/MariaDB read only m1 (paged) [17:00:00] 6032 (ACKED) db1152 (paged)/MariaDB read only ms1 (paged) [17:00:01] 6035 (ACKED) db2235/MariaDB read only m5 (paged) [17:00:01] 6037 (ACKED) db1151 (paged)/MariaDB read only ms2 (paged) [17:00:01] 6038 (ACKED) pc1011 (paged)/MariaDB read only pc1 (paged) [17:00:01] 6036 (RESOLVED) db1244 (paged)/MariaDB read only s4 (paged) [17:00:01] 6034 (RESOLVED) db1223 (paged)/MariaDB read only s3 (paged) [17:00:02] 6039 (RESOLVED) db2207 (paged)/MariaDB read only s2 (paged) [17:00:02] 6033 (RESOLVED) db2229 (paged)/MariaDB read only s6 (paged) [17:00:03] 6029 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [17:00:03] 6028 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [17:00:05] swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1700). 
[17:00:19] RECOVERY - MariaDB Event Scheduler pc4 on pc1014 is OK: Version 10.6.20-MariaDB-log, Uptime 7870662s, read_only: False, event_scheduler: True, 2061.24 QPS, connection latency: 0.031005s, query latency: 0.000638s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:00:36] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10726989 (phaultfinder) [17:00:48] RECOVERY - MariaDB read only pc4 #page on pc1014 is OK: Version 10.6.20-MariaDB-log, Uptime 7870691s, read_only: False, event_scheduler: True, 1952.37 QPS, connection latency: 0.024553s, query latency: 0.000586s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:01:02] RECOVERY - MariaDB read only m1 #page on db2232 is OK: Version 10.6.20-MariaDB-log, Uptime 7448058s, read_only: True, event_scheduler: True, 314.27 QPS, connection latency: 0.033744s, query latency: 0.001121s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:01:03] RECOVERY - MariaDB read only es7 on es2039 is OK: Version 10.6.20-MariaDB-log, Uptime 3717500s, read_only: True, event_scheduler: True, 174.38 QPS, connection latency: 0.033422s, query latency: 0.000988s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:01:19] o/ I'm around but plan to take no action until the second part of the infra window [17:01:20] RECOVERY - MariaDB read only ms1 #page on db1152 is OK: Version 10.11.11-MariaDB-log, Uptime 39883s, read_only: False, event_scheduler: True, 4679.24 QPS, connection latency: 0.032308s, query latency: 0.000545s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:01:35] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2085.codfw.wmnet on all recursors [17:01:39] !log bking@cumin2002 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) cirrussearch2085.codfw.wmnet on all recursors [17:01:43] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2109.codfw.wmnet on all recursors [17:01:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2109.codfw.wmnet on all recursors [17:01:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [17:01:54] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [17:02:21] RECOVERY - MariaDB Event Scheduler pc2 on pc2012 is OK: Version 10.6.20-MariaDB-log, Uptime 6596937s, read_only: False, event_scheduler: True, 1195.92 QPS, connection latency: 0.024550s, query latency: 0.000551s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:02:51] RECOVERY - MariaDB read only pc2 on pc2012 is OK: Version 10.6.20-MariaDB-log, Uptime 6596967s, read_only: False, event_scheduler: True, 1531.30 QPS, connection latency: 0.025654s, query latency: 0.000763s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:03:02] RECOVERY - MariaDB read only m5 #page on db2235 is OK: Version 10.6.20-MariaDB-log, Uptime 7366957s, read_only: True, event_scheduler: True, 15.92 QPS, connection latency: 0.034120s, query latency: 0.000984s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:03:18] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135454 (https://phabricator.wikimedia.org/T390510) (owner: Ladsgroup) [17:03:19] RECOVERY - MariaDB Event Scheduler pc7 on pc2017 is OK: Version 10.6.20-MariaDB-log, Uptime 4066606s, read_only: False, event_scheduler: True, 1365.58 QPS, connection 
latency: 0.031936s, query latency: 0.000811s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:03:49] RECOVERY - MariaDB read only s4 on db1252 is OK: Version 10.6.21-MariaDB-log, Uptime 3211771s, read_only: True, event_scheduler: True, 349.08 QPS, connection latency: 0.034394s, query latency: 0.001214s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:04:03] RECOVERY - MariaDB read only pc7 on pc2017 is OK: Version 10.6.20-MariaDB-log, Uptime 4066650s, read_only: False, event_scheduler: True, 1277.55 QPS, connection latency: 0.030408s, query latency: 0.001167s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:04:09] (Merged) jenkins-bot: Increase max db connection count before circuit breaking [mediawiki-config] - https://gerrit.wikimedia.org/r/1135454 (https://phabricator.wikimedia.org/T390510) (owner: Ladsgroup) [17:04:33] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [17:04:36] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135454|Increase max db connection count before circuit breaking (T390510)]] [17:04:37] !log forcing rechecks for pc1011 and db1151 [17:04:39] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [17:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:47] RECOVERY - MariaDB read only s5 on db1161 is OK: Version 10.6.20-MariaDB-log, Uptime 5994964s, read_only: True, event_scheduler: True, 2512.47 QPS, connection latency: 0.024819s, query latency: 0.000535s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:04:48] !log bking@cumin2002 END (ERROR) - Cookbook 
sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [17:04:57] RECOVERY - MariaDB read only es5 on es1045 is OK: Version 10.6.20-MariaDB-log, Uptime 7202061s, read_only: True, event_scheduler: True, 531.91 QPS, connection latency: 0.032824s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:05:01] RECOVERY - MariaDB read only es3 on es2027 is OK: Version 10.6.20-MariaDB-log, Uptime 5637542s, read_only: True, event_scheduler: True, 101.97 QPS, connection latency: 0.025779s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:05:03] RECOVERY - MariaDB read only pc6 on pc2016 is OK: Version 10.6.20-MariaDB-log, Uptime 10208305s, read_only: False, event_scheduler: True, 1136.86 QPS, connection latency: 0.032917s, query latency: 0.001163s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:05:17] RECOVERY - MariaDB read only backup1-eqiad on db1205 is OK: Version 10.6.20-MariaDB-log, Uptime 6667466s, read_only: True, event_scheduler: True, 13.39 QPS, connection latency: 0.024426s, query latency: 0.000518s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:05:19] RECOVERY - MariaDB Event Scheduler pc6 on pc2016 is OK: Version 10.6.20-MariaDB-log, Uptime 10208321s, read_only: False, event_scheduler: True, 1287.49 QPS, connection latency: 0.041013s, query latency: 0.000988s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:06:03] RECOVERY - MariaDB read only s5 on db2228 is OK: Version 10.6.20-MariaDB-log, Uptime 7878124s, read_only: True, event_scheduler: True, 648.54 QPS, connection latency: 0.031917s, query latency: 0.001016s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:19] RECOVERY - 
MariaDB read only s7 on db1253 is OK: Version 10.6.21-MariaDB-log, Uptime 2934256s, read_only: True, event_scheduler: True, 10592.75 QPS, connection latency: 0.035753s, query latency: 0.000780s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:19] RECOVERY - MariaDB read only es1 on es2030 is OK: Version 10.6.20-MariaDB-log, Uptime 6162435s, read_only: True, event_scheduler: True, 61.24 QPS, connection latency: 0.028242s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:38] FIRING: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:06:45] RECOVERY - MariaDB Event Scheduler pc1 on pc1011 is OK: Version 10.11.11-MariaDB-log, Uptime 2622085s, read_only: False, event_scheduler: True, 1783.91 QPS, connection latency: 0.024874s, query latency: 0.000532s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [17:06:46] RECOVERY - MariaDB read only pc1 #page on pc1011 is OK: Version 10.11.11-MariaDB-log, Uptime 2622085s, read_only: False, event_scheduler: True, 1790.90 QPS, connection latency: 0.029441s, query latency: 0.000443s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:50] RECOVERY - MariaDB read only ms2 #page on db1151 is OK: Version 10.11.11-MariaDB-log, Uptime 123622s, read_only: False, event_scheduler: True, 4249.22 QPS, connection latency: 0.028017s, query latency: 0.000647s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:50] RECOVERY - MariaDB read only es1 on es1029 is OK: Version 
10.6.16-MariaDB-log, Uptime 13413556s, read_only: True, event_scheduler: True, 159.68 QPS, connection latency: 0.025087s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:50] RECOVERY - MariaDB read only es1 on es1027 is OK: Version 10.6.20-MariaDB-log, Uptime 5306747s, read_only: True, event_scheduler: True, 126.37 QPS, connection latency: 0.032493s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:51] RECOVERY - MariaDB read only es6 on es1036 is OK: Version 10.6.20-MariaDB-log, Uptime 3723800s, read_only: True, event_scheduler: True, 614.35 QPS, connection latency: 0.024868s, query latency: 0.000935s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:51] RECOVERY - MariaDB read only es2 on es2031 is OK: Version 10.6.16-MariaDB-log, Uptime 11422941s, read_only: True, event_scheduler: True, 97.32 QPS, connection latency: 0.025255s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:57] cool, that's the list [17:07:11] cheers thanks sukhe [17:07:13] FIRING: [6x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2085:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:07:16] herron: <3 [17:07:18] SRE-swift-storage, Commons, MediaWiki-File-management: Unable to restore File:Blason_famille_fr_de-Lichy_(2).svg - https://phabricator.wikimedia.org/T387340#10727026 (Ebrahim) Is it possible someone to either https://commons.wikimedia.org/wiki/File:Ettelaat13130918.pdf to the already deleted https://...
[17:07:19] RECOVERY - MariaDB read only s4 on db1247 is OK: Version 10.6.21-MariaDB-log, Uptime 3200793s, read_only: True, event_scheduler: True, 4856.71 QPS, connection latency: 0.038296s, query latency: 0.000937s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:07:49] RECOVERY - MariaDB read only es4 on es1042 is OK: Version 10.6.20-MariaDB-log, Uptime 7815473s, read_only: True, event_scheduler: True, 553.48 QPS, connection latency: 0.032630s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:08:16] (CR) Andrew Bogott: [C:+2] magnum.conf: remove a bunch of marked-out config options [puppet] - https://gerrit.wikimedia.org/r/1135452 (owner: Andrew Bogott) [17:11:26] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1135454|Increase max db connection count before circuit breaking (T390510)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:11:29] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [17:11:38] RESOLVED: CirrusSearchJVMGCYoungPoolInsufficient: Elasticsearch instance cirrussearch2090-production-search-omega-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [17:12:27] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:12:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2090-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:17:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch2090 is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: yellow, timed_out: False, number_of_nodes: 61, number_of_data_nodes: 61, discovered_master: True, active_primary_shards: 1352, active_shards: 4164, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pendin [17:17:19] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.61722488038276 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:20] (PS1) Bking: cirrussearch: remove all references to non-row-A hosts [puppet] - https://gerrit.wikimedia.org/r/1135474 (https://phabricator.wikimedia.org/T388610) [17:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [17:17:49] Amir1: Sorry, didn't ping you when the broken scap finally landed – in case you still need to do something. [17:18:26] already doing it! [17:18:28] no worries [17:18:33] <3 [17:19:40] !log apt1002 - updating thirdparty/gitlab-bullseye gitlab-ce package version [17:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:24] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135454|Increase max db connection count before circuit breaking (T390510)]] (duration: 16m 47s) [17:21:27] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [17:21:54] (CR) Bking: [C:-1] "We need a newer version of this file, don't merge yet!"
[puppet] - https://gerrit.wikimedia.org/r/1135474 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [17:22:03] RECOVERY - OpenSearch health check for shards on 9400 on cirrussearch2090 is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 30, number_of_data_nodes: 30, discovered_master: True, active_primary_shards: 1704, active_shards: 5111, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number [17:22:03] ing_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:25:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:26:47] (PS2) Bking: cirrussearch: remove all references to non-row-A hosts [puppet] - https://gerrit.wikimedia.org/r/1135474 (https://phabricator.wikimedia.org/T388610) [17:27:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:27:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [17:27:19] (CR) Ssingh: "Thanks for the patch!
Looks good, but discussing a bit more here to follow up on our conversation from yesterday and for posterity for fut" [dns] - https://gerrit.wikimedia.org/r/1135469 (owner: CDobbins) [17:29:08] (CR) Bking: [C:+2] "fixed regex, self-merging to unblock migration" [puppet] - https://gerrit.wikimedia.org/r/1135474 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [17:30:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2090:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:33:33] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [17:33:36] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [17:33:46] Amir1: are you finished with your backports for now?
[17:33:58] swfrench-wmf: si [17:34:02] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [17:34:14] Amir1: great, thank you [17:36:00] (PS1) Bking: Revert "cirrussearch: remove all references to non-row-A hosts" [puppet] - https://gerrit.wikimedia.org/r/1135477 [17:36:14] (CR) Bking: [V:+2 C:+2] Revert "cirrussearch: remove all references to non-row-A hosts" [puppet] - https://gerrit.wikimedia.org/r/1135477 (owner: Bking) [17:36:14] (CR) Scott French: [C:+2] scap: Use PHP 8.1 when executing maintenance scripts [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French) [17:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [17:38:02] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [17:38:40] FYI, I'll be running puppet-agent and then scap on deploy1003 shortly [17:39:39] (PS1) Eevans: corto: use #acl*security for new incidents [puppet] - https://gerrit.wikimedia.org/r/1135478 (https://phabricator.wikimedia.org/T389664) [17:41:16] (CR) Eevans: [C:+2] corto: use #acl*security for new incidents [puppet] - https://gerrit.wikimedia.org/r/1135478 (https://phabricator.wikimedia.org/T389664) (owner: Eevans) [17:45:41] !log swfrench@deploy1003 Started scap sync-world: Test stop-before-sync scap run after switching to PHP 8.1 container image for maintenance scripts - T390225 [17:45:44] T390225: Migrate scap's maintenance script invocations to PHP 8.1 -
https://phabricator.wikimedia.org/T390225 [17:45:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [17:46:16] !log swfrench@deploy1003 Stopping before sync operations [17:47:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:47:11] !log swfrench@deploy1003 Started scap sync-world: Test scap run after switching to PHP 8.1 container image for maintenance scripts - T390225 [17:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [17:48:13] swfrench-wmf: Thank you again for all your work on 8.1. As you can imagine, I'm very much looking forward to the day we (I) can drop PHP 7.4 from CI and so save so much CPU time. :-) [17:50:04] !log swfrench@deploy1003 Finished scap sync-world: Test scap run after switching to PHP 8.1 container image for maintenance scripts - T390225 (duration: 03m 10s) [17:51:07] James_F: thank you for all your help as well! I am very much looking forward to when that happens :) [17:51:16] Sadly, not today. [17:51:18] But soon. [17:51:32] And then I can burn buster out of existence from CI, finally. [17:51:43] boulder continues to roll :) [17:52:26] Next up, moving CI (and prod) image from bullseye to bookworm. [17:52:38] There's always something. :-) [17:53:08] SRE-OnFire, Incident Tooling, Patch-For-Review: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10727268 (Eevans) Open→Resolved Done [17:53:34] the contint servers themselves could use some bookworm [18:00:05] brennen and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T1800). [18:00:13] o/ [18:02:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2068-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:02:41] are we, uh, currently in a train-deployable state? [18:04:19] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 9 hosts with reason: adding net-new role [18:04:56] brennen: Should be. [18:05:10] brennen: T390251 may return, but yeah you should be good to proceed. [18:05:10] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [18:05:26] ack, going ahead. 
[18:06:10] !log 1.44.0-wmf.24 train status (T386219): logs quiet, no current blockers, moving to group1 [18:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:13] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [18:06:30] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135480 (https://phabricator.wikimedia.org/T386219) [18:06:31] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135480 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot) [18:07:39] (Merged) jenkins-bot: group1 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135480 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot) [18:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [18:20:18] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.24 refs T386219 [18:20:22] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [18:20:43] FIRING: [5x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:39] (PS1) Awight: Temporarily revoke ssh key for travel [puppet] - https://gerrit.wikimedia.org/r/1135481 [18:25:39] (CR) Awight: [C:+1] "Ideally this can be merged before April 12, sorry for the short notice!
I'll try to be available over IRC and email if there are any ques" [puppet] - https://gerrit.wikimedia.org/r/1135481 (owner: Awight) [18:32:04] (Abandoned) Jforrester: Revert "Add wikifunctionsclient dblist for production wikis that allow embedding Wikifunctions calls" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135458 (owner: Jforrester) [18:32:08] (Abandoned) Jforrester: Revert "Switch out various old PHP aliases to the current class names" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135459 (owner: Jforrester) [18:40:48] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [18:45:20] (PS4) Scott French: [WIP] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: Btullis) [18:45:30] (CR) Dzahn: [C:+2] community_civicrm: add stub for dovecot_passwd [labs/private] - https://gerrit.wikimedia.org/r/1124204 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [18:45:31] (CR) Dzahn: [V:+2 C:+2] community_civicrm: add stub for dovecot_passwd [labs/private] - https://gerrit.wikimedia.org/r/1124204 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [18:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [18:48:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release [18:48:39] (CR) CDobbins: "Oops, sorry about that!
I just re-ran the notebook and the data should be fixed" [dns] - https://gerrit.wikimedia.org/r/1135469 (owner: CDobbins) [18:52:15] Zero (non-junk) errors in logspam-watch at the moment (15 minute interval). So blank. So beautiful. [18:57:37] (CR) Genoveva Galarza: "Ohhh was this the issue?" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135459 (owner: Jforrester) [19:07:43] (PS1) Andrew Bogott: facter: Disable cloud.provider and ec2 facts [puppet] - https://gerrit.wikimedia.org/r/1135482 [19:09:15] dancy: +1 [19:10:32] (CR) Scott French: [C:+1] mwcron: Allow setting ttlsecondsafterfinished [puppet] - https://gerrit.wikimedia.org/r/1135040 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert) [19:13:25] (PS2) Andrew Bogott: facter: Disable cloud.provider and ec2 facts [puppet] - https://gerrit.wikimedia.org/r/1135482 [19:14:23] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [19:15:43] FIRING: [5x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:15] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 107, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flig [19:16:15] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:17:13] FIRING: [4x] SystemdUnitFailed:
opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:17:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [19:24:15] !log fab@deploy1003 Started deploy [airflow-dags/research@ea5f3de]: (no justification provided) [19:24:55] !log fab@deploy1003 Finished deploy [airflow-dags/research@ea5f3de]: (no justification provided) (duration: 00m 41s) [19:26:45] (CR) JHathaway: [C:+1] "look good, one request" [puppet] - https://gerrit.wikimedia.org/r/1135482 (owner: Andrew Bogott) [19:29:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10727523 (phaultfinder) [19:29:39] ops-eqiad, DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391520 (phaultfinder) NEW [19:30:24] (PS3) Andrew Bogott: facter: Disable cloud.provider and ec2 facts [puppet] - https://gerrit.wikimedia.org/r/1135482 [19:32:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:32:38] (CR) JHathaway: [C:+1] facter: Disable cloud.provider and ec2 facts [puppet] - https://gerrit.wikimedia.org/r/1135482 (owner: Andrew Bogott) [19:32:47] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on
relforge1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:32:53] (CR) Andrew Bogott: [C:+2] facter: Disable cloud.provider and ec2 facts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135482 (owner: Andrew Bogott) [19:35:51] !log dancy@deploy1003 Installing scap version "4.153.0" for 2 host(s) [19:37:38] !log dancy@deploy1003 Installation of scap version "4.153.0" completed for 2 hosts [19:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:02] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS.
[20:00:15] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f93ed4ad1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [20:00:15] org/wiki/Search%23Administration [20:00:44] (CR) Dwisehaupt: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [20:00:50] (CR) Dwisehaupt: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt) [20:02:00] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10727593 (Jclark-ctr) [20:02:15] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10727596 (Jclark-ctr) Open→Resolved a:Jclark-ctr installed 2x ssd into each server [20:02:36] SRE-OnFire, Data-Persistence, Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10727600 (Eevans) [20:03:47] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:03:55] SRE-OnFire, Data-Persistence, Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10727602 (Eevans) > Although there are various ways we could make an alert on disk space
fairly sophisticated (e.g., extrap... [20:04:02] FIRING: [2x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:13] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [20:10:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10727615 (phaultfinder) [20:10:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10727614 (phaultfinder) [20:12:24] (PS6) Tiziano Fogli: perf/navtiming: Add CPU long task alert to alertmanager [alerts] - https://gerrit.wikimedia.org/r/1135408 (https://phabricator.wikimedia.org/T325283) (owner: Phedenskog) [20:14:36] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10727618 (phaultfinder) [20:17:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:19:15] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 107, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:19:15] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:20:19] (PS1) Andrew Bogott: Revert "nova vendor data: don't upgrade packages during cloud-init." [puppet] - https://gerrit.wikimedia.org/r/1135486 [20:20:59] (PS2) Andrew Bogott: Revert "nova vendor data: don't upgrade packages during cloud-init." [puppet] - https://gerrit.wikimedia.org/r/1135486 [20:22:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:26] (CR) Jforrester: "No, just collateral."
[extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135459 (owner: 10Jforrester) [20:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10727642 (10phaultfinder) [20:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:29:14] (03CR) 10Andrew Bogott: [C:03+2] Revert "nova vendor data: don't upgrade packages during cloud-init." [puppet] - 10https://gerrit.wikimedia.org/r/1135486 (owner: 10Andrew Bogott) [20:34:06] 07Puppet: facter cli throws an exception on hosts with lvm - https://phabricator.wikimedia.org/T391526 (10jhathaway) 03NEW [20:35:00] 07Puppet: facter cli throws an exception on hosts with lvm - https://phabricator.wikimedia.org/T391526#10727701 (10jhathaway) p:05Triage→03Low a:03jhathaway [20:35:15] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb9367dd1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [20:35:15] org/wiki/Search%23Administration [20:36:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3367 MB (3% inode=98%): /tmp 3367 MB (3% inode=98%): /var/tmp 3367 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [20:37:42] FIRING: 
AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [20:39:02] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:50] (03PS1) 10Cwhite: logstash: transform err field usage to hash [puppet] - 10https://gerrit.wikimedia.org/r/1135491 (https://phabricator.wikimedia.org/T228380) [20:41:26] (03PS2) 10Cwhite: logstash: transform err field usage to hash [puppet] - 10https://gerrit.wikimedia.org/r/1135491 [20:41:40] (03PS3) 10Cwhite: logstash: transform err field usage to hash [puppet] - 10https://gerrit.wikimedia.org/r/1135491 [20:49:15] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 107, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [20:49:15] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:52:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:44] (03CR) 10Cwhite: [C:03+2] logstash: transform err 
field usage to hash [puppet] - 10https://gerrit.wikimedia.org/r/1135491 (owner: 10Cwhite) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T2100) [21:00:14] (03PS1) 10QChris: Add .gitreview [debs/openssl-ech] - 10https://gerrit.wikimedia.org/r/1135497 [21:00:14] (03CR) 10QChris: [V:03+2 C:03+2] Add .gitreview [debs/openssl-ech] - 10https://gerrit.wikimedia.org/r/1135497 (owner: 10QChris) [21:00:55] 06SRE-OnFire, 06Data-Persistence, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10727773 (10Scott_French) >>! In T390630#10727602, @Eevans wrote: > No hurry on this part (the important bit is done), but I... [21:02:57] (03PS2) 10Jforrester: [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) [21:03:25] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_codfw_test_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:01] (03PS3) 10Jforrester: [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) [21:05:43] (03CR) 10JHathaway: [C:03+2] community_civicrm: dovecot module for serving up local mail [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:05:49] (03CR) 10JHathaway: [C:03+2] community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:06:00] (03CR) 10JHathaway: [C:03+2] community_crm: Add trusted_host_patterns to settings template [puppet] - 
10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: 10Dwisehaupt) [21:06:05] (03CR) 10Jforrester: [C:03+2] [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [21:06:26] (03PS1) 10QChris: Add .gitreview [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135498 [21:06:26] (03CR) 10QChris: [V:03+2 C:03+2] Add .gitreview [debs/nginx-ech] - 10https://gerrit.wikimedia.org/r/1135498 (owner: 10QChris) [21:06:51] (03Merged) 10jenkins-bot: [test2wiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126660 (https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [21:07:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:25] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:53] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1126660|[test2wiki] Enable Wikifunctions client mode (T383106)]] [21:08:56] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [21:11:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:11:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-druid1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:13:47] 
RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:14:02] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:29] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1126660|[test2wiki] Enable Wikifunctions client mode (T383106)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:32] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [21:15:45] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:16:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3600 MB (3% inode=98%): /tmp 3600 MB (3% inode=98%): /var/tmp 3600 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:18:25] RESOLVED: [3x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:15] (03PS1) 10Jforrester: MWMultiVersion: Recognise the new wikifunctionsclient dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 [21:19:33] (03PS1) 10Andrew Bogott: wmcs-image-create: fix logic error when detecting the VM is ready [puppet] - 10https://gerrit.wikimedia.org/r/1135501 [21:19:35] !log jforrester@deploy1003 Sync cancelled. 
[21:19:50] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-druid - jclark@cumin1002" [21:19:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-druid - jclark@cumin1002" [21:19:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:04] (03CR) 10Andrew Bogott: [C:03+2] wmcs-image-create: fix logic error when detecting the VM is ready [puppet] - 10https://gerrit.wikimedia.org/r/1135501 (owner: 10Andrew Bogott) [21:20:04] (03CR) 10CI reject: [V:04-1] MWMultiVersion: Recognise the new wikifunctionsclient dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 (owner: 10Jforrester) [21:20:13] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1007 [21:20:19] (03PS2) 10Jforrester: MWMultiVersion: Recognise the new wikifunctionsclient dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 [21:20:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1007 [21:20:24] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-druid1006 [21:20:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-druid1006 [21:21:31] (03CR) 10Jforrester: [C:03+2] MWMultiVersion: Recognise the new wikifunctionsclient dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 (owner: 10Jforrester) [21:21:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 (owner: 10Jforrester) [21:22:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1006.mgmt.eqiad.wmnet 
with chassis set policy FORCE_RESTART [21:22:18] (03Merged) 10jenkins-bot: MWMultiVersion: Recognise the new wikifunctionsclient dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135500 (owner: 10Jforrester) [21:22:44] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1126660|[test2wiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1135500|MWMultiVersion: Recognise the new wikifunctionsclient dblist]] [21:22:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-druid1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:22:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [21:22:52] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [21:22:56] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [21:23:07] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [21:23:10] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10727871 (10Jclark-ctr) [21:24:02] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:15] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10727875 (10Jclark-ctr) 
a:05Jclark-ctr→03BTullis @btullis handing over to you for updating puppet repo. also to verify that 1006/1007 is ok for names. [21:24:47] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:29:32] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1126660|[test2wiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1135500|MWMultiVersion: Recognise the new wikifunctionsclient dblist]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:35] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [21:32:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:32:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:34:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:08] !log jforrester@deploy1003 jforrester: Continuing with sync [21:34:47] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:37:13] RESOLVED: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in codfw - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=codfw - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [21:37:23] FIRING: [2x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:15] PROBLEM - OpenSearch health check for shards on 9200 on relforge1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1117b0d1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [21:39:15] org/wiki/Search%23Administration [21:40:45] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1126660|[test2wiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1135500|MWMultiVersion: Recognise the new wikifunctionsclient dblist]] (duration: 18m 01s) [21:40:48] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [21:41:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:42:13] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:15] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 53, active_shards: 107, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [21:49:15] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:52:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:52:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:54:02] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:55:47] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T2200) [22:02:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2085-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:03:19] (03PS1) 10Scott French: Revert "scap: Use PHP 8.1 when executing maintenance scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) [22:04:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:05:16] (03CR) 10CI reject: [V:04-1] Revert "scap: Use PHP 8.1 when executing maintenance scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [22:05:47] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:06:51] (03PS2) 10Scott French: Revert "scap: Use PHP 8.1 when executing maintenance scripts" [puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) [22:07:13] FIRING: [4x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:08:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for druid1012/1013 - jclark@cumin1002" [22:08:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for druid1012/1013 - jclark@cumin1002" [22:08:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:09:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:09:56] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10728013 (10Jclark-ctr) [22:11:43] (03CR) 10Scott French: "FYI, since the override is no longer needed, I'll plan to merge this tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135509 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [22:14:02] FIRING: [3x] SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:07] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10728022 (10Jclark-ctr) a:05Jclark-ctr→03BTullis @BTullis please update puppet so these can be imaged Thanks John [22:16:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10728029 (10Jclark-ctr) relforge1010 is still pending RMA with supermicro. [22:20:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host druid1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:55] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391520#10728034 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [22:24:26] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391476#10728042 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [22:26:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10728044 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [22:29:50] 10ops-eqiad, 06SRE, 06DC-Ops: 
Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10728060 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [22:32:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:36:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3474 MB (3% inode=98%): /tmp 3474 MB (3% inode=98%): /var/tmp 3474 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:41:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728078 (10phaultfinder) [22:46:59] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:47:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:47:28] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release [22:47:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:01] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 109643 bytes in 2.131 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:48:26] this was a gitlab version upgrade for security [22:48:45] we should not get alerts because this is a cookbook [22:49:11] but here we are. and it's up after that short downtime [22:50:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:51:07] (03PS1) 10Dwisehaupt: Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) [22:52:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:11] (03CR) 10CI reject: [V:04-1] Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:53:34] !log apt-staging2001 - sudo 
systemctl start gitlab-package-puller to fix monitoring alert [22:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10728104 (10phaultfinder) [22:55:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:25] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: relocate (3) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539 (10RobH) 03NEW p:05Triage→03High [22:59:18] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: relocate (3) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10728144 (10RobH) [22:59:18] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10728143 (10RobH) [23:00:16] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: relocate (3) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10728146 (10RobH) a:03Gehel @gehel, I think you'd be the person to triage this within #data-platform-sre and assign a point of contact for feedback on these... 
[23:02:03] (PS2) Awight: Temporarily revoke ssh key for travel [puppet] - https://gerrit.wikimedia.org/r/1135481
[23:02:20] (CR) Ladsgroup: [C:+2] Temporarily revoke ssh key for travel [puppet] - https://gerrit.wikimedia.org/r/1135481 (owner: Awight)
[23:02:22] (CR) Ladsgroup: [V:+2 C:+2] Temporarily revoke ssh key for travel [puppet] - https://gerrit.wikimedia.org/r/1135481 (owner: Awight)
[23:03:08] (PS1) Jforrester: Commit result of scap update-interwiki-cache --beta as of 2025-04-09Z23:00 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135514
[23:04:32] (CR) Jforrester: "This seems wrong?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135514 (owner: Jforrester)
[23:06:16] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540 (RobH) NEW
[23:06:21] (CR) Dzahn: "the experimental build took 3 hours and finished after the change was already merged, heh. it's because it tried to compile this on every " [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[23:07:02] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728171 (RobH) a:KOfori @kofori, I think you'd be the person to triage this within #data-persistence and assign a point of contact for feedback on these...
[23:07:47] ops-eqiad, SRE, DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10728177 (RobH)
[23:11:06] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728193 (Ladsgroup)
[23:13:40] (CR) Reedy: "Looks a lot like the prod one..." [mediawiki-config] - https://gerrit.wikimedia.org/r/1135514 (owner: Jforrester)
[23:15:42] ops-eqiad, SRE, Data-Persistence, DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10728198 (Ladsgroup) `db1184` is a bit sensitive as it's the candidate master of English Wikipedia and might become master at any point. The rest can be moved...
[23:15:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:17:24] (CR) Jforrester: "Yeah, the comment at the top not being updated for over a year suggests real drift, but possibly scap is broken?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135514 (owner: Jforrester)
[23:17:28] (Abandoned) Jforrester: Commit result of scap update-interwiki-cache --beta as of 2025-04-09Z23:00 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135514 (owner: Jforrester)
[23:19:30] ops-eqiad, SRE, DC-Ops, Discovery-Search: relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542 (RobH) NEW
[23:20:32] ops-eqiad, SRE, DC-Ops, Discovery-Search: relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10728211 (RobH) a:Gehel @gehel, I think you'd be the person to triage this within #discovery-search and assign a point of contact for feedback on th...
[23:21:57] ops-eqiad, SRE, DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10728220 (RobH)
[23:23:40] (CR) Ladsgroup: alertmanager: add task receivers for 4 teams (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[23:25:45] (PS2) Ladsgroup: [WIP] MetaContactPages: Add affcom conflict reporting page [mediawiki-config] - https://gerrit.wikimedia.org/r/1127958 (https://phabricator.wikimedia.org/T388919)
[23:40:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135520
[23:40:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135520 (owner: TrainBranchBot)
[23:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:43:25] (PS1) Reedy: interwiki-labs.php: Update [mediawiki-config] - https://gerrit.wikimedia.org/r/1135521
[23:43:36] James_F: ^ I'm guessing scap is doing something funky
[23:43:44] that's running dumpInterwiki.php manually against labs
[23:45:27] (PS1) Creynolds: dumps enterprise copy update [puppet] - https://gerrit.wikimedia.org/r/1135522
[23:45:50] (CR) CI reject: [V:-1] dumps enterprise copy update [puppet] - https://gerrit.wikimedia.org/r/1135522 (owner: Creynolds)
[23:46:05] jouncebot: nowandnext
[23:46:05] No deployments scheduled for the next 6 hour(s) and 13 minute(s)
[23:46:06] In 6 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T0600)
[23:46:06] In 6 hour(s) and 13 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250410T0600)
[23:50:53] (PS3) Ladsgroup: MetaContactPages: Add affcom conflict reporting page [mediawiki-config] - https://gerrit.wikimedia.org/r/1127958 (https://phabricator.wikimedia.org/T388919)
[23:51:18] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135520 (owner: TrainBranchBot)
[23:56:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3042 MB (3% inode=98%): /tmp 3042 MB (3% inode=98%): /var/tmp 3042 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops