[00:05:42] (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544 [00:39:27] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544 (owner: TrainBranchBot) [01:00:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544 (owner: TrainBranchBot) [01:01:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T357189)', diff saved to https://phabricator.wikimedia.org/P57857 and previous config saved to /var/cache/conftool/dbconfig/20240224-010152-arnaudb.json [01:01:59] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [01:10:03] (CR) BCornwall: [C: +1] haproxy: configure extended logging (preparatory for Benthos) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur) [01:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:16:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P57858 and previous config saved to /var/cache/conftool/dbconfig/20240224-011658-arnaudb.json [01:32:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P57859 and 
previous config saved to /var/cache/conftool/dbconfig/20240224-013205-arnaudb.json [01:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [01:47:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T357189)', diff saved to https://phabricator.wikimedia.org/P57860 and previous config saved to /var/cache/conftool/dbconfig/20240224-014711-arnaudb.json [01:47:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [01:47:19] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [01:47:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [01:47:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T357189)', diff saved to https://phabricator.wikimedia.org/P57861 and previous config saved to /var/cache/conftool/dbconfig/20240224-014734-arnaudb.json [01:47:56] !log Upload ncmonitor 0.0.3 to bookworm-wikimedia [01:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:42] (SystemdUnitFailed) resolved: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:41] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 
(T357189)', diff saved to https://phabricator.wikimedia.org/P57862 and previous config saved to /var/cache/conftool/dbconfig/20240224-024722-arnaudb.json [02:47:29] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [03:02:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P57863 and previous config saved to /var/cache/conftool/dbconfig/20240224-030228-arnaudb.json [03:13:41] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:17:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P57864 and previous config saved to /var/cache/conftool/dbconfig/20240224-031735-arnaudb.json [03:32:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T357189)', diff saved to https://phabricator.wikimedia.org/P57865 and previous config saved to /var/cache/conftool/dbconfig/20240224-033241-arnaudb.json [03:32:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [03:32:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [03:32:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [03:33:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57866 and previous config saved to /var/cache/conftool/dbconfig/20240224-033304-arnaudb.json [03:48:42] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57867 and previous config saved to /var/cache/conftool/dbconfig/20240224-043801-arnaudb.json [04:38:09] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [04:53:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P57868 and previous config saved to /var/cache/conftool/dbconfig/20240224-045307-arnaudb.json [05:08:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P57869 and previous config saved to /var/cache/conftool/dbconfig/20240224-050814-arnaudb.json [05:08:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574233 (phaultfinder) [05:23:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57870 and previous config saved to /var/cache/conftool/dbconfig/20240224-052320-arnaudb.json [05:23:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:23:27] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [05:23:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [06:17:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [06:17:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [06:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:12:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:12:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:12:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57871 and previous config saved to /var/cache/conftool/dbconfig/20240224-071221-arnaudb.json [07:12:28] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [07:48:42] (SystemdUnitFailed) firing: generate_os_reports.service on 
puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57872 and previous config saved to /var/cache/conftool/dbconfig/20240224-081631-arnaudb.json [08:31:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P57873 and previous config saved to /var/cache/conftool/dbconfig/20240224-083138-arnaudb.json [08:46:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P57874 and previous config saved to /var/cache/conftool/dbconfig/20240224-084644-arnaudb.json [09:01:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57875 and previous config saved to /var/cache/conftool/dbconfig/20240224-090150-arnaudb.json [09:01:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:01:57] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:02:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:02:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57876 and previous config saved to /var/cache/conftool/dbconfig/20240224-090212-arnaudb.json [10:00:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 132 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:05:12] PROBLEM - Host db2118 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:06:49] PROBLEM - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:52] PROBLEM - MariaDB Replica IO: s7 #page on db2159 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:06:55] PROBLEM - MariaDB Replica IO: s7 on db1181 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:10] PROBLEM - MariaDB Replica IO: s7 #page on db2150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:11] PROBLEM - MariaDB Replica IO: s7 #page on db2182 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:14] PROBLEM - MariaDB Replica IO: s7 #page on db2168 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:15] PROBLEM - MariaDB Replica IO: s7 #page on db2120 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:15] (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:07:16] PROBLEM - MariaDB Replica IO: s7 #page on db2122 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:17] PROBLEM - MariaDB Replica IO: s7 #page on db2108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:07:19] PROBLEM - MariaDB Replica IO: s7 #page on db2121 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:08:12] hello [10:08:23] taavi: that's s7 master [10:08:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57877 and previous config saved to /var/cache/conftool/dbconfig/20240224-100832-arnaudb.json [10:08:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:10:44] arnaudb: any chance this is related to your change? [10:10:59] !log powercycle db2118 [10:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:05] Description: CPU 1 machine check error detected. 
[10:11:32] sobanski: I doubt an automated schema change caused a host to reboot [10:12:12] Also not sure if arnaudb is actually around to reply anyway, they run nearly completely automated now and have for a while I think [10:12:15] (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:13:48] PROBLEM - MariaDB Replica Lag: s7 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:13:56] PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:01] PROBLEM - MariaDB Replica Lag: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:01] PROBLEM - MariaDB Replica Lag: s7 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:02] PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:04] PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:04] 2118 booting now [10:14:15] RECOVERY - Host db2118 #page is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [10:14:15] PROBLEM - MariaDB Replica Lag: s7 on db1181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.09 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:21] PROBLEM - MariaDB Replica Lag: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:23] PROBLEM - MariaDB Replica Lag: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:25] PROBLEM - MariaDB Replica Lag: s7 #page on db2122 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:26] PROBLEM - MariaDB Replica Lag: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:27] PROBLEM - MariaDB Replica Lag: s7 #page on db2120 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 666.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:29] PROBLEM - MariaDB Replica Lag: s7 #page on db2121 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:29] although a CPU error sounds like we want to failover s7 asap [10:14:29] taavi: is it worth calling DBAs to consider an emergency failover? 
[10:14:30] PROBLEM - MariaDB Replica Lag: s7 #page on db2108 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:15:31] Even just with a reboot there is a question of data consistency [10:15:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 82 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:16:06] Hi. Anything I can help with? [10:16:09] PROBLEM - mysqld processes #page on db2118 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:16:10] PROBLEM - MariaDB read only s7 #page on db2118 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:16:17] I’ll give Manuel and Amir a ring [10:16:21] thanks sobanski [10:16:30] eoghan: if you could acknowledge the alerts [10:16:40] the host is back up, mariadb does not start automatically and I'm not doing that before someone tells me that's safe [10:16:44] I’m in a car and on my phone only [10:16:53] On it [10:18:01] and if someone with access could create a statuspage update [10:18:23] SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574314 (taavi) p:Triage→Unbreak! 
[10:18:32] m.arostegui will be online in a while [10:18:55] taavi: could someone update topic in here & -tech too [10:19:35] hi [10:19:38] Unfortunately I don't have access to statuspage (I've made a note to sort that on Monday) [10:19:38] hello [10:19:39] Please someone create a task [10:19:43] https://phabricator.wikimedia.org/T358421 [10:19:54] db2118 (s7 master) crashed due to a CPU error it seems [10:20:04] yep [10:20:06] I will get that fix [10:20:07] I rebooted it, it's now back up but I'm not starting mariadb unless someone tells me it's safe [10:20:07] fixed [10:21:13] RECOVERY - mysqld processes #page on db2118 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:21:15] my god you guys are so quick! thanks <3 [10:21:27] Please someone ACK all the alerts [10:21:28] (PS1) Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1006108 (https://phabricator.wikimedia.org/T358423) [10:21:32] (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1006109 (https://phabricator.wikimedia.org/T358423) [10:21:33] RECOVERY - MariaDB Replica IO: s7 #page on db2150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:34] That was quick. Is it safe to just restart mariadb? 
[10:21:35] RECOVERY - MariaDB Replica IO: s7 #page on db2122 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:36] RECOVERY - MariaDB Replica IO: s7 #page on db2168 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:36] I do it [10:21:39] RECOVERY - MariaDB Replica IO: s7 #page on db2108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:40] RECOVERY - MariaDB Replica IO: s7 #page on db2121 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:21:44] marostegui: Eoghan is on alerts [10:21:45] slyngs: it's not usually [10:21:55] Amir1: go away [10:21:58] incidents all acked [10:21:58] Amir1: Noted :-) [10:21:58] SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574324 (Marostegui) Started it - InnoDB doing recovery, leaving it on RO. 
Once it's caught up I am switching it [10:22:14] RECOVERY - MariaDB Replica IO: s7 on db2100 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:17] RECOVERY - MariaDB Replica IO: s7 #page on db2159 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:17] RECOVERY - MariaDB Replica IO: s7 on db1181 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:25] +1 to Amir going away [10:22:34] okay [10:22:39] RECOVERY - MariaDB Replica IO: s7 #page on db2182 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:41] RECOVERY - MariaDB Replica IO: s7 #page on db2120 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:42] See you later <3 [10:22:55] Really feeling the love in here this morning. 
[10:23:18] RECOVERY - MariaDB Replica Lag: s7 on db2100 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:21] RECOVERY - MariaDB Replica Lag: s7 #page on db2159 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:21] RECOVERY - MariaDB Replica Lag: s7 on db1171 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:21] RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:22] RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:22] I'm also going to vanish, if that's OK? 
I'm meant to be on the bike in 5 mins [10:23:26] Emperor: bye [10:23:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T358423 [10:23:34] <3 [10:23:36] RECOVERY - MariaDB Replica Lag: s7 on db1181 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:36] T358423: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T358423 [10:23:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P57878 and previous config saved to /var/cache/conftool/dbconfig/20240224-102338-arnaudb.json [10:23:43] RECOVERY - MariaDB Replica Lag: s7 #page on db2182 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:44] RECOVERY - MariaDB Replica Lag: s7 #page on db2150 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:47] RECOVERY - MariaDB Replica Lag: s7 #page on db2122 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:49] RECOVERY - MariaDB Replica Lag: s7 #page on db2168 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:51] RECOVERY - MariaDB Replica Lag: s7 #page on db2120 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:51] SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574348 (RhinosF1) [10:23:52] RECOVERY - MariaDB Replica Lag: s7 #page on db2121 is OK: OK slave_sql_lag Replication lag: 0.46 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:53] RECOVERY - MariaDB Replica Lag: s7 #page on db2108 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:24:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2121 with weight 0 T358423', diff saved to https://phabricator.wikimedia.org/P57879 and previous config saved to /var/cache/conftool/dbconfig/20240224-102401-root.json [10:24:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T358423 [10:24:19] RECOVERY - MariaDB Replica Lag: s7 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:24:41] (CR) Marostegui: [C: +2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1006108 (https://phabricator.wikimedia.org/T358423) (owner: Gerrit maintenance bot) [10:25:34] SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574356 (Marostegui) Even though mariadb is up, it is all in RO. I don't want to risk the data. [10:27:16] (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:28:18] SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574361 (Marostegui) ` ------------------------------------------------------------------------------- Record: 26 Date/Time: 02/24/2024 10:08:18 Source: system Severity: Ok Description: A problem w... 
[10:30:08] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574362 (Marostegui) @wiki_willy can we contact the vendor about this issue which caused a reboot? ` Record: 27 Date/Time: 02/24/2024 10:08:18 Source: system Seve... [10:30:14] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574365 (Marostegui) [10:30:20] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574314 (Marostegui) [10:32:17] marostegui: is there anything else that we can help with? [10:32:45] sobanski: Not at the moment no [10:33:28] ACK [10:38:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P57880 and previous config saved to /var/cache/conftool/dbconfig/20240224-103845-arnaudb.json [10:39:41] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 105 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:44:26] !log Starting s7 codfw emergency failover from db2118 to db2121 - T358423 [10:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:32] T358423: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T358423 [10:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 codfw as read-only for maintenance - T358423', diff saved to https://phabricator.wikimedia.org/P57881 and previous config saved to /var/cache/conftool/dbconfig/20240224-104440-marostegui.json [10:44:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 81 probes of 736 (alerts on 90) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:45:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2121 to s7 primary and set section read-write T358423', diff saved to https://phabricator.wikimedia.org/P57882 and previous config saved to /var/cache/conftool/dbconfig/20240224-104522-marostegui.json [10:45:34] everything should be back to normal [10:46:17] yep, I see edits flowing again [10:46:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2118 T358423', diff saved to https://phabricator.wikimedia.org/P57883 and previous config saved to /var/cache/conftool/dbconfig/20240224-104617-root.json [10:46:46] Nice work marostegui [10:48:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2121 from API', diff saved to https://phabricator.wikimedia.org/P57884 and previous config saved to /var/cache/conftool/dbconfig/20240224-104824-marostegui.json [10:48:38] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574382 (Marostegui) p:Unbreak!→High [10:48:51] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574314 (Marostegui) Everything should be back to normal now. 
[10:49:12] (CR) Marostegui: [C: +2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1006109 (https://phabricator.wikimedia.org/T358423) (owner: Gerrit maintenance bot) [10:49:50] RECOVERY - MariaDB read only s7 #page on db2118 is OK: Version 10.4.25-MariaDB-log, Uptime 1754s, read_only: True, event_scheduler: True, 114.30 QPS, connection latency: 0.005950s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:51:38] SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574388 (Marostegui) [10:52:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:52:22] (PS1) Marostegui: db2118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006129 (https://phabricator.wikimedia.org/T358423) [10:53:37] (CR) Marostegui: [C: +2] db2118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006129 (https://phabricator.wikimedia.org/T358423) (owner: Marostegui) [10:53:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57885 and previous config saved to /var/cache/conftool/dbconfig/20240224-105351-arnaudb.json [10:53:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:53:58] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:54:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [10:54:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T357189)', diff saved to 
https://phabricator.wikimedia.org/P57886 and previous config saved to /var/cache/conftool/dbconfig/20240224-105413-arnaudb.json [10:56:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 95 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:06:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 85 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:08:49] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574405 (phaultfinder) [11:33:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:40:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:40:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:48:42] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:45] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 84 
probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:01:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T357189)', diff saved to https://phabricator.wikimedia.org/P57887 and previous config saved to /var/cache/conftool/dbconfig/20240224-120150-arnaudb.json [12:02:03] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:05:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2196.codfw.wmnet with OS bookworm [12:05:59] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9574417 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm executed with errors: - db2196 (**... 
[12:07:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 109 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:16:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57888 and previous config saved to /var/cache/conftool/dbconfig/20240224-121657-arnaudb.json [12:32:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57889 and previous config saved to /var/cache/conftool/dbconfig/20240224-123203-arnaudb.json [12:45:04] SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#9574453 (cmooney) I note that a current draft in the IETF DNSOPS Working Group, aimed to replace RFC3901, //[[ https://datatracker.ietf.org/doc/html/draft-momoka-dnsop-3901bis-03#name-guidelines-for-dns-zone-con | d... 
[12:47:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T357189)', diff saved to https://phabricator.wikimedia.org/P57890 and previous config saved to /var/cache/conftool/dbconfig/20240224-124709-arnaudb.json [12:47:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [12:47:20] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:47:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [12:47:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57891 and previous config saved to /var/cache/conftool/dbconfig/20240224-124741-arnaudb.json [13:14:20] SRE, Wikimedia-Mailing-lists: Set up mailing list for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9574467 (Sidishandsome) [13:38:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2420.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2029.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2055.codfw.wmnet, kuber [13:38:33] .codfw.wmnet, mw2313.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, mw2353.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, mw2440.codfw.wmnet, mw2293.codfw.wmnet, mw2444.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2380.codfw.wmnet, mw2292.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2296.codfw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet, [13:38:33] odfw.wmnet, kubernetes2019.codfw.wmnet, 
kubernetes2005.codfw.wmnet, mw2366.codfw.wmnet, mw2282.codfw.wmnet, mw2318.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet, mw2395.co https://wikitech.wikimedia.org/wiki/PyBal [13:38:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2424.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2294.codfw.wmnet, mw2312.codfw.wmnet, mw2296.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, mw2369.codfw.wmnet, mw2437.codfw.wmnet, mw2381.codfw.wmnet, kubernetes2047.co [13:38:35] , mw2435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2335.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2019.codfw.wmnet, mw2351.codfw.wmnet, mw2384.codfw.wmnet, mw2366.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2425.codfw.wmnet, kubernetes2037.codfw.wmnet, mw2282.codfw.wmnet, mw2318.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet, mw2448.codfw.wmnet, kubernetes2060.codfw.wmnet, kubernetes2058.codfw.wm [13:38:35] rnetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2379.codfw.wmnet, mw2353.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmne https://wikitech.wikimedia.org/wiki/PyBal [13:40:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:40:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:42:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 87 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:49:53] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL 
- failed 124 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:02:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes2046.codfw.wmnet, mw2422.codfw.wmnet, mw2378.codfw.wmnet, mw2294.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes202 [14:02:45] mnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2318.codfw.wmnet, mw2434.codfw.wmnet, mw2353.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2394.codfw.wmnet, mw2368.codfw.wmnet, mw2356.codfw.wmnet, mw2429.codfw.wmnet, mw2440.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2036.codfw.wmnet, mw2444.codfw.wmnet, mw2450.codfw.wmnet, kubernetes2043. [14:02:45] et, mw2380.codfw.wmnet, mw2301.codfw.wmnet, mw2292.codfw.wmnet, mw2442.codfw.wmnet, kubernetes2017.codfw.wmnet, mw2296.codfw.wmnet, mw2423.codfw.wmnet, mw2445.codfw.wmnet, mw2335.codfw. 
https://wikitech.wikimedia.org/wiki/PyBal [14:02:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, mw2426.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2421.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2445.codfw.wmnet, mw2381.codfw.wmnet, kubernetes2018.co [14:02:49] , mw2431.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, mw2366.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2425.codfw.wmnet, kubernetes2037.codfw.wmnet, kubernetes2041.codfw.wmnet, mw2448.codfw.wmnet, mw2318.codfw.wmnet, mw2354.codfw.wmnet, mw2395.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2350.codfw.wmnet, kubernetes2045.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039. [14:02:49] et, mw2428.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2379.codfw.wmnet, kubernetes2009.codfw.wmnet, mw2436.codfw.wmnet, mw2310.codfw.wmnet, kubernetes2038.co https://wikitech.wikimedia.org/wiki/PyBal [14:04:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:04:49] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:26:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57892 and previous config saved to /var/cache/conftool/dbconfig/20240224-142653-arnaudb.json [14:27:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:33:51] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574506 (phaultfinder) [14:38:42] (JobUnavailable) firing: Reduced 
availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57893 and previous config saved to /var/cache/conftool/dbconfig/20240224-144200-arnaudb.json [14:57:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57894 and previous config saved to /var/cache/conftool/dbconfig/20240224-145706-arnaudb.json [14:58:17] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:58:42] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:01:21] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:02:49] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/6 UP : 7 v2 P2P interfaces vs. 
6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:03:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:05:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57895 and previous config saved to /var/cache/conftool/dbconfig/20240224-151212-arnaudb.json [15:12:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:12:19] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:12:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:12:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57896 and previous config saved to /var/cache/conftool/dbconfig/20240224-151234-arnaudb.json [15:48:42] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:54:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:11:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57897 and previous config saved to /var/cache/conftool/dbconfig/20240224-161117-arnaudb.json [16:11:23] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:26:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P57898 and previous config saved to /var/cache/conftool/dbconfig/20240224-162623-arnaudb.json [16:28:51] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574542 (phaultfinder) [16:41:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P57899 and previous config saved to /var/cache/conftool/dbconfig/20240224-164129-arnaudb.json [16:56:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57900 and previous config saved to /var/cache/conftool/dbconfig/20240224-165636-arnaudb.json [16:56:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:56:42] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:56:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:49:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:49:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:49:42] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57901 and previous config saved to /var/cache/conftool/dbconfig/20240224-174941-arnaudb.json [17:49:47] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:54:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 74 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:51:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57902 and previous config saved to /var/cache/conftool/dbconfig/20240224-185132-arnaudb.json [18:51:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:06:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P57903 and previous config saved to /var/cache/conftool/dbconfig/20240224-190638-arnaudb.json [19:21:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P57904 and previous config saved to /var/cache/conftool/dbconfig/20240224-192144-arnaudb.json [19:36:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57905 and previous config saved to /var/cache/conftool/dbconfig/20240224-193651-arnaudb.json [19:36:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [19:36:58] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:37:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [19:37:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57906 and previous config saved to /var/cache/conftool/dbconfig/20240224-193712-arnaudb.json [19:48:42] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:38:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57907 and previous config saved to /var/cache/conftool/dbconfig/20240224-203816-arnaudb.json [20:38:23] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:53:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57908 and previous config saved to /var/cache/conftool/dbconfig/20240224-205323-arnaudb.json [21:08:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57909 and previous config saved to /var/cache/conftool/dbconfig/20240224-210830-arnaudb.json [21:23:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57910 and previous config saved to /var/cache/conftool/dbconfig/20240224-212336-arnaudb.json [21:23:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [21:23:44] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:23:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 
day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [21:23:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:24:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:24:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57911 and previous config saved to /var/cache/conftool/dbconfig/20240224-212414-arnaudb.json [22:23:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57912 and previous config saved to /var/cache/conftool/dbconfig/20240224-222331-arnaudb.json [22:23:39] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:38:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P57913 and previous config saved to /var/cache/conftool/dbconfig/20240224-223837-arnaudb.json [22:53:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P57914 and previous config saved to /var/cache/conftool/dbconfig/20240224-225343-arnaudb.json [23:08:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57915 and previous config saved to /var/cache/conftool/dbconfig/20240224-230850-arnaudb.json [23:08:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:08:56] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:09:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
0:00:00 on db2172.codfw.wmnet with reason: Maintenance [23:09:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T357189)', diff saved to https://phabricator.wikimedia.org/P57916 and previous config saved to /var/cache/conftool/dbconfig/20240224-230912-arnaudb.json [23:48:43] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed