[00:05:42] <jinxer-wm>	 (SystemdUnitFailed) firing: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544
[00:39:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544 (owner: TrainBranchBot)
[01:00:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1005544 (owner: TrainBranchBot)
[01:01:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T357189)', diff saved to https://phabricator.wikimedia.org/P57857 and previous config saved to /var/cache/conftool/dbconfig/20240224-010152-arnaudb.json
[01:01:59] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:10:03] <wikibugs>	 (CR) BCornwall: [C: +1] haproxy: configure extended logging (preparatory for Benthos) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[01:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:16:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P57858 and previous config saved to /var/cache/conftool/dbconfig/20240224-011658-arnaudb.json
[01:32:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P57859 and previous config saved to /var/cache/conftool/dbconfig/20240224-013205-arnaudb.json
[01:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:47:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T357189)', diff saved to https://phabricator.wikimedia.org/P57860 and previous config saved to /var/cache/conftool/dbconfig/20240224-014711-arnaudb.json
[01:47:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[01:47:19] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[01:47:28] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[01:47:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T357189)', diff saved to https://phabricator.wikimedia.org/P57861 and previous config saved to /var/cache/conftool/dbconfig/20240224-014734-arnaudb.json
[01:47:56] <brett>	 !log Upload ncmonitor 0.0.3 to bookworm-wikimedia
[01:48:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:41] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T357189)', diff saved to https://phabricator.wikimedia.org/P57862 and previous config saved to /var/cache/conftool/dbconfig/20240224-024722-arnaudb.json
[02:47:29] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[03:02:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P57863 and previous config saved to /var/cache/conftool/dbconfig/20240224-030228-arnaudb.json
[03:13:41] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:17:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P57864 and previous config saved to /var/cache/conftool/dbconfig/20240224-031735-arnaudb.json
[03:32:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T357189)', diff saved to https://phabricator.wikimedia.org/P57865 and previous config saved to /var/cache/conftool/dbconfig/20240224-033241-arnaudb.json
[03:32:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[03:32:48] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[03:32:57] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[03:33:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57866 and previous config saved to /var/cache/conftool/dbconfig/20240224-033304-arnaudb.json
[03:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57867 and previous config saved to /var/cache/conftool/dbconfig/20240224-043801-arnaudb.json
[04:38:09] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[04:53:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P57868 and previous config saved to /var/cache/conftool/dbconfig/20240224-045307-arnaudb.json
[05:08:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P57869 and previous config saved to /var/cache/conftool/dbconfig/20240224-050814-arnaudb.json
[05:08:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574233 (phaultfinder)
[05:23:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T357189)', diff saved to https://phabricator.wikimedia.org/P57870 and previous config saved to /var/cache/conftool/dbconfig/20240224-052320-arnaudb.json
[05:23:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[05:23:27] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[05:23:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:17:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[06:17:17] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[06:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:12:02] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[07:12:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[07:12:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57871 and previous config saved to /var/cache/conftool/dbconfig/20240224-071221-arnaudb.json
[07:12:28] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[07:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57872 and previous config saved to /var/cache/conftool/dbconfig/20240224-081631-arnaudb.json
[08:31:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P57873 and previous config saved to /var/cache/conftool/dbconfig/20240224-083138-arnaudb.json
[08:46:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P57874 and previous config saved to /var/cache/conftool/dbconfig/20240224-084644-arnaudb.json
[09:01:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T357189)', diff saved to https://phabricator.wikimedia.org/P57875 and previous config saved to /var/cache/conftool/dbconfig/20240224-090150-arnaudb.json
[09:01:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:01:57] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:02:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance
[09:02:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57876 and previous config saved to /var/cache/conftool/dbconfig/20240224-090212-arnaudb.json
[10:00:31] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 132 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:05:12] <icinga-wm_>	 PROBLEM - Host db2118 #page is DOWN: PING CRITICAL - Packet loss = 100%
[10:06:49] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 on db2100 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:52] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2159 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:06:55] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 on db1181 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:10] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2150 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:11] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2182 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:14] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2168 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:15] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2120 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:07:16] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2122 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:17] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:07:19] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s7 #page on db2121 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2118.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2118.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:08:12] <taavi>	 hello
[10:08:23] <RhinosF1>	 taavi: that's s7 master
[10:08:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57877 and previous config saved to /var/cache/conftool/dbconfig/20240224-100832-arnaudb.json
[10:08:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:10:44] <sobanski>	 arnaudb: any chance this is related to your change?
[10:10:59] <taavi>	 !log powercycle db2118
[10:11:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:05] <taavi>	 Description: CPU 1 machine check error detected.
[10:11:32] <RhinosF1>	 sobanski: I doubt an automated schema change caused a host to reboot
[10:12:12] <RhinosF1>	 Also not sure if arnaudb is actually around to reply anyway, they run nearly completely automated now and have for a while I think
[10:12:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:13:48] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on dbstore1008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:13:56] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:01] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:01] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:02] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:04] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 645.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:04] <taavi>	 2118 booting now
[10:14:15] <icinga-wm_>	 RECOVERY - Host db2118 #page is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[10:14:15] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 on db1181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:21] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2182 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:23] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:25] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2122 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.73 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:26] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2168 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:27] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2120 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 666.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:29] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2121 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:29] <taavi>	 although a CPU error sounds like we want to failover s7 asap
[10:14:29] <RhinosF1>	 taavi: is it worth calling DBAs to consider an emergency failover?
[10:14:30] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s7 #page on db2108 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 670.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:15:31] <sobanski>	 Even just with a reboot, there is a question of data consistency
[10:15:32] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 82 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:16:06] <eoghan>	 Hi. Anything I can help with? 
[10:16:09] <icinga-wm_>	 PROBLEM - mysqld processes #page on db2118 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:16:10] <icinga-wm_>	 PROBLEM - MariaDB read only s7 #page on db2118 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:16:17] <sobanski>	 I’ll give Manuel and Amir a ring
[10:16:21] <taavi>	 thanks sobanski 
[10:16:30] <sobanski>	 eoghan: if you could acknowledge the alerts
[10:16:40] <taavi>	 the host is back up, mariadb does not start automatically and I'm not doing that before someone tells me that's safe
[10:16:44] <sobanski>	 I’m in a car and on my phone only
[10:16:53] <eoghan>	 On it
[10:18:01] <taavi>	 and if someone with access could create a statuspage update
[10:18:23] <wikibugs>	 SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574314 (taavi) p:Triage→Unbreak!
[10:18:32] <sobanski>	 m.arostegui will be online in a while
[10:18:55] <RhinosF1>	 taavi: could someone update topic in here & -tech too
[10:19:35] <marostegui>	 hi
[10:19:38] <eoghan>	 Unfortunately I don't have access to statuspage (I've made a note to sort that on Monday)
[10:19:38] <taavi>	 hello
[10:19:39] <marostegui>	 Please someone create a task
[10:19:43] <taavi>	 https://phabricator.wikimedia.org/T358421
[10:19:54] <taavi>	 db2118 (s7 master) crashed due to a CPU error it seems
[10:20:04] <marostegui>	 yep
[10:20:06] <marostegui>	 I will get that fix
[10:20:07] <taavi>	 I rebooted it, it's now back up but I'm not starting mariadb unless someone tells me it's safe
[10:20:07] <marostegui>	 fixed
[10:21:13] <icinga-wm_>	 RECOVERY - mysqld processes #page on db2118 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:21:15] <arnaudb>	 my god you guys are so quick! thanks <3
[10:21:27] <marostegui>	 Please someone ACK all the alerts
[10:21:28] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1006108 (https://phabricator.wikimedia.org/T358423)
[10:21:32] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1006109 (https://phabricator.wikimedia.org/T358423)
[10:21:33] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2150 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:34] <slyngs>	 That was quick. Is it safe to just restart mariadb?
[10:21:35] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2122 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:36] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2168 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:36] <Amir1>	 I'll do it
[10:21:39] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:40] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2121 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:21:44] <sobanski>	 marostegui: Eoghan is on alerts
[10:21:45] <Amir1>	 slyngs: it's not usually
[10:21:55] <marostegui>	 Amir1: go away
[10:21:58] <Emperor>	 incidents all acked
[10:21:58] <slyngs>	 Amir1: Noted :-)
[10:21:58] <wikibugs>	 SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574324 (Marostegui) Started it - InnoDB doing recovery, leaving it on RO. Once it's caught up I am switching it
[10:22:14] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 on db2100 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:22:17] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2159 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:22:17] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 on db1181 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:22:25] <sobanski>	 +1 to Amir going away
[10:22:34] <Amir1>	 okay
[10:22:39] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2182 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:22:41] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s7 #page on db2120 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:22:42] <Amir1>	 See you later <3
[10:22:55] <eoghan>	 Really feeling the love in here this morning.
[10:23:18] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on db2100 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:21] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2159 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:21] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on db1171 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:21] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:22] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:22] <Emperor>	 I'm also going to vanish, if that's OK? I'm meant to be on the bike in 5 mins
[10:23:26] <marostegui>	 Emperor: bye
[10:23:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T358423
[10:23:34] <Emperor>	 <3
[10:23:36] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on db1181 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:36] <stashbot>	 T358423: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T358423
[10:23:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P57878 and previous config saved to /var/cache/conftool/dbconfig/20240224-102338-arnaudb.json
[10:23:43] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2182 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:44] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2150 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:47] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2122 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:49] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2168 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:51] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2120 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:51] <wikibugs>	 SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574348 (RhinosF1)
[10:23:52] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2121 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:53] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 #page on db2108 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:24:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2121 with weight 0 T358423', diff saved to https://phabricator.wikimedia.org/P57879 and previous config saved to /var/cache/conftool/dbconfig/20240224-102401-root.json
[10:24:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T358423
[10:24:19] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s7 on dbstore1008 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:24:41] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1006108 (https://phabricator.wikimedia.org/T358423) (owner: Gerrit maintenance bot)
[10:25:34] <wikibugs>	 SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574356 (Marostegui) Even though mariadb is up, it is all in RO. I don't want to risk the data.
[10:27:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (10) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:28:18] <wikibugs>	 SRE, DBA, Wikimedia-Incident: db2118 crashed - https://phabricator.wikimedia.org/T358421#9574361 (Marostegui) ` ------------------------------------------------------------------------------- Record:      26 Date/Time:   02/24/2024 10:08:18 Source:      system Severity:    Ok Description: A problem w...
[10:30:08] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574362 (Marostegui) @wiki_willy can we contact the vendor about this issue which caused a reboot? ` Record:      27 Date/Time:   02/24/2024 10:08:18 Source:      system Seve...
[10:30:14] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574365 (Marostegui)
[10:30:20] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574314 (Marostegui)
[10:32:17] <sobanski>	 marostegui: is there anything else that we can help with?
[10:32:45] <marostegui>	 sobanski: Not at the moment no
[10:33:28] <sobanski>	 ACK
[10:38:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P57880 and previous config saved to /var/cache/conftool/dbconfig/20240224-103845-arnaudb.json
[10:39:41] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 105 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:44:26] <marostegui>	 !log Starting s7 codfw emergency failover from db2118 to db2121 - T358423
[10:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:32] <stashbot>	 T358423: Switchover s7 master (db2118 -> db2121) - https://phabricator.wikimedia.org/T358423
[10:44:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s7 codfw as read-only for maintenance - T358423', diff saved to https://phabricator.wikimedia.org/P57881 and previous config saved to /var/cache/conftool/dbconfig/20240224-104440-marostegui.json
[10:44:41] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 81 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:45:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2121 to s7 primary and set section read-write T358423', diff saved to https://phabricator.wikimedia.org/P57882 and previous config saved to /var/cache/conftool/dbconfig/20240224-104522-marostegui.json
[10:45:34] <marostegui>	 everything should be back to normal
[10:46:17] <taavi>	 yep, I see edits flowing again
[10:46:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2118 T358423', diff saved to https://phabricator.wikimedia.org/P57883 and previous config saved to /var/cache/conftool/dbconfig/20240224-104617-root.json
[10:46:46] <eoghan>	 Nice work marostegui
[10:48:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2121 from API', diff saved to https://phabricator.wikimedia.org/P57884 and previous config saved to /var/cache/conftool/dbconfig/20240224-104824-marostegui.json
[10:48:38] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574382 (Marostegui) p:Unbreak!→High
[10:48:51] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574314 (Marostegui) Everything should be back to normal now.
[10:49:12] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update s7-master alias [dns] - https://gerrit.wikimedia.org/r/1006109 (https://phabricator.wikimedia.org/T358423) (owner: Gerrit maintenance bot)
[10:49:50] <icinga-wm_>	 RECOVERY - MariaDB read only s7 #page on db2118 is OK: Version 10.4.25-MariaDB-log, Uptime 1754s, read_only: True, event_scheduler: True, 114.30 QPS, connection latency: 0.005950s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:51:38] <wikibugs>	 SRE, ops-eqiad, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9574388 (Marostegui)
[10:52:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:52:22] <wikibugs>	 (PS1) Marostegui: db2118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006129 (https://phabricator.wikimedia.org/T358423)
[10:53:37] <wikibugs>	 (CR) Marostegui: [C: +2] db2118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1006129 (https://phabricator.wikimedia.org/T358423) (owner: Marostegui)
[10:53:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T357189)', diff saved to https://phabricator.wikimedia.org/P57885 and previous config saved to /var/cache/conftool/dbconfig/20240224-105351-arnaudb.json
[10:53:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:53:58] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:54:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[10:54:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T357189)', diff saved to https://phabricator.wikimedia.org/P57886 and previous config saved to /var/cache/conftool/dbconfig/20240224-105413-arnaudb.json
[10:56:43] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 95 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:06:42] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 85 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:08:49] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574405 (phaultfinder)
[11:33:45] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:40:49] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:40:57] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:45] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 84 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:01:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T357189)', diff saved to https://phabricator.wikimedia.org/P57887 and previous config saved to /var/cache/conftool/dbconfig/20240224-120150-arnaudb.json
[12:02:03] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:05:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2196.codfw.wmnet with OS bookworm
[12:05:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9574417 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2196.codfw.wmnet with OS bookworm executed with errors: - db2196 (**...
[12:07:49] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 109 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:16:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57888 and previous config saved to /var/cache/conftool/dbconfig/20240224-121657-arnaudb.json
[12:32:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P57889 and previous config saved to /var/cache/conftool/dbconfig/20240224-123203-arnaudb.json
[12:45:04] <wikibugs>	 SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#9574453 (cmooney) I note that a current draft in the IETF DNSOPS Working Group, aimed to replace RFC3901, //[[ https://datatracker.ietf.org/doc/html/draft-momoka-dnsop-3901bis-03#name-guidelines-for-dns-zone-con | d...
[12:47:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T357189)', diff saved to https://phabricator.wikimedia.org/P57890 and previous config saved to /var/cache/conftool/dbconfig/20240224-124709-arnaudb.json
[12:47:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[12:47:20] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:47:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[12:47:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57891 and previous config saved to /var/cache/conftool/dbconfig/20240224-124741-arnaudb.json
[13:14:20] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Set up mailing list for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9574467 (Sidishandsome)
[13:38:33] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2420.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2029.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2055.codfw.wmnet, kuber
[13:38:33] <icinga-wm_>	 .codfw.wmnet, mw2313.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, mw2353.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, mw2440.codfw.wmnet, mw2293.codfw.wmnet, mw2444.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2380.codfw.wmnet, mw2292.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2296.codfw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet,
[13:38:33] <icinga-wm_>	 odfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2366.codfw.wmnet, mw2282.codfw.wmnet, mw2318.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet, mw2395.co https://wikitech.wikimedia.org/wiki/PyBal
[13:38:35] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2424.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2294.codfw.wmnet, mw2312.codfw.wmnet, mw2296.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, mw2369.codfw.wmnet, mw2437.codfw.wmnet, mw2381.codfw.wmnet, kubernetes2047.co
[13:38:35] <icinga-wm_>	 , mw2435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2335.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2019.codfw.wmnet, mw2351.codfw.wmnet, mw2384.codfw.wmnet, mw2366.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2425.codfw.wmnet, kubernetes2037.codfw.wmnet, mw2282.codfw.wmnet, mw2318.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet, mw2448.codfw.wmnet, kubernetes2060.codfw.wmnet, kubernetes2058.codfw.wm
[13:38:35] <icinga-wm_>	 rnetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2379.codfw.wmnet, mw2353.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmne https://wikitech.wikimedia.org/wiki/PyBal
[13:40:33] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:40:35] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:42:51] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 87 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:49:53] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 124 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:02:45] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes2046.codfw.wmnet, mw2422.codfw.wmnet, mw2378.codfw.wmnet, mw2294.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes202
[14:02:45] <icinga-wm_>	 mnet, mw2351.codfw.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2318.codfw.wmnet, mw2434.codfw.wmnet, mw2353.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2394.codfw.wmnet, mw2368.codfw.wmnet, mw2356.codfw.wmnet, mw2429.codfw.wmnet, mw2440.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2036.codfw.wmnet, mw2444.codfw.wmnet, mw2450.codfw.wmnet, kubernetes2043.
[14:02:45] <icinga-wm_>	 et, mw2380.codfw.wmnet, mw2301.codfw.wmnet, mw2292.codfw.wmnet, mw2442.codfw.wmnet, kubernetes2017.codfw.wmnet, mw2296.codfw.wmnet, mw2423.codfw.wmnet, mw2445.codfw.wmnet, mw2335.codfw. https://wikitech.wikimedia.org/wiki/PyBal
[14:02:49] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, mw2426.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2421.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2445.codfw.wmnet, mw2381.codfw.wmnet, kubernetes2018.co
[14:02:49] <icinga-wm_>	 , mw2431.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, mw2366.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2425.codfw.wmnet, kubernetes2037.codfw.wmnet, kubernetes2041.codfw.wmnet, mw2448.codfw.wmnet, mw2318.codfw.wmnet, mw2354.codfw.wmnet, mw2395.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2350.codfw.wmnet, kubernetes2045.codfw.wmnet, kubernetes2058.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039.
[14:02:49] <icinga-wm_>	 et, mw2428.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2379.codfw.wmnet, kubernetes2009.codfw.wmnet, mw2436.codfw.wmnet, mw2310.codfw.wmnet, kubernetes2038.co https://wikitech.wikimedia.org/wiki/PyBal
[14:04:47] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:04:49] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:26:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57892 and previous config saved to /var/cache/conftool/dbconfig/20240224-142653-arnaudb.json
[14:27:00] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:33:51] <wikibugs>	 SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574506 (phaultfinder)
[14:38:42] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57893 and previous config saved to /var/cache/conftool/dbconfig/20240224-144200-arnaudb.json
[14:57:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P57894 and previous config saved to /var/cache/conftool/dbconfig/20240224-145706-arnaudb.json
[14:58:17] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:58:42] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:43] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:01:21] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:02:49] <icinga-wm_>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/6 UP : 7 v2 P2P interfaces vs. 6 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:03:49] <icinga-wm_>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:05:47] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:12:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T357189)', diff saved to https://phabricator.wikimedia.org/P57895 and previous config saved to /var/cache/conftool/dbconfig/20240224-151212-arnaudb.json
[15:12:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:12:19] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[15:12:28] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:12:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57896 and previous config saved to /var/cache/conftool/dbconfig/20240224-151234-arnaudb.json
[15:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:13] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:54:05] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:11:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57897 and previous config saved to /var/cache/conftool/dbconfig/20240224-161117-arnaudb.json
[16:11:23] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:26:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P57898 and previous config saved to /var/cache/conftool/dbconfig/20240224-162623-arnaudb.json
[16:28:51] <wikibugs>	 SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9574542 (phaultfinder)
[16:41:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P57899 and previous config saved to /var/cache/conftool/dbconfig/20240224-164129-arnaudb.json
[16:56:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T357189)', diff saved to https://phabricator.wikimedia.org/P57900 and previous config saved to /var/cache/conftool/dbconfig/20240224-165636-arnaudb.json
[16:56:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[16:56:42] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:56:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:49:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[17:49:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[17:49:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57901 and previous config saved to /var/cache/conftool/dbconfig/20240224-174941-arnaudb.json
[17:49:47] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:54:53] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 74 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:51:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57902 and previous config saved to /var/cache/conftool/dbconfig/20240224-185132-arnaudb.json
[18:51:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:06:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P57903 and previous config saved to /var/cache/conftool/dbconfig/20240224-190638-arnaudb.json
[19:21:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P57904 and previous config saved to /var/cache/conftool/dbconfig/20240224-192144-arnaudb.json
[19:36:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T357189)', diff saved to https://phabricator.wikimedia.org/P57905 and previous config saved to /var/cache/conftool/dbconfig/20240224-193651-arnaudb.json
[19:36:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[19:36:58] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:37:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[19:37:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57906 and previous config saved to /var/cache/conftool/dbconfig/20240224-193712-arnaudb.json
[19:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:38:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57907 and previous config saved to /var/cache/conftool/dbconfig/20240224-203816-arnaudb.json
[20:38:23] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[20:53:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57908 and previous config saved to /var/cache/conftool/dbconfig/20240224-205323-arnaudb.json
[21:08:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P57909 and previous config saved to /var/cache/conftool/dbconfig/20240224-210830-arnaudb.json
[21:23:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T357189)', diff saved to https://phabricator.wikimedia.org/P57910 and previous config saved to /var/cache/conftool/dbconfig/20240224-212336-arnaudb.json
[21:23:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[21:23:44] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:23:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[21:23:55] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:24:08] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:24:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57911 and previous config saved to /var/cache/conftool/dbconfig/20240224-212414-arnaudb.json
[22:23:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57912 and previous config saved to /var/cache/conftool/dbconfig/20240224-222331-arnaudb.json
[22:23:39] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[22:38:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P57913 and previous config saved to /var/cache/conftool/dbconfig/20240224-223837-arnaudb.json
[22:53:44] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P57914 and previous config saved to /var/cache/conftool/dbconfig/20240224-225343-arnaudb.json
[23:08:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T357189)', diff saved to https://phabricator.wikimedia.org/P57915 and previous config saved to /var/cache/conftool/dbconfig/20240224-230850-arnaudb.json
[23:08:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[23:08:56] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[23:09:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[23:09:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T357189)', diff saved to https://phabricator.wikimedia.org/P57916 and previous config saved to /var/cache/conftool/dbconfig/20240224-230912-arnaudb.json
[23:48:43] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed