[00:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008066 [00:38:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008066 (owner: 10TrainBranchBot) [00:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [00:48:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:51:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:00:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1008066 (owner: 10TrainBranchBot) [01:48:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:51:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:03] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:58:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:12:31] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:12:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:57] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 648.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:32:59] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:01:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:01:49] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T354015)', diff saved to https://phabricator.wikimedia.org/P58310 and previous config saved to /var/cache/conftool/dbconfig/20240303-050905-marostegui.json [05:09:09] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [05:24:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58311 and previous config saved to /var/cache/conftool/dbconfig/20240303-052411-marostegui.json [05:27:59] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:03] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:39:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P58312 and previous config saved to /var/cache/conftool/dbconfig/20240303-053918-marostegui.json [05:39:35] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:54:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T354015)', diff saved to https://phabricator.wikimedia.org/P58313 and previous config saved to /var/cache/conftool/dbconfig/20240303-055424-marostegui.json [05:54:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [05:54:29] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [05:54:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [05:54:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T354015)', diff saved to https://phabricator.wikimedia.org/P58314 and previous config saved to /var/cache/conftool/dbconfig/20240303-055447-marostegui.json [06:51:41] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:52:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:50] 06SRE, 10Wikimedia-Mailing-lists: Set up mailing list ipbe-zh for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9593857 (10SunAfterRain) [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:27:49] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:34:31] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:34:55] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:51:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240303T0800) [08:00:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:01] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:09:01] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:10:03] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:07] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 357.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:39:35] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:56:13] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:27:25] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.78 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:51:33] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:47:04] (03PS4) 10TheDJ: Remove deprecated X-Webkit-CSP-Report-Only response header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003108 (https://phabricator.wikimedia.org/T357479) [12:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T354015)', diff saved to https://phabricator.wikimedia.org/P58315 and previous config saved to /var/cache/conftool/dbconfig/20240303-123955-marostegui.json [12:40:00] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [12:55:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58316 and previous config saved to /var/cache/conftool/dbconfig/20240303-125502-marostegui.json [13:10:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P58317 and previous config saved to /var/cache/conftool/dbconfig/20240303-131008-marostegui.json [13:25:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T354015)', diff saved to https://phabricator.wikimedia.org/P58318 and previous config saved to /var/cache/conftool/dbconfig/20240303-132514-marostegui.json [13:25:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [13:25:19] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [13:25:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1228.eqiad.wmnet with reason: Maintenance [13:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T354015)', diff saved to https://phabricator.wikimedia.org/P58319 and previous config saved to /var/cache/conftool/dbconfig/20240303-132536-marostegui.json [14:01:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:38:03] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:03] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:34] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9594304 (10cmooney) a:03cmooney >>! In T358658#9591261, @mpopov wrote: > Thank you @cmooney for taking this non-standard case on and helping KC out! T... [16:34:46] (03CR) 10Mainframe98: "This won't work then. Back to the drawing board." [puppet] - 10https://gerrit.wikimedia.org/r/1008001 (https://phabricator.wikimedia.org/T358940) (owner: 10Mainframe98) [17:13:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:18:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:32:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:42:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:44:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:49:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:49:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 48.53% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:54:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 43.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:20:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:25:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:26:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:30:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 49.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:35:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:40:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 43.01% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:40:36] (03PS1) 10Kosta Harlan: throttle: Allow for overriding temp account creation limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) [19:47:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354015)', diff saved to https://phabricator.wikimedia.org/P58320 and previous config saved to /var/cache/conftool/dbconfig/20240303-194721-marostegui.json [19:47:28] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [20:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58321 and previous config saved to /var/cache/conftool/dbconfig/20240303-200228-marostegui.json [20:17:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P58322 and previous config saved to /var/cache/conftool/dbconfig/20240303-201734-marostegui.json [20:18:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:20:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:21:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:22:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.916 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:22:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51594 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T354015)', diff saved to https://phabricator.wikimedia.org/P58323 and previous config saved to /var/cache/conftool/dbconfig/20240303-203240-marostegui.json [20:32:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [20:32:45] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [20:32:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1232.eqiad.wmnet with reason: Maintenance [20:33:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T354015)', diff saved to https://phabricator.wikimedia.org/P58324 and previous config saved to /var/cache/conftool/dbconfig/20240303-203302-marostegui.json [21:29:07] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status