[00:02:48] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:09:14] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024820 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [00:11:46] (03CR) 10Eevans: [C:03+2] cassandra_dev: rename surrogate user [puppet] - 10https://gerrit.wikimedia.org/r/1024820 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [00:13:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:20:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:24:41] (03PS1) 10Eevans: cassandra-dev: comment the cassandra_dev DDL (no-op change) [puppet] - 10https://gerrit.wikimedia.org/r/1024821 (https://phabricator.wikimedia.org/T355730) [00:25:28] (03CR) 10Eevans: [C:03+2] cassandra-dev: comment the cassandra_dev DDL (no-op change) [puppet] - 10https://gerrit.wikimedia.org/r/1024821 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [00:51:51] !log rebooting puppetserver1001.eqiad.wmnet via drac [00:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:40] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:00:25] (SystemdUnitFailed) firing: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:38] 06SRE: puppetserver1001.eqiad.wmnet is unresponsive - https://phabricator.wikimedia.org/T363615#9750242 (10Peachey88) [01:08:43] 06SRE: puppetserver1001.eqiad.wmnet is unresponsive - https://phabricator.wikimedia.org/T363615#9750243 (10Eevans) Restarted via the drac and everything seems OK now. I skimmed the logs and didn't see anything that seemed unusual prior to the event. [01:09:20] 06SRE, 06Infrastructure-Foundations: puppetserver1001.eqiad.wmnet is unresponsive - https://phabricator.wikimedia.org/T363615#9750244 (10Eevans) [01:13:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:14:28] (03PS1) 10Eevans: cassandra-dev: use the correct `CREATE ROLE` syntax [puppet] - 10https://gerrit.wikimedia.org/r/1024822 (https://phabricator.wikimedia.org/T355730) [01:15:25] (SystemdUnitFailed) resolved: sync-puppet-volatile.service on puppetserver1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:31] (03CR) 10Eevans: [C:03+2] cassandra-dev: use the correct `CREATE ROLE` syntax [puppet] - 10https://gerrit.wikimedia.org/r/1024822 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [01:25:15] (03PS1) 10Eevans: cassandra_dev: use correct path to credentials file [puppet] - 10https://gerrit.wikimedia.org/r/1024823 (https://phabricator.wikimedia.org/T355730) [01:28:06] (03PS1) 10Andrea Denisse: ssl: Remove unnecessary dummy key from thanos-query hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024824 (https://phabricator.wikimedia.org/T360414) [01:28:16] (03CR) 10Eevans: [C:03+2] cassandra_dev: use correct path to credentials file [puppet] - 10https://gerrit.wikimedia.org/r/1024823 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [01:31:06] (03PS1) 10Eevans: Rename cassandra user [labs/private] - 10https://gerrit.wikimedia.org/r/1024825 [01:33:49] (03CR) 10Eevans: [V:03+2 C:03+2] Rename cassandra user [labs/private] - 10https://gerrit.wikimedia.org/r/1024825 (owner: 10Eevans) [01:44:36] (03PS1) 10Eevans: cassandra_dev: fix permissions on credentials file [puppet] - 10https://gerrit.wikimedia.org/r/1024826 (https://phabricator.wikimedia.org/T355730) [01:50:19] (03CR) 10Eevans: [C:03+2] cassandra_dev: fix permissions on credentials file [puppet] - 10https://gerrit.wikimedia.org/r/1024826 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [02:00:07] (03PS1) 10Eevans: cassandra_dev: force ssl for cqlsh sessions (surrogate user) [puppet] - 10https://gerrit.wikimedia.org/r/1024828 (https://phabricator.wikimedia.org/T355730) [02:38:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:25] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:10:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:27:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:32:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:56:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:57:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:27:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:49:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:57:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P61266 and previous config saved to /var/cache/conftool/dbconfig/20240427-065728-ladsgroup.json [06:57:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:03:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T352010)', diff saved to https://phabricator.wikimedia.org/P61267 and previous config saved to /var/cache/conftool/dbconfig/20240427-070648-ladsgroup.json [07:07:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:12:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P61268 and previous config saved to /var/cache/conftool/dbconfig/20240427-071235-ladsgroup.json [07:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:21:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61269 and previous config saved to /var/cache/conftool/dbconfig/20240427-072155-ladsgroup.json [07:27:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P61270 and previous config saved to /var/cache/conftool/dbconfig/20240427-072742-ladsgroup.json [07:37:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P61271 and previous config saved to /var/cache/conftool/dbconfig/20240427-073703-ladsgroup.json [07:42:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P61272 and previous config saved to /var/cache/conftool/dbconfig/20240427-074250-ladsgroup.json [07:42:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance [07:43:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance [07:43:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:52:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T352010)', diff saved to https://phabricator.wikimedia.org/P61273 and previous config saved to /var/cache/conftool/dbconfig/20240427-075210-ladsgroup.json [07:52:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [07:52:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [07:52:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:52:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T352010)', diff saved to https://phabricator.wikimedia.org/P61274 and previous config saved to /var/cache/conftool/dbconfig/20240427-075233-ladsgroup.json [08:58:11] !log restarted uwsgi on netbox1002 to pickup the latest wmflib with magru [08:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:21] (03CR) 10Volans: [V:03+2 C:03+2] "This works and it's indeed needed now. Merging." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 (owner: 10Hashar) [09:41:24] (03PS1) 10Volans: Change build image user from root to nobody [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024837 [09:41:24] (03PS1) 10Volans: Update dependencies [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024838 [09:44:04] (03CR) 10Volans: [V:03+2 C:03+2] "Already merged on master and tested locally. Merging." [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024837 (owner: 10Volans) [09:48:15] (03CR) 10Volans: "Not merging it now due to many major version changes, will test it next week on -dev first." [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024838 (owner: 10Volans) [09:51:16] !log manually upgraded wmflib in netbox1002/2002's Netbox's venv [09:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:50:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:03:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:08:50] (03PS1) 10Ayounsi: Add AS65007 to confederation [homer/public] - 10https://gerrit.wikimedia.org/r/1024848 (https://phabricator.wikimedia.org/T362421) [11:10:50] (03CR) 10Ayounsi: [C:03+2] Add AS65007 to confederation [homer/public] - 10https://gerrit.wikimedia.org/r/1024848 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [11:11:20] (03Merged) 10jenkins-bot: Add AS65007 to confederation [homer/public] - 10https://gerrit.wikimedia.org/r/1024848 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [11:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:52:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:57:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:04:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:06:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:06:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51782 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:07:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:09:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:22:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51782 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:24:32] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:25:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:28:15] (03CR) 10Eevans: [C:03+2] cassandra_dev: force ssl for cqlsh sessions (surrogate user) [puppet] - 10https://gerrit.wikimedia.org/r/1024828 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [13:30:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:43:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 40% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:47:46] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:53:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:03:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:08:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:10:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:15:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:20:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:20:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:25:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:38:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:46:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Maintenance [14:46:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Maintenance [14:46:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T352010)', diff saved to https://phabricator.wikimedia.org/P61275 and previous config saved to /var/cache/conftool/dbconfig/20240427-144642-ladsgroup.json [14:47:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:48:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:53:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:54:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:58:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:02:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:04:53] (03PS1) 10Lucas Werkmeister: query_service: Add Access-Control-Allow-Header [puppet] - 10https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) [15:06:02] (03CR) 10Lucas Werkmeister: "Disclaimer: I have no idea if this works or not. (It also leaves out OAuth mode entirely, because that one’s barely possible to use extern" [puppet] - 10https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) (owner: 10Lucas Werkmeister) [15:06:16] (03PS2) 10Lucas Werkmeister: query_service: Add Access-Control-Allow-Headers [puppet] - 10https://gerrit.wikimedia.org/r/1024884 (https://phabricator.wikimedia.org/T362570) [15:12:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.55% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:22:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:27:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:29:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:34:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.92% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:35:26] (ProbeDown) firing: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:38:52] (ProbeDown) resolved: Service citoid:4003 has failed probes (http_citoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#citoid:4003 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.05% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:46:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 36.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:56:06] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.94% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:03:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:23:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:25:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:55:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 35.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:56:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:01:42] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [17:03:46] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru - ayounsi@cumin1002" [17:06:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru - ayounsi@cumin1002" [17:06:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:28:31] (03PS1) 10Ayounsi: Add magru to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/1024894 (https://phabricator.wikimedia.org/T362421) [17:28:33] (03PS1) 10Ayounsi: Add magru network to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1024895 (https://phabricator.wikimedia.org/T362421) [17:29:46] (03CR) 10Ayounsi: [C:03+2] Add magru to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/1024894 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [17:39:52] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [17:42:02] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru oob - ayounsi@cumin1002" [17:54:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru oob - ayounsi@cumin1002" [17:54:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:45] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:12:45] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:20:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51782 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:34:44] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:38:11] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for magru PDUs - cmooney@cumin1002" [19:39:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for magru PDUs - cmooney@cumin1002" [19:39:04] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:30] (Storage /var over 50%) firing: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [19:40:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T352010)', diff saved to https://phabricator.wikimedia.org/P61277 and previous config saved to /var/cache/conftool/dbconfig/20240427-194017-ladsgroup.json [19:40:34] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:43:36] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:45:46] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for magru PDUs - cmooney@cumin1002" [19:46:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for magru PDUs - cmooney@cumin1002" [19:46:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:30] (Storage /var over 50%) firing: (2) Alert for device asw1-b3-magru.mgmt.magru.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [19:55:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61278 and previous config saved to /var/cache/conftool/dbconfig/20240427-195524-ladsgroup.json [20:10:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P61279 and previous config saved to /var/cache/conftool/dbconfig/20240427-201031-ladsgroup.json [20:14:02] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:15:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:19:30] (Storage /var over 50%) firing: (2) Device asw1-b3-magru.mgmt.magru.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [20:20:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:22:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7001 DNS add - pt1979@cumin2002" [20:23:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp7001 DNS add - pt1979@cumin2002" [20:23:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:24:30] (Storage /var over 50%) resolved: Device asw1-b4-magru.mgmt.magru.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [20:25:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cp7001.mgmt.magru.wmnet with reboot policy FORCED [20:25:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp7001.mgmt.magru.wmnet with reboot policy FORCED [20:25:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T352010)', diff saved to https://phabricator.wikimedia.org/P61280 and previous config saved to /var/cache/conftool/dbconfig/20240427-202539-ladsgroup.json [20:25:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [20:25:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:25:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [20:26:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T352010)', diff saved to https://phabricator.wikimedia.org/P61281 and previous config saved to /var/cache/conftool/dbconfig/20240427-202602-ladsgroup.json [20:29:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 850.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:29:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cp7001.mgmt.magru.wmnet with reboot policy FORCED [20:34:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 850.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:41:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7001.mgmt.magru.wmnet with reboot policy FORCED [20:41:47] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9750788 (10Papaul) [20:42:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7001'] [20:49:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:54:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51783 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:54:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 85436576 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:57:22] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 41608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:01:23] (03PS1) 10Papaul: Add cp7001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024902 (https://phabricator.wikimedia.org/T362729) [21:01:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7001'] [21:04:13] (03CR) 10Papaul: [C:03+2] Add cp7001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024902 (https://phabricator.wikimedia.org/T362729) (owner: 10Papaul) [21:06:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7001'] [21:14:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp7001'] [21:16:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cp7001.magru.wmnet with OS bullseye [21:16:53] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9750798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp7001.magru.wmnet with OS bullseye [22:07:50] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.26% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:20:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:25:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T352010)', diff saved to https://phabricator.wikimedia.org/P61282 and previous config saved to /var/cache/conftool/dbconfig/20240427-222548-ladsgroup.json [22:25:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:40:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P61283 and previous config saved to /var/cache/conftool/dbconfig/20240427-224057-ladsgroup.json [22:44:29] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7001.magru.wmnet with OS bullseye [22:44:39] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9750831 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp7001.magru.wmnet with OS bullseye executed with errors: - cp7001 (**FAIL**) - Removed fr... [22:56:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P61284 and previous config saved to /var/cache/conftool/dbconfig/20240427-225604-ladsgroup.json [22:58:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:11:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T352010)', diff saved to https://phabricator.wikimedia.org/P61285 and previous config saved to /var/cache/conftool/dbconfig/20240427-231112-ladsgroup.json [23:11:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [23:11:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:11:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [23:11:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T352010)', diff saved to https://phabricator.wikimedia.org/P61286 and previous config saved to /var/cache/conftool/dbconfig/20240427-231136-ladsgroup.json [23:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024748 [23:38:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024748 (owner: 10TrainBranchBot) [23:58:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024748 (owner: 10TrainBranchBot)