[00:28:40] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 113 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:35:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 662 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:38:56] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:48] (ProbeHttpFailed) firing: (4) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [00:50:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [00:58:50] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:32] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:40:37] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [02:26:08] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:40:16] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:49:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [02:51:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:11:18] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 51.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:13:52] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:19:30] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:47:54] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [03:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [03:55:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:20] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:02:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:30] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:10:58] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2022-02-28 03:21:02 (605 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [04:26:32] (PS1) Ladsgroup: db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766320 (https://phabricator.wikimedia.org/T302185)
[04:27:45] (PS2) Ladsgroup: db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766320 (https://phabricator.wikimedia.org/T302185) [04:27:50] (CR) Ladsgroup: [V: +2 C: +2] db1172: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766320 (https://phabricator.wikimedia.org/T302185) (owner: Ladsgroup) [04:29:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [04:29:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [04:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T302185)', diff saved to https://phabricator.wikimedia.org/P21541 and previous config saved to /var/cache/conftool/dbconfig/20220228-043003-ladsgroup.json [04:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:14] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [04:35:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS bullseye [04:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage [04:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage [04:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:48] (ProbeHttpFailed) firing: (4) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - 
https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [04:55:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:55:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:56:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS bullseye [05:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T302185)', diff saved to https://phabricator.wikimedia.org/P21542 and previous config saved to /var/cache/conftool/dbconfig/20220228-051016-ladsgroup.json [05:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:18:16] (PS1) Ladsgroup: db1178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766323 (https://phabricator.wikimedia.org/T302185) [05:18:58] (CR) Ladsgroup: [C: +2] db1178: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766323 
(https://phabricator.wikimedia.org/T302185) (owner: Ladsgroup) [05:18:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:19:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300992)', diff saved to https://phabricator.wikimedia.org/P21543 and previous config saved to /var/cache/conftool/dbconfig/20220228-051905-ladsgroup.json [05:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:14] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [05:19:18] (PS1) Ladsgroup: Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766135 [05:19:23] (PS2) Ladsgroup: Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766135 [05:19:26] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766135 (owner: Ladsgroup) [05:20:55] (PS1) Ladsgroup: ContentHandler: Use ParserOutputAccess for accessing ParserOutput [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/766136 (https://phabricator.wikimedia.org/T302620) [05:21:04] (CR) Ladsgroup: [C: +2] ContentHandler: Use ParserOutputAccess for accessing ParserOutput [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/766136 (https://phabricator.wikimedia.org/T302620) (owner: Ladsgroup) [05:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21544 and previous config saved 
to /var/cache/conftool/dbconfig/20220228-052521-ladsgroup.json [05:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:37] (Abandoned) Ladsgroup: Enable wmgEmergencyCaptcha everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/763870 (owner: Ladsgroup) [05:36:29] (Merged) jenkins-bot: ContentHandler: Use ParserOutputAccess for accessing ParserOutput [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/766136 (https://phabricator.wikimedia.org/T302620) (owner: Ladsgroup) [05:37:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300992)', diff saved to https://phabricator.wikimedia.org/P21545 and previous config saved to /var/cache/conftool/dbconfig/20220228-053721-ladsgroup.json [05:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:39] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [05:38:39] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.23/includes/content/ContentHandler.php: Backport: [[gerrit:766136|ContentHandler: Use ParserOutputAccess for accessing ParserOutput (T302620)]] (duration: 00m 49s) [05:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:45] T302620: 8m duplicate parses per day after wmf.23 rollout - https://phabricator.wikimedia.org/T302620 [05:40:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P21546 and previous config saved to /var/cache/conftool/dbconfig/20220228-054025-ladsgroup.json [05:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org [05:43:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21547 and previous config saved to /var/cache/conftool/dbconfig/20220228-055226-ladsgroup.json [05:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:00] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:54:10] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T302185)', diff saved to https://phabricator.wikimedia.org/P21548 and previous config saved to /var/cache/conftool/dbconfig/20220228-055530-ladsgroup.json [05:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:37] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [05:56:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [05:56:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [05:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T302185)', diff saved to https://phabricator.wikimedia.org/P21549 and previous config saved to 
/var/cache/conftool/dbconfig/20220228-055626-ladsgroup.json [05:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS bullseye [06:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P21550 and previous config saved to /var/cache/conftool/dbconfig/20220228-060731-ladsgroup.json [06:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [06:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [06:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300992)', diff saved to https://phabricator.wikimedia.org/P21551 and previous config saved to /var/cache/conftool/dbconfig/20220228-062236-ladsgroup.json [06:22:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:22:42] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:22:43] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [06:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [06:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1178.eqiad.wmnet with OS bullseye [06:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T302185)', diff saved to https://phabricator.wikimedia.org/P21552 and previous config saved to /var/cache/conftool/dbconfig/20220228-063800-ladsgroup.json [06:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:06] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [06:42:35] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:42:42] !log configure BGP between codfw and eqdfw [06:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:37] RECOVERY - Host cp6008 is UP: PING WARNING - Packet loss = 90%, RTA = 147.04 ms [06:43:37] RECOVERY - Host cp6004 is UP: PING OK - Packet loss = 0%, RTA = 155.28 ms [06:43:39] RECOVERY - Host cp6001 is UP: PING OK - Packet loss = 0%, RTA = 147.19 ms [06:43:39] RECOVERY - Host cp6006 is UP: PING OK - Packet loss = 0%, RTA = 148.87 ms [06:43:39] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 147.46 ms [06:43:39] RECOVERY - 
Host cp6016 is UP: PING OK - Packet loss = 0%, RTA = 148.83 ms [06:43:39] RECOVERY - Host cp6005 is UP: PING OK - Packet loss = 0%, RTA = 148.87 ms [06:43:39] RECOVERY - Host lvs6003 is UP: PING OK - Packet loss = 0%, RTA = 147.23 ms [06:43:40] RECOVERY - Host cp6015 is UP: PING OK - Packet loss = 0%, RTA = 148.85 ms [06:43:40] RECOVERY - Host cp6010 is UP: PING OK - Packet loss = 0%, RTA = 148.85 ms [06:43:41] RECOVERY - Host cp6011 is UP: PING OK - Packet loss = 0%, RTA = 148.91 ms [06:43:41] RECOVERY - Host cp6012 is UP: PING OK - Packet loss = 0%, RTA = 146.94 ms [06:43:42] RECOVERY - Host cp6013 is UP: PING OK - Packet loss = 0%, RTA = 148.93 ms [06:43:42] RECOVERY - Host cp6014 is UP: PING OK - Packet loss = 0%, RTA = 146.91 ms [06:43:43] RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 149.14 ms [06:43:43] RECOVERY - Host lvs6002 is UP: PING OK - Packet loss = 0%, RTA = 148.94 ms [06:43:44] RECOVERY - Host cp6002 is UP: PING OK - Packet loss = 0%, RTA = 148.91 ms [06:43:44] RECOVERY - Host cp6003 is UP: PING OK - Packet loss = 0%, RTA = 147.17 ms [06:43:45] RECOVERY - Host ganeti6004 is UP: PING OK - Packet loss = 0%, RTA = 148.80 ms [06:43:45] RECOVERY - Host ganeti6003 is UP: PING OK - Packet loss = 0%, RTA = 148.94 ms [06:43:46] RECOVERY - Host ganeti6001 is UP: PING OK - Packet loss = 0%, RTA = 147.08 ms [06:43:46] RECOVERY - Host cp6009 is UP: PING OK - Packet loss = 0%, RTA = 147.00 ms [06:43:47] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 147.99 ms [06:43:47] RECOVERY - Host durum6001 is UP: PING OK - Packet loss = 0%, RTA = 147.45 ms [06:43:48] RECOVERY - Host cp6007 is UP: PING OK - Packet loss = 0%, RTA = 148.90 ms [06:43:51] RECOVERY - Host ganeti6002 is UP: PING OK - Packet loss = 0%, RTA = 146.94 ms [06:44:07] RECOVERY - Host lvs6001 is UP: PING OK - Packet loss = 0%, RTA = 149.59 ms [06:44:13] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 149.10 ms [06:44:13] RECOVERY - Host 
ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 147.85 ms [06:44:35] RECOVERY - Host ganeti6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.54 ms [06:45:49] RECOVERY - Host asw1-b12-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.35 ms [06:45:55] RECOVERY - Host asw1-b13-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.41 ms [06:46:11] RECOVERY - Host cp6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.45 ms [06:46:11] RECOVERY - Host cp6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.70 ms [06:46:11] RECOVERY - Host cp6006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.36 ms [06:46:11] RECOVERY - Host cp6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.59 ms [06:46:11] RECOVERY - Host cp6005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.60 ms [06:46:11] RECOVERY - Host cp6008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.33 ms [06:46:11] RECOVERY - Host cp6007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.41 ms [06:46:12] RECOVERY - Host cp6010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 153.93 ms [06:46:12] RECOVERY - Host cp6011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 148.65 ms [06:46:13] RECOVERY - Host cp6009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.98 ms [06:46:13] RECOVERY - Host cp6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 148.74 ms [06:46:14] RECOVERY - Host cp6013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.35 ms [06:46:14] RECOVERY - Host cp6014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.62 ms [06:46:15] RECOVERY - Host cp6012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.38 ms [06:46:15] RECOVERY - Host cp6015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.38 ms [06:46:16] RECOVERY - Host cp6016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.33 ms [06:46:16] RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.42 ms [06:46:17] RECOVERY - Host cr2-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.48 ms [06:46:33] 
RECOVERY - Host dns6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.56 ms [06:46:41] RECOVERY - Host dns6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.52 ms [06:47:14] (PS1) Ladsgroup: Revert "db1178: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766137 [06:47:20] (PS2) Ladsgroup: Revert "db1178: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766137 [06:47:25] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1178: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/766137 (owner: Ladsgroup) [06:48:39] RECOVERY - Host ganeti6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.53 ms [06:48:39] RECOVERY - Host ganeti6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.77 ms [06:48:39] RECOVERY - Host ganeti6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.64 ms [06:48:39] (PS1) Ladsgroup: db1177: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766555 (https://phabricator.wikimedia.org/T302185) [06:49:17] RECOVERY - Host lvs6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.71 ms [06:49:17] RECOVERY - Host lvs6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 149.61 ms [06:49:17] RECOVERY - Host lvs6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 147.46 ms [06:49:17] (CR) Ladsgroup: [C: +2] db1177: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/766555 (https://phabricator.wikimedia.org/T302185) (owner: Ladsgroup) [06:49:44] (CR) Ayounsi: [C: +1] New function and changes to wmf-netbox plugin to support EVPN config. [software/homer/deploy] - https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [06:50:23] RECOVERY - Host scs-drmrs is UP: PING OK - Packet loss = 0%, RTA = 149.23 ms [06:50:50] (CR) Ayounsi: [C: +1] "Looks great to me!" 
[software/homer/deploy] - https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [06:53:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P21553 and previous config saved to /var/cache/conftool/dbconfig/20220228-065304-ladsgroup.json [06:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:44] RECOVERY - puppet last run on thanos-be1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:57:00] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:57:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:01:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:01:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300992)', diff saved to https://phabricator.wikimedia.org/P21554 and previous config saved to /var/cache/conftool/dbconfig/20220228-070148-ladsgroup.json [07:02:38] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:58] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:58] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:08:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P21555 and previous config saved to /var/cache/conftool/dbconfig/20220228-070809-ladsgroup.json [07:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:02] (PS2) Elukey: knative-serving: keep only the last two revisions by default [deployment-charts] - https://gerrit.wikimedia.org/r/764799 [07:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300992)', diff saved to https://phabricator.wikimedia.org/P21556 and previous config saved to /var/cache/conftool/dbconfig/20220228-072045-ladsgroup.json [07:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:52] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:21:59] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/766138 (owner: Juan90264) [07:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T302185)', diff saved to https://phabricator.wikimedia.org/P21557 and previous config saved to /var/cache/conftool/dbconfig/20220228-072314-ladsgroup.json [07:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:21] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [07:23:55] (PS3) Juan90264: Change temporary logo for slwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/766138 (https://phabricator.wikimedia.org/T302661) [07:24:53] (CR) Ayounsi: [C: -1] "Added reviews for patch-set 7 to 10." 
[homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [07:25:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:25:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T302185)', diff saved to https://phabricator.wikimedia.org/P21558 and previous config saved to /var/cache/conftool/dbconfig/20220228-072546-ladsgroup.json [07:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1177.eqiad.wmnet with OS bullseye [07:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21559 and previous config saved to /var/cache/conftool/dbconfig/20220228-073550-ladsgroup.json [07:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:18] (CR) Elukey: [C: +2] knative-serving: keep only the last two revisions by default [deployment-charts] - https://gerrit.wikimedia.org/r/764799 (owner: Elukey) [07:42:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[07:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [07:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [07:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:48] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:44:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [07:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [07:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:54] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [07:50:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P21560 and previous config saved to /var/cache/conftool/dbconfig/20220228-075054-ladsgroup.json [07:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:52] Waiting to backport [07:52:40] (NodeTextfileStale) firing: (2) Stale textfile for thanos-be1003:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [07:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [07:58:26] !log drain instances off ganeti2007 for eventual decom [07:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:52] (03CR) 10Ladsgroup: [C: 03+1] "It won't break any existing functionality so I think it's good." [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [08:00:05] Amir1, awight, Urbanecm, and taavi: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T0800). [08:00:05] Juan_90264: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:16] good morning! I can deploy today [08:00:17] !log enable notifications for thanos-be1003 in icinga and clear up /srv/swift-storage/sdm1 since it was filling up / [08:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:34] Thanks taavi and good morning [08:00:39] I'm here too if needed. [08:01:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1177.eqiad.wmnet with OS bullseye [08:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:30] (03CR) 10Majavah: [C: 03+2] Change temporary logo for slwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766138 (https://phabricator.wikimedia.org/T302661) (owner: 10Juan90264) [08:02:26] Good morning guys! 
[08:02:29] (03Merged) 10jenkins-bot: Change temporary logo for slwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766138 (https://phabricator.wikimedia.org/T302661) (owner: 10Juan90264) [08:02:40] (NodeTextfileStale) resolved: (2) Stale textfile for thanos-be1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [08:02:55] Juan_90264: can you test on mwdebug1001 please? [08:03:13] Yes, I can [08:05:57] Taavi: I tested and approved [08:06:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300992)', diff saved to https://phabricator.wikimedia.org/P21561 and previous config saved to /var/cache/conftool/dbconfig/20220228-080559-ladsgroup.json [08:06:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:06:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:06:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:06] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:06:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300992)', diff saved to https://phabricator.wikimedia.org/P21562 and previous config saved to 
/var/cache/conftool/dbconfig/20220228-080613-ladsgroup.json [08:06:14] thanks! syncing [08:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T302185)', diff saved to https://phabricator.wikimedia.org/P21563 and previous config saved to /var/cache/conftool/dbconfig/20220228-080710-ladsgroup.json [08:07:11] !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:766138|Change temporary logo for slwiki (T302661)]] (duration: 00m 50s) [08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:27] I like these new backport times, I can now be available at any backport time [08:07:51] (03PS1) 10Ladsgroup: db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766561 (https://phabricator.wikimedia.org/T302185) [08:08:04] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:766138|Change temporary logo for slwiki (T302661)]] (duration: 00m 48s) [08:08:14] (03PS1) 10Muehlenhoff: Remove ganeti2007 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/766562 [08:08:18] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [08:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:37] (03CR) 10Ladsgroup: [C: 03+2] db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766561 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [08:08:59] These new backport schedules still left the "UTC late backport window" great, I believe there is no lack of deployers at this time [08:09:06] T302661: Requesting temporary 
logo change for sl.wikipedia.org - https://phabricator.wikimedia.org/T302661 [08:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:11] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:766138|Change temporary logo for slwiki (T302661)]] (duration: 00m 48s) [08:09:13] and your change is now live! [08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:25] you can thank urbanecm for the new times, I like them too :-) [08:09:52] anyone have anything else to deploy? [08:10:05] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [08:10:24] !log UTC morning deploys done [08:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:35] taavi: Juan_90264: I'm glad you like the timings :) [08:11:41] Urbanecm: Thanks for the change that put these new times [08:12:53] Change working, thanks Taavi! [08:14:40] 10SRE-swift-storage: Alert on unmounted swift partitions - https://phabricator.wikimedia.org/T225079 (10fgiunchedi) This happened again on thanos-be1003 - where the `sdm1` filesystem had errors and was unmounted, though no specific notifications were issued. I've remounted the filesystem for now th... 
[08:22:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P21564 and previous config saved to /var/cache/conftool/dbconfig/20220228-082215-ladsgroup.json [08:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:11] (03CR) 10Giuseppe Lavagetto: conftool: add request-actions / request-patterns (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/763486 (owner: 10Giuseppe Lavagetto) [08:32:09] (03PS1) 10Kevin Bazira: ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766565 (https://phabricator.wikimedia.org/T301415) [08:34:42] (03PS1) 10Kevin Bazira: ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766566 (https://phabricator.wikimedia.org/T301415) [08:37:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P21566 and previous config saved to /var/cache/conftool/dbconfig/20220228-083720-ladsgroup.json [08:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:42] (03PS2) 10Kevin Bazira: ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766565 (https://phabricator.wikimedia.org/T301415) [08:39:27] (03PS1) 10Volans: Revert "setup.py: temporary limit prospector version" [cookbooks] - 10https://gerrit.wikimedia.org/r/766139 [08:39:38] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:39:39] (03Abandoned) 10Kevin Bazira: ml-services: add lvwiki, nlwiki & nowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/766566 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [08:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:46] (03PS10) 10Giuseppe Lavagetto: 
conftool: add request-actions / request-patterns [puppet] - 10https://gerrit.wikimedia.org/r/763486 [08:39:48] (03PS8) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 [08:39:50] (03CR) 10Volans: [C: 03+2] "Restoring previous state." [cookbooks] - 10https://gerrit.wikimedia.org/r/766139 (owner: 10Volans) [08:40:28] (03CR) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763557 (owner: 10Giuseppe Lavagetto) [08:42:40] (03Merged) 10jenkins-bot: Revert "setup.py: temporary limit prospector version" [cookbooks] - 10https://gerrit.wikimedia.org/r/766139 (owner: 10Volans) [08:43:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300992)', diff saved to https://phabricator.wikimedia.org/P21567 and previous config saved to /var/cache/conftool/dbconfig/20220228-084316-ladsgroup.json [08:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:22] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:45:34] (03CR) 10Jgiannelos: [C: 03+1] maps: disable kartotherian on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664) (owner: 10Hnowlan) [08:47:32] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:48] (ProbeHttpFailed) firing: (4) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [08:49:55] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:46] !log installing expat security updates [08:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T302185)', diff saved to https://phabricator.wikimedia.org/P21570 and previous config saved to /var/cache/conftool/dbconfig/20220228-085224-ladsgroup.json [08:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:30] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [08:53:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [08:53:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [08:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T302185)', diff saved to https://phabricator.wikimedia.org/P21571 and previous config saved to /var/cache/conftool/dbconfig/20220228-085329-ladsgroup.json [08:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21572 and previous config saved to /var/cache/conftool/dbconfig/20220228-085820-ladsgroup.json [08:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1126.eqiad.wmnet with OS bullseye [09:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:55] RECOVERY - OSPF status on cr1-eqiad is OK: 
OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:02:17] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:10:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage [09:10:03] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:54] (03PS1) 10Muehlenhoff: Also include staging server in analytics-tools Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/766567 [09:12:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1126.eqiad.wmnet with reason: host reimage [09:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P21573 and previous config saved to /var/cache/conftool/dbconfig/20220228-091325-ladsgroup.json [09:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:32] !log restarting turnilo to pick up expat security updates [09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:35] !log restarting Hue to pick up expat security updates [09:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:34] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:11] (03PS1) 10Giuseppe Lavagetto: mx: use https when connecting to the mw api [puppet] - 10https://gerrit.wikimedia.org/r/766571 (https://phabricator.wikimedia.org/T287820) [09:26:14] (03PS1) 
10Giuseppe Lavagetto: api: remove monitoring from http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766572 (https://phabricator.wikimedia.org/T244843) [09:26:17] (03PS1) 10Giuseppe Lavagetto: api: remove http endpoint from pybal [puppet] - 10https://gerrit.wikimedia.org/r/766573 (https://phabricator.wikimedia.org/T244843) [09:26:21] (03PS1) 10Giuseppe Lavagetto: api: remove non-https endpoint from backends [puppet] - 10https://gerrit.wikimedia.org/r/766574 (https://phabricator.wikimedia.org/T244843) [09:26:23] (03PS1) 10Giuseppe Lavagetto: appservers: remove monitoring for http-only [puppet] - 10https://gerrit.wikimedia.org/r/766575 (https://phabricator.wikimedia.org/T244843) [09:26:25] (03PS1) 10Giuseppe Lavagetto: appserver: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766576 (https://phabricator.wikimedia.org/T244843) [09:26:27] (03PS1) 10Giuseppe Lavagetto: appserver: remove http pool from backends [puppet] - 10https://gerrit.wikimedia.org/r/766577 (https://phabricator.wikimedia.org/T244843) [09:26:29] (03PS1) 10Giuseppe Lavagetto: conftool: remove http pools for mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/766578 (https://phabricator.wikimedia.org/T244843) [09:26:35] <_joe_> Amir1: ^^ [09:26:44] <_joe_> and don't say I don't do things for you [09:27:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1126.eqiad.wmnet with OS bullseye [09:27:12] (03CR) 10Ladsgroup: [C: 03+1] mx: use https when connecting to the mw api [puppet] - 10https://gerrit.wikimedia.org/r/766571 (https://phabricator.wikimedia.org/T287820) (owner: 10Giuseppe Lavagetto) [09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:30] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300992)', diff saved to https://phabricator.wikimedia.org/P21574 and previous config saved to /var/cache/conftool/dbconfig/20220228-092830-ladsgroup.json [09:28:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:28:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:36] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:56] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:52] (03PS1) 10Ladsgroup: varnish: Block 3.237.242.253 without UA [puppet] - 10https://gerrit.wikimedia.org/r/766580 [09:31:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:10] _joe_: Thanks <3 [09:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T302185)', diff saved to https://phabricator.wikimedia.org/P21575 and previous config saved to /var/cache/conftool/dbconfig/20220228-093212-ladsgroup.json [09:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:19] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [09:34:44] (03CR) 10Ladsgroup: [C: 03+1] api: remove monitoring from http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766572 
(https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:35:08] (03CR) 10Ladsgroup: [C: 03+1] api: remove http endpoint from pybal [puppet] - 10https://gerrit.wikimedia.org/r/766573 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:35:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:37:53] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:39:17] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01809 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:39:42] <_joe_> uhm what's going on there [09:40:12] looks drmrs mostly? 
[09:40:20] drmrs hosts [09:40:41] I don't know why we don't link https://puppetboard.wikimedia.org/nodes?status=failed btw [09:42:48] <_joe_> volans: patches welcome [09:44:11] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001084 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:45:23] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:45:37] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:47:17] (03CR) 10JMeybohm: [C: 03+1] miscweb: Update envoy to 1.15.5-1 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/766208 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [09:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P21576 and previous config saved to /var/cache/conftool/dbconfig/20220228-094717-ladsgroup.json [09:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:44] (03CR) 10JMeybohm: [C: 03+1] "You could drop the line from values-staging.yaml in this patch again (as it will be inherited from values.yaml). 
But fine this way as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/766209 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [09:50:08] (03Abandoned) 10Ladsgroup: varnish: Block 3.237.242.253 without UA [puppet] - 10https://gerrit.wikimedia.org/r/766580 (owner: 10Ladsgroup) [09:50:11] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:21] (03CR) 10Kormat: "Hmm. This seems likely to break some working-assumptions. Can you briefly explain the motivation?" [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [09:50:26] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) I am also experiencing unreliability. Particularly when trying to save edits. In logstash I am seein... [09:50:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:50:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300992)', diff saved to https://phabricator.wikimedia.org/P21577 and previous config saved to /var/cache/conftool/dbconfig/20220228-095056-ladsgroup.json [09:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:07] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:54:31] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for anycast-healthchecker [puppet] - 
10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) [09:54:41] (03PS1) 10Ladsgroup: db1114: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766582 (https://phabricator.wikimedia.org/T302185) [09:57:39] (03PS1) 10Kormat: dbtools: Use section in deployment cal template. [software] - 10https://gerrit.wikimedia.org/r/766584 [09:58:06] (03PS1) 10Ladsgroup: Revert "db1126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766140 [09:58:18] (03PS2) 10Kormat: dbtools/switchover-tmpl.sh Use section in deployment cal template. [software] - 10https://gerrit.wikimedia.org/r/766584 [09:58:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mx: use https when connecting to the mw api [puppet] - 10https://gerrit.wikimedia.org/r/766571 (https://phabricator.wikimedia.org/T287820) (owner: 10Giuseppe Lavagetto) [09:59:21] (03CR) 10Ladsgroup: [C: 03+2] dbtools/switchover-tmpl.sh Use section in deployment cal template. [software] - 10https://gerrit.wikimedia.org/r/766584 (owner: 10Kormat) [09:59:36] (03PS2) 10Ladsgroup: Revert "db1126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766140 [09:59:40] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1126: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766140 (owner: 10Ladsgroup) [09:59:59] (03Merged) 10jenkins-bot: dbtools/switchover-tmpl.sh Use section in deployment cal template. 
[software] - 10https://gerrit.wikimedia.org/r/766584 (owner: 10Kormat) [10:00:05] (03PS2) 10Ladsgroup: db1114: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766582 (https://phabricator.wikimedia.org/T302185) [10:00:09] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1114: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766582 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [10:02:17] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) [10:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P21578 and previous config saved to /var/cache/conftool/dbconfig/20220228-100221-ladsgroup.json [10:02:25] (03PS1) 10Majavah: P:configmaster: parametrise server names [puppet] - 10https://gerrit.wikimedia.org/r/766585 [10:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:46] (03PS2) 10Majavah: P:configmaster: parametrise server names [puppet] - 10https://gerrit.wikimedia.org/r/766585 [10:06:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34001/console" [puppet] - 10https://gerrit.wikimedia.org/r/766585 (owner: 10Majavah) [10:06:51] (03CR) 10Emil Chetty: [C: 03+1] "Looks good to me :)" [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [10:09:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300992)', diff saved to https://phabricator.wikimedia.org/P21579 and previous config saved to /var/cache/conftool/dbconfig/20220228-100933-ladsgroup.json [10:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:40] T300992: Add 
linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:09:57] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:17:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T302185)', diff saved to https://phabricator.wikimedia.org/P21580 and previous config saved to /var/cache/conftool/dbconfig/20220228-101726-ladsgroup.json [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [10:17:36] (03PS1) 10Vgutierrez: site: Reimage cp1088 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766586 (https://phabricator.wikimedia.org/T290005) [10:18:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:18:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [10:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T302185)', diff saved to https://phabricator.wikimedia.org/P21581 and previous config saved to /var/cache/conftool/dbconfig/20220228-101815-ladsgroup.json [10:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:19] (03CR) 10Ayounsi: "I don't know what kind of safety nets there are with service_auto_restart." 
[puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:23:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1114.eqiad.wmnet with OS bullseye [10:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21582 and previous config saved to /var/cache/conftool/dbconfig/20220228-102438-ladsgroup.json [10:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] (03PS1) 10Elukey: install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) [10:27:32] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1088 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766586 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:28:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1088.eqiad.wmnet with OS buster [10:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:38] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster [10:31:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1114.eqiad.wmnet with reason: host reimage [10:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:46] (03PS2) 10Elukey: install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) [10:33:37] 10SRE, 10Traffic: Remove 
component/varnish6 repo reference in Varnish Test Dockerfile - https://phabricator.wikimedia.org/T302579 (10MMandere) 05Open→03Resolved a:03MMandere Varnish Containerized test correctly pulls packages from `main` component and has dropped `component/varnish6` from the repolist. [10:34:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1114.eqiad.wmnet with reason: host reimage [10:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:29] (03PS3) 10Elukey: install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) [10:35:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:36:51] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for auditd [puppet] - 10https://gerrit.wikimedia.org/r/766589 (https://phabricator.wikimedia.org/T135991) [10:38:02] (03PS4) 10Elukey: install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) [10:38:11] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Vgutierrez) ` root@deployment-cache-text06:/var/log/trafficserver# for i in {1..5}; do nc -zv -w 5 deployment-med... 
[10:39:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P21583 and previous config saved to /var/cache/conftool/dbconfig/20220228-103942-ladsgroup.json [10:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:38] (03PS1) 10AikoChou: httpbb: Add some tests for ores [puppet] - 10https://gerrit.wikimedia.org/r/766590 [10:42:32] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:43:10] (03PS2) 10AikoChou: httpbb: Add some tests for ores [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) [10:44:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1088.eqiad.wmnet with reason: host reimage [10:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:48:07] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1088.eqiad.wmnet with reason: host reimage [10:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1114.eqiad.wmnet with OS bullseye [10:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:53:10] (03CR) 10Muehlenhoff: Add Cumin alias to match core-test role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765562 (owner: 10Muehlenhoff) [10:54:19] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300992)', diff saved to https://phabricator.wikimedia.org/P21584 and previous config saved to /var/cache/conftool/dbconfig/20220228-105447-ladsgroup.json [10:54:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:54:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:54] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [10:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:04] (03CR) 10Elukey: httpbb: Add some tests for ores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [10:55:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T302185)', diff saved to https://phabricator.wikimedia.org/P21585 and previous config saved to 
/var/cache/conftool/dbconfig/20220228-105716-ladsgroup.json [10:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [10:57:30] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Vgutierrez) apache2 is currently screaming on deploiyment-mediawiki11: ` Feb 28 10:54:28 deployment-mediawiki11 a... [10:58:02] (03CR) 10Filippo Giunchedi: logstash: add blackbox-exporter filter config (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:58:04] (03PS4) 10Filippo Giunchedi: logstash: add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) [10:58:59] XioNoX: please LMK when good to restore the smokeping config for drmrs [10:59:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:37] (JobUnavailable) firing: (3) Reduced availability for job cache_envoy in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:02:46] that's expected :) [11:09:40] !log pool cp1088 running HAProxy as TLS termination layer - T290005 T271421 [11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:49] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:09:50] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [11:10:44] (03PS2) 
10Muehlenhoff: Enable profile::auto_restarts::service for auditd [puppet] - 10https://gerrit.wikimedia.org/r/766589 (https://phabricator.wikimedia.org/T135991) [11:12:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P21586 and previous config saved to /var/cache/conftool/dbconfig/20220228-111221-ladsgroup.json [11:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1088.eqiad.wmnet with OS buster [11:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster c... 
[11:13:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/766589 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:14:09] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [11:16:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:16:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300992)', diff saved to https://phabricator.wikimedia.org/P21587 and previous config saved to /var/cache/conftool/dbconfig/20220228-111700-ladsgroup.json [11:17:01] PROBLEM - puppet last run on deneb is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:10] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:21:12] (03PS1) 10Vgutierrez: site: Reimage cp5011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766597 (https://phabricator.wikimedia.org/T290005) [11:21:34] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) [11:22:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:27:26] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P21588 and previous config saved to /var/cache/conftool/dbconfig/20220228-112726-ladsgroup.json [11:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:27] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5011 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/766597 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [11:29:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5011.eqsin.wmnet with OS buster [11:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:31] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster [11:33:23] (03CR) 10Hnowlan: [C: 03+2] maps: disable kartotherian on maps masters [puppet] - 10https://gerrit.wikimedia.org/r/764353 (https://phabricator.wikimedia.org/T301664) (owner: 10Hnowlan) [11:35:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300992)', diff saved to https://phabricator.wikimedia.org/P21589 and previous config saved to /var/cache/conftool/dbconfig/20220228-113525-ladsgroup.json [11:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:32] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:35:37] (JobUnavailable) firing: (4) Reduced availability for job cache_envoy in eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:40:37] (JobUnavailable) firing: (4) 
Reduced availability for job cache_envoy in eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:41:59] (03CR) 10Volans: "I did a full pass and left some comments inline. Feel free to ping me on IRC if something is not clear and you need any additional informa" [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [11:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T302185)', diff saved to https://phabricator.wikimedia.org/P21590 and previous config saved to /var/cache/conftool/dbconfig/20220228-114230-ladsgroup.json [11:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:43] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [11:46:52] (03CR) 10Hnowlan: [C: 03+2] restbase: change endpoint for deployment-prep to new host [puppet] - 10https://gerrit.wikimedia.org/r/765532 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [11:47:22] (03PS1) 10Ladsgroup: db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766600 (https://phabricator.wikimedia.org/T302185) [11:47:53] (03PS2) 10Ladsgroup: db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766600 (https://phabricator.wikimedia.org/T302185) [11:47:56] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766600 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [11:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21591 and previous config saved to /var/cache/conftool/dbconfig/20220228-115030-ladsgroup.json [11:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:47] (Processor 
usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [11:53:21] (03CR) 10Muehlenhoff: Enable profile::auto_restarts::service for anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:54:33] (03CR) 10Volans: [C: 03+1] "LGTM, 2 minor nits inline, not a blocker." [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [11:55:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5011.eqsin.wmnet with reason: host reimage [11:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5011.eqsin.wmnet with reason: host reimage [11:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:41] (03CR) 10JMeybohm: [C: 03+1] "Nit about one additional Bug # in commit message" [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [12:05:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P21592 and previous config saved to /var/cache/conftool/dbconfig/20220228-120535-ladsgroup.json [12:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:56] (03PS3) 10AikoChou: httpbb: Add some tests for ores [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) [12:06:16] (03PS1) 10Hnowlan: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) [12:14:56] godog: it's good to restore it now [12:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1175 (T300992)', diff saved to https://phabricator.wikimedia.org/P21593 and previous config saved to /var/cache/conftool/dbconfig/20220228-122039-ladsgroup.json [12:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:46] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:22:03] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:41] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia - T290005 [12:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:47] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:24:02] !log pool cp5011 running HAProxy as TLS termination layer - T290005 T271421 [12:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:09] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [12:25:06] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5011.eqsin.wmnet with OS buster [12:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster c... 
[12:30:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [12:30:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [12:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T302185)', diff saved to https://phabricator.wikimedia.org/P21594 and previous config saved to /var/cache/conftool/dbconfig/20220228-123008-ladsgroup.json [12:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:17] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [12:34:04] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.2971 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [12:34:28] here [12:34:34] <_joe_> me too [12:34:44] right when I put food on the table... 
[12:34:46] * volans here too [12:34:57] (03PS1) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [12:35:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:35:07] (03Abandoned) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [12:35:29] I acked the alert [12:35:33] <_joe_> ok so [12:35:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1111.eqiad.wmnet with OS bullseye [12:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:21] here too [12:36:41] <_joe_> [0x00007f869641ec70] query() /srv/mediawiki/php-1.38.0-wmf.23/includes/libs/rdbms/database/DatabaseMysqli.php:49 [12:36:46] here as well [12:36:53] <_joe_> it's slow connecting to some database [12:37:07] <_joe_> I can't be more precise but a dba taking a look there might help [12:37:16] can it be because I depooled one host in s8? 
[12:37:20] that shouldn't be it [12:37:21] <_joe_> possible [12:37:28] <_joe_> take a look at s8 [12:37:35] let me see what's the weight [12:38:10] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5032 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [12:38:12] traffic didn't spike on eqiad BTW [12:38:26] it's a rather large one but I depooled and repooled even bigger weight earlier today [12:38:29] <_joe_> yeah it's a db slowness [12:38:52] <_joe_> and yes it's wikibase [12:39:12] the median and 95th percentile have not recovered yet [12:39:28] <_joe_> latency is still high yes [12:39:33] we are still returning errors too [12:39:36] hmm, okay, should I give different weights to replicas? [12:39:48] <_joe_> so we need to either repool a server or find out which queries are killing us [12:40:01] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:40:10] that db host is halfway through bullseye upgrade [12:40:16] <_joe_> sigh [12:40:24] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2265 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [12:40:29] <_joe_> and indeed [12:40:42] <_joe_> ok where is the slow query log I used to look at on tendril? 
[12:40:51] https://logstash.wikimedia.org/app/dashboards#/view/43fcccd0-4df5-11ec-81e9-e1226573bad4?_g=h@42b0d52&_a=h@26799ee [12:41:19] it's in the term store [12:41:40] those are the current weights for s8 [12:41:40] https://phabricator.wikimedia.org/P21595 [12:41:43] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:42:10] <_joe_> SELECT /* Wikimedia\Rdbms\DatabaseMysqlBase::fetchSecondsSinceHeartbeat */ TIMESTAMPDIFF(MICROSECOND,ts,UTC_TIMESTAMP(N)) AS us_ago FROM heartbeat.heartbeat WHERE shard = 'X' ORDER BY ts DESC LIMIT N [12:42:20] <_joe_> this query is being slow [12:42:21] I've just added the special sections weights too [12:42:35] <_joe_> which tells me that some db is just overwhelmed [12:42:38] I'd increase the weight of db1126 [12:43:01] <_joe_> no [12:43:08] <_joe_> db1126 is the source of all slow queries [12:43:10] <_joe_> AFAICT [12:43:51] oh yeah [12:44:04] <_joe_> Amir1: so reduce that for now? 
[12:44:08] sure [12:44:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1111.eqiad.wmnet with reason: host reimage [12:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:33] 7078 connections to db1126 [12:44:45] most of them in sleep from wikidatawiki [12:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Your commit message', diff saved to https://phabricator.wikimedia.org/P21596 and previous config saved to /var/cache/conftool/dbconfig/20220228-124454-ladsgroup.json [12:44:58] s/from/for/ [12:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:04] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2576 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [12:45:15] you need to give it a minute [12:45:31] _joe_: are you sure that the db is slow? 
most conns seem to be in sleep, not running queries [12:46:07] already looking better on grafana fwiw [12:46:10] <_joe_> volans: that's what the slow query dashboard tells me [12:46:20] (03PS1) 10Hokwelum: Bringyour Mirror IP was updated [puppet] - 10https://gerrit.wikimedia.org/r/766617 [12:46:30] yeah load spiked to 40 [12:46:40] (03CR) 10jerkins-bot: [V: 04-1] Bringyour Mirror IP was updated [puppet] - 10https://gerrit.wikimedia.org/r/766617 (owner: 10Hokwelum) [12:46:40] it didn't match a query I did before with cumin [12:46:42] double checking [12:46:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1111.eqiad.wmnet with reason: host reimage [12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:38] it's recovering [12:47:41] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:47:42] yeah confirmed, was a typo on my end, db1126 is the one with most spikes in loadavg [12:48:33] <_joe_> I need to update my bookmarks heh I still had the old tendril stuff :/ [12:48:57] it might be that the host doesn't have them in memory cache [12:49:03] and needs to be warmed up [12:49:19] <_joe_> very probable [12:49:31] <_joe_> but also I guess volans and I can go back to our meals :D [12:49:47] enjoy [12:49:53] _joe_: you should have ended up on the landing page that links to the new dashboard, did that not work? [12:50:01] (an after-meal question) [12:50:09] <_joe_> sobanski: I didn't even try [12:50:13] Ah [12:50:20] <_joe_> I just realized I had the stale link [12:51:04] <_joe_> it works! 
[12:52:49] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:12] (03PS2) 10Hokwelum: Bringyour Mirror IP was updated [puppet] - 10https://gerrit.wikimedia.org/r/766617 [12:54:23] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:30] (03PS1) 10Filippo Giunchedi: Revert "smokeping: temp mute drmrs" [puppet] - 10https://gerrit.wikimedia.org/r/766618 [12:54:58] I'm going to give any s8 replica that got repooled a bit more time before moving to the next host [12:55:26] it already repools them gradually over the course of an hour, but I'll make it even slower [12:56:43] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "smokeping: temp mute drmrs" [puppet] - 10https://gerrit.wikimedia.org/r/766618 (owner: 10Filippo Giunchedi) [12:56:45] (03CR) 10ArielGlenn: [C: 03+2] Bringyour Mirror IP was updated [puppet] - 10https://gerrit.wikimedia.org/r/766617 (owner: 10Hokwelum) [12:57:17] XioNoX: {{done}} [12:57:19] so the situation is resolved, sorry for this [12:57:45] no worries Amir1, it happens [12:58:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:58:50] I do it serial, and it never caused issues in other section but s8 is special :/ [12:59:04] godog: thanks [13:01:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1111.eqiad.wmnet with OS bullseye [13:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:18] Amir1: _joe_: I guess this warrants an incident report? [13:01:57] I can try writing something [13:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T302185)', diff saved to https://phabricator.wikimedia.org/P21597 and previous config saved to /var/cache/conftool/dbconfig/20220228-130644-ladsgroup.json [13:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:51] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [13:09:44] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7707 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [13:09:51] (03CR) 10Filippo Giunchedi: WIP: new module alertmanager (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:09:53] (03PS4) 10Filippo Giunchedi: WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [13:13:43] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:14:32] !log 
restarting apache on puppet masters to pick up expat security update [13:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: new module alertmanager [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P21598 and previous config saved to /var/cache/conftool/dbconfig/20220228-132148-ladsgroup.json [13:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:42] (03CR) 10Elukey: "Just a nit about extra spaces, but I think we should be good!" [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [13:33:41] (03PS5) 10Elukey: install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) [13:33:50] (03CR) 10Elukey: install_server: set new partman recipe for new k8s nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [13:34:05] Is Logstash non-functional for anyone else with Firefox? (I'm on Firefox 99) [13:35:23] kostajh: hi :) it works for me, do you mean logstash.wikimedia.org? If so, what do you get as non-functional part? [13:36:23] elukey: I see empty pages for any dashboard. Works in Safari. Maybe it's something with Firefox Nightly, I don't know. No errors in the browser console. 
[13:36:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P21599 and previous config saved to /var/cache/conftool/dbconfig/20220228-133653-ladsgroup.json [13:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:59] I just checked the ORES dashboard and it works for me [13:37:16] (firefox from bullseye) [13:37:38] (03CR) 10Elukey: [C: 03+2] install_server: set new partman recipe for new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/766588 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [13:37:46] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01518 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:37:49] elukey: ok, thanks for verifying :) [13:37:55] np!! [13:39:53] (03PS5) 10Filippo Giunchedi: WIP: new modules alertmanager / alerting [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [13:40:22] the puppet alert is caused by the apache restarts on puppet masters, will recover soon [13:41:17] moritzm: maybe we could run https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [13:41:30] (03CR) 10Filippo Giunchedi: WIP: new modules alertmanager / alerting (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:46:03] (03CR) 10Volans: "reply to question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:46:26] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:47:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: new modules alertmanager / alerting [software/spicerack] - 
10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:48:09] volans: doesn't really make a difference, usually these restarts don't even trip the alert threshold, but with various drmrs ones failing, it went above it this time [13:48:50] puppetboard was reporting 47 hosts failing when I looked few minutes ago [13:48:54] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:58] is down to 41 no [13:49:00] *now [13:49:22] and there are no drmrs hosts currently failing [13:50:30] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2018.codfw.wmnet with OS bullseye [13:50:31] moritzm: I see failures as early as 14:19 and as late as 14:35 [13:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:51] if that helps to pinpoint something [13:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T302185)', diff saved to https://phabricator.wikimedia.org/P21600 and previous config saved to /var/cache/conftool/dbconfig/20220228-135158-ladsgroup.json [13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:05] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185 [13:54:06] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766764 (https://phabricator.wikimedia.org/T287034) [13:54:21] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766764 (https://phabricator.wikimedia.org/T287034) (owner: 10Kosta Harlan) [13:55:06] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01193 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:56:14] (03PS6) 10Filippo 
Giunchedi: WIP: new modules alertmanager / alerting [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [13:56:54] (03CR) 10Filippo Giunchedi: WIP: new modules alertmanager / alerting (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [13:57:55] (03PS1) 10David Caro: wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 [13:57:57] (03PS1) 10David Caro: wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 [13:58:23] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766764 (https://phabricator.wikimedia.org/T287034) (owner: 10Kosta Harlan) [13:59:50] (03Abandoned) 10Muehlenhoff: Add drmrs to Hiera list of datacentres [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [14:00:02] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@75e8eb7]: (no justification provided) [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:16] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [14:00:16] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@75e8eb7]: (no justification provided) (duration: 00m 14s) [14:00:19] indeed, nothing to do [14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] yup, looks like nothing to do [14:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:53] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (owner: 10David Caro) [14:00:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (owner: 10David Caro) [14:02:15] (03CR) 10jerkins-bot: [V: 04-1] WIP: new modules alertmanager / alerting [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [14:03:56] !log update gitlab-ce to 14.7.4 on all GitLab hosts [14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:30] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:05:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2018.codfw.wmnet with reason: host reimage [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:56] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0005423 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:07:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
kubernetes2018.codfw.wmnet with reason: host reimage [14:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:38] (03PS4) 10AikoChou: httpbb: Add some tests for ores [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) [14:09:50] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [14:09:53] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:19] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003796 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:13:39] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:15:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34002/console" [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [14:16:42] (03CR) 10Elukey: [C: 03+1] "LGTM! 
Added also Reuven for a final pass :)" [puppet] - 10https://gerrit.wikimedia.org/r/766590 (https://phabricator.wikimedia.org/T300195) (owner: 10AikoChou) [14:18:34] (03PS3) 10JMeybohm: deployment-prep: install php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578) (owner: 10Giuseppe Lavagetto) [14:18:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2018.codfw.wmnet with OS bullseye [14:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:34] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2019.codfw.wmnet with OS bullseye [14:20:37] (03PS4) 10JMeybohm: deployment-prep: install php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578) (owner: 10Giuseppe Lavagetto) [14:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:09] (03PS8) 10JHathaway: Add SmartNotHealthy to monitor for disk smart alerts [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [14:27:22] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34003/console" [puppet] - 10https://gerrit.wikimedia.org/r/766572 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:27:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:28:45] (03CR) 10Ssingh: [C: 03+1] "Thank you for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/766589 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:29:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34004/console" [puppet] - 10https://gerrit.wikimedia.org/r/766572 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:32:10] (03PS1) 10Ladsgroup: Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766786 [14:32:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:32:46] (03CR) 10JHathaway: [C: 03+2] Add SmartNotHealthy to monitor for disk smart alerts [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [14:33:04] (03PS2) 10Ladsgroup: Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766786 [14:33:09] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/766786 (owner: 10Ladsgroup) [14:33:52] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd-staging2001.codfw.wmnet [14:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:51] (03PS1) 10Vgutierrez: haproxy::tls_terminator: Log Host header [puppet] - 10https://gerrit.wikimedia.org/r/766770 (https://phabricator.wikimedia.org/T290005) [14:34:53] (03PS1) 10Vgutierrez: mtail::cache_haproxy: Provide haproxy_client_healthcheck_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/766771 (https://phabricator.wikimedia.org/T290005) [14:35:13] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage 
[14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:22] (03Merged) 10jenkins-bot: Add SmartNotHealthy to monitor for disk smart alerts [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [14:36:43] (03PS1) 10Ladsgroup: db1104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766772 (https://phabricator.wikimedia.org/T302185) [14:36:58] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Provide haproxy_client_healthcheck_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/766771 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:37:16] (03PS2) 10Ladsgroup: db1104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766772 (https://phabricator.wikimedia.org/T302185) [14:37:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2019.codfw.wmnet with reason: host reimage [14:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:04] (03CR) 10Ladsgroup: [C: 03+2] db1104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/766772 (https://phabricator.wikimedia.org/T302185) (owner: 10Ladsgroup) [14:38:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] api: remove monitoring from http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/766572 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:39:54] (03PS4) 10Krinkle: Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) [14:39:57] (03CR) 10Krinkle: [C: 03+2] Expand log level of DBConnection messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: 10Krinkle) [14:40:50] (03Merged) 10jenkins-bot: Expand log level of DBConnection 
messages from 'error' to 'warning' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/764359 (https://phabricator.wikimedia.org/T281451) (owner: 10Krinkle) [14:43:49] !log klausman@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-etcd-staging2001.codfw.wmnet [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] Urgh [14:44:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:09] (03CR) 10Ssingh: Enable profile::auto_restarts::service for anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:45:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] deployment-prep: install php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578) (owner: 10Giuseppe Lavagetto) [14:46:05] (03CR) 10JMeybohm: [C: 03+2] deployment-prep: install php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578) (owner: 10Giuseppe Lavagetto) [14:46:16] (03PS7) 10Filippo Giunchedi: WIP: new modules alertmanager / alerting [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 [14:48:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2019.codfw.wmnet with OS bullseye [14:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:06] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2020.codfw.wmnet with OS bullseye [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:06] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:23] (03CR) 10jerkins-bot: [V: 04-1] WIP: 
new modules alertmanager / alerting [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [15:01:25] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36086 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:01:25] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36086 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:01:27] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6013 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36088 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:01:27] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6012 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36088 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:01:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6001 is CRITICAL: 2.098e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6001 [15:01:41] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6006 is CRITICAL: 2.099e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6006 [15:01:48] (03CR) 10Filippo Giunchedi: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [15:01:58] (03PS2) 10David Caro: wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 [15:02:00] (03PS2) 10David Caro: wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 [15:02:02] !log milimetric@deploy1002 
Started deploy [analytics/refinery@84a0770]: Add a few wikis to the sqoop list [15:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:23] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6003 is CRITICAL: 2.099e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6003 [15:02:23] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6009 is CRITICAL: 2.099e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6009 [15:02:35] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6014 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36156 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:02:36] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I616f56388eee9df21e (duration: 00m 49s) [15:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:41] PROBLEM - Check systemd state on durum6001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:43] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6005 is CRITICAL: 2.099e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6005 [15:02:43] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6012 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36164 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:02:43] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36164 seconds left 
https://wikitech.wikimedia.org/wiki/HTTPS [15:02:45] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6014 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36166 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:02:55] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.099e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [15:04:33] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [15:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:43] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.grid.get_cluster_status: improve yaml output [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766765 (owner: 10David Caro) [15:04:45] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/766766 (owner: 10David Caro) [15:04:49] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6013 is CRITICAL: 2.1e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6013 [15:05:05] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6009 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -36306 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:05:05] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6015 is CRITICAL: 2.1e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6015 [15:05:27] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:05:32] (03PS2) 10Giuseppe Lavagetto: api: remove http endpoint from pybal [puppet] - 10https://gerrit.wikimedia.org/r/766573 (https://phabricator.wikimedia.org/T244843) [15:06:20] !log ntsako@deploy1002 Started deploy [airflow-dags/analytics@0a2ffb8]: (no justification provided) [15:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:28] !log ntsako@deploy1002 Finished deploy [airflow-dags/analytics@0a2ffb8]: (no justification provided) (duration: 00m 07s) [15:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2020.codfw.wmnet with reason: host reimage [15:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6007 is CRITICAL: 2.103e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6007 [15:08:38] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6004 is CRITICAL: 2.103e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6004 [15:08:53] (03PS1) 10JMeybohm: Add linkrecommendation listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/766777 (https://phabricator.wikimedia.org/T302719) [15:10:50] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (10ssingh) a:03ssingh [15:11:05] (03CR) 10Herron: rsyslog: add 00-load_modules.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [15:11:14] (03PS3) 
10Herron: rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) [15:12:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add linkrecommendation listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/766777 (https://phabricator.wikimedia.org/T302719) (owner: 10JMeybohm) [15:13:18] ccccccvelljdrrvuchuntdjfbrnifuucnvvinjbgfdvr [15:13:22] (03CR) 10Herron: [C: 03+2] rsyslog: add 00-load_modules.conf [puppet] - 10https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [15:13:28] #yubifail [15:13:32] :p [15:13:50] (03CR) 10Filippo Giunchedi: "CI failures seem unrelated to this change" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (owner: 10Filippo Giunchedi) [15:15:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6002 is CRITICAL: 2.107e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6002 [15:16:35] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.108e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [15:16:46] godog: ack, I'll reopen the upstream bug that was supposed to be fixed in prospector 1.7.3 [15:17:33] volans: ok! 
FWIW I can't reproduce those errors locally with a py39 venv [15:17:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6008 is CRITICAL: 2.108e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6008 [15:17:37] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6010 is CRITICAL: 2.108e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6010 [15:17:38] haven't checked the versions though [15:17:38] as for bandit, there was a new release today... :( [15:17:39] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6015 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -37060 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:18:04] ah mhh maybe I should nuke my local venv [15:18:05] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (12) node(s) change every puppet run: cp6002, cp6008, cp6010, cp6016, lvs6002, cp6003, lvs6001, cp6005, ganeti6002, cloudcontrol1005, cloudcontrol1003, cloudcontrol1004 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:18:27] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host kubernetes2020.codfw.wmnet with OS bullseye [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:34] looks at cp6 puppets, etc [15:18:37] *looking [15:18:45] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6009 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has -37125 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:19:53] OCSP alerts are triggered by puppet not being able to run there [15:19:55] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6015 is CRITICAL: SSL 
CRITICAL - OCSP staple validity for wikiworkshop.org has -37196 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [15:19:56] yeah [15:20:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34005/console" [puppet] - 10https://gerrit.wikimedia.org/r/766777 (https://phabricator.wikimedia.org/T302719) (owner: 10JMeybohm) [15:20:10] we had a transport outage there over the weekend, which seems at least partially resolved [15:20:17] but not completely, apparently [15:20:44] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Add linkrecommendation listener to service-proxy [puppet] - 10https://gerrit.wikimedia.org/r/766777 (https://phabricator.wikimedia.org/T302719) (owner: 10JMeybohm) [15:23:07] 10SRE, 10Observability-Logging, 10User-ema: rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10herron) To complicate matters, rsyslog also appears to throw errors when a module is loaded but not actively used, e.g.: ` $ /usr/sbin/rsyslogd -N1 -f /etc/rsyslog.conf r... 
[15:23:19] !log milimetric@deploy1002 Finished deploy [analytics/refinery@84a0770]: Add a few wikis to the sqoop list (duration: 21m 18s) [15:23:20] yeah I donno, drmrs transport is still very flaky at least [15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:55] (03PS1) 10Herron: Revert "rsyslog: add 00-load_modules.conf" [puppet] - 10https://gerrit.wikimedia.org/r/766789 [15:25:14] (03PS2) 10Herron: Revert "rsyslog: add 00-load_modules.conf" [puppet] - 10https://gerrit.wikimedia.org/r/766789 [15:25:49] !log milimetric@deploy1002 Started deploy [analytics/refinery@84a0770] (thin): Add a few wikis to the sqoop list [15:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] !log milimetric@deploy1002 Finished deploy [analytics/refinery@84a0770] (thin): Add a few wikis to the sqoop list (duration: 00m 08s) [15:25:58] (03PS3) 10Herron: Revert "rsyslog: add 00-load_modules.conf" [puppet] - 10https://gerrit.wikimedia.org/r/766789 (https://phabricator.wikimedia.org/T292175) [15:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:15] !log milimetric@deploy1002 Started deploy [analytics/refinery@84a0770] (hadoop-test): Add a few wikis to the sqoop list [15:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:17] (03PS1) 10JMeybohm: Use service-proxy to connect to linkrecommendation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766780 (https://phabricator.wikimedia.org/T302719) [15:27:54] (03CR) 10Herron: [C: 03+2] Revert "rsyslog: add 00-load_modules.conf" [puppet] - 10https://gerrit.wikimedia.org/r/766789 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [15:28:23] (03PS2) 10Vgutierrez: mtail::cache_haproxy: Provide haproxy_client_healthcheck_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/766771 (https://phabricator.wikimedia.org/T290005) [15:29:22] (03PS1) 10Hashar: gerrit: on login page 
add link to reset password [puppet] - 10https://gerrit.wikimedia.org/r/766781 (https://phabricator.wikimedia.org/T60205) [15:30:28] (03CR) 10Hashar: "I am going through Phabricator tasks against #gerrit and that one looks like an easy fix now that GerritSiteFooter.html is only used for t" [puppet] - 10https://gerrit.wikimedia.org/r/766781 (https://phabricator.wikimedia.org/T60205) (owner: 10Hashar) [15:30:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:06] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere) Some dashboards, e.g https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1 have their datasource set to `[eqiad codfw] prometheus/global` and contains user defi... [15:33:32] !log milimetric@deploy1002 Finished deploy [analytics/refinery@84a0770] (hadoop-test): Add a few wikis to the sqoop list (duration: 07m 16s) [15:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:49] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6011 is CRITICAL: 2.118e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6011 [15:37:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:49] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:40:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Use service-proxy to connect to linkrecommendation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766780 (https://phabricator.wikimedia.org/T302719) (owner: 10JMeybohm) [15:44:33] !log 
pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:27] (03CR) 10MSantos: [C: 03+1] maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [15:46:47] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:09] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail::cache_haproxy: Provide haproxy_client_healthcheck_ttfb histogram [puppet] - 10https://gerrit.wikimedia.org/r/766771 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:47:19] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Volans) Because people have been randomly running cookbooks from cumin2001 with unknown results, I've manually edited `/usr/bin/cookbook` to prevent execution for now. 
[15:48:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:26] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2021.codfw.wmnet with OS bullseye [15:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:34] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: on login page add link to reset password [puppet] - 10https://gerrit.wikimedia.org/r/766781 (https://phabricator.wikimedia.org/T60205) (owner: 10Hashar) [15:52:05] 10SRE, 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul) [15:52:13] 10SRE, 10ops-codfw, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q3): Decom centrallog2001 - https://phabricator.wikimedia.org/T298994 (10Papaul) 05Open→03Resolved [15:52:47] (Processor usage over 85%) firing: (3) Alert for device scs-a1-codfw.mgmt.codfw.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [15:52:47] !log klausman@cumin2002 START - Cookbook sre.hosts.decommission for hosts ml-etcd-staging2001 [15:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:06] !log klausman@cumin2002 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts ml-etcd-staging2001 [15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:22] !log klausman@cumin2002 START - Cookbook sre.hosts.decommission for hosts ml-etcd-staging2001 [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:13] (03CR) 10MSantos: Disable triggering tile pregeneration on OSM syncs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753111 (https://phabricator.wikimedia.org/T298246) (owner: 10Jgiannelos) [15:54:16] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:54:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:19] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6002 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikispecies.net has 25481 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:55:23] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikispecies.net has 25476 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:55:45] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6002 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikimedia.is has 3854 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:55:50] (03PS1) 10Volans: sre.ganeti.makevm: confirm on dns failure [cookbooks] - 10https://gerrit.wikimedia.org/r/766785 [15:55:52] (03PS1) 10Volans: sre.hosts.decommission: convert call to dns [cookbooks] - 10https://gerrit.wikimedia.org/r/766806 [15:56:07] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6001 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikimedia.is has 3832 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [15:56:57] !log rolling upgrade to HAProxy 2.4.14 on HAProxy caching nodes - T290005 [15:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:03] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:58:47] !log klausman@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ml-etcd-staging2001 [15:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:31] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:48] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host an-worker1147.mgmt.eqiad.wmnet 
with reboot policy FORCED [15:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:01] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:37] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage [16:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:46] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) >>! In T276589#7727280, @Joe wrote: > Any update on this? This upgrade is blocking serviceops who needs bullseye for the kubernetes python libraries and cookbooks. I'm working on the... [16:07:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2021.codfw.wmnet with reason: host reimage [16:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:58] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [16:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:43] !log klausman@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:09] (03PS1) 10Elukey: Enable overlayfs for kubernetes20[18-22] [puppet] - 10https://gerrit.wikimedia.org/r/766808 (https://phabricator.wikimedia.org/T302208) [16:17:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2021.codfw.wmnet with OS bullseye [16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:58] (03CR) 10Klausman: [C: 03+1] sre.ganeti.makevm: 
confirm on dns failure [cookbooks] - 10https://gerrit.wikimedia.org/r/766785 (owner: 10Volans) [16:19:08] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: confirm on dns failure [cookbooks] - 10https://gerrit.wikimedia.org/r/766785 (owner: 10Volans) [16:19:37] (03PS2) 10Elukey: Enable overlayfs for kubernetes20[18-22] [puppet] - 10https://gerrit.wikimedia.org/r/766808 (https://phabricator.wikimedia.org/T302208) [16:21:03] (03CR) 10Elukey: "Janis: we can enable overlay and then modify manually grub's config and reboot (since the puppet code for that is in the kubernetes worker" [puppet] - 10https://gerrit.wikimedia.org/r/766808 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [16:22:56] (03Merged) 10jenkins-bot: sre.ganeti.makevm: confirm on dns failure [cookbooks] - 10https://gerrit.wikimedia.org/r/766785 (owner: 10Volans) [16:25:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (14) node(s) change every puppet run: cp6002, deneb, lvs6002, cloudcontrol1005, cloudcontrol1003, cp6016, netflow6001, ganeti6002, cp6010, cp6003, cloudcontrol1004, lvs6001, cp6008, cp6005 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:27:31] !log klausman@cumin2002 START - Cookbook sre.ganeti.makevm for new host ml-staging-etcd2001.codfw.wmnet [16:27:32] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [16:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T1630). 
[16:32:33] (03CR) 10JMeybohm: [C: 03+1] Enable overlayfs for kubernetes20[18-22] [puppet] - 10https://gerrit.wikimedia.org/r/766808 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [16:32:35] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:40] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:59] !log klausman@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:33:02] !log klausman@cumin2002 START - Cookbook sre.dns.netbox [16:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:48] (03PS2) 10Ryan Kemper: Replace Swift native API with S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski) [16:35:10] (03CR) 10AGueyte: Update Event Stream for IPInfo events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [16:37:19] (03PS17) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [16:37:35] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:03] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:42:09] (03CR) 10Muehlenhoff: Enable profile::auto_restarts::service for anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 
10Muehlenhoff) [16:42:22] (03Abandoned) 10Muehlenhoff: Enable profile::auto_restarts::service for anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/766581 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:42:33] !log klausman@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:42:34] !log klausman@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-staging-etcd2001.codfw.wmnet [16:42:36] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for auditd [puppet] - 10https://gerrit.wikimedia.org/r/766589 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:48] !log rebooting scs-a1-codfw to clear librenms alert [16:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:31] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:59] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:51:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:47] (Processor usage over 85%) firing: (3) Device scs-a1-codfw.mgmt.codfw.wmnet recovered from Processor usage over 85% - https://alerts.wikimedia.org [16:53:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] !log 
elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2022.codfw.wmnet with OS bullseye [16:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:27] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1002.eqiad.wmnet [16:56:28] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [16:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:07] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:27] (03PS1) 10Volans: bandit: ignore hardcoded password in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/766813 [17:00:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:34] (03CR) 10Volans: "This should fix the bandit issue. 
As for CI will fail on prospector, I've re-commented on https://github.com/PyCQA/prospector/issues/491 T" [software/spicerack] - 10https://gerrit.wikimedia.org/r/766813 (owner: 10Volans) [17:02:44] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/766814 [17:05:20] !log razzi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:05:24] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [17:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] (03CR) 10jerkins-bot: [V: 04-1] bandit: ignore hardcoded password in tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/766813 (owner: 10Volans) [17:06:29] (03CR) 10Dzahn: [C: 03+2] gerrit: on login page add link to reset password [puppet] - 10https://gerrit.wikimedia.org/r/766781 (https://phabricator.wikimedia.org/T60205) (owner: 10Hashar) [17:07:25] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:29] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [17:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] (03PS1) 10AOkoth: otrs: rename module variables [puppet] - 10https://gerrit.wikimedia.org/r/766815 (https://phabricator.wikimedia.org/T293942) [17:11:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2022.codfw.wmnet with reason: host reimage [17:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:47] (03PS1) 10David Caro: wmcs-cinder-backups: Increase timeout and decrease frequency [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) [17:14:50] (03PS2) 10AOkoth: otrs: rename module variables 
[puppet] - 10https://gerrit.wikimedia.org/r/766815 (https://phabricator.wikimedia.org/T293942) [17:15:23] !log razzi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:15:24] !log razzi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host datahubsearch1002.eqiad.wmnet [17:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:20] razzi: they will fail, because of the problems in drmrs, I'm making a patch to make them work again [17:16:42] see -dcops for context (last few minutes) [17:17:01] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34007/" [puppet] - 10https://gerrit.wikimedia.org/r/766815 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:18:56] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) Hi! This resurfaced during the weekend. It is not a single issue (despite appearances), rather the message **"upstream con... 
[17:20:19] (03PS1) 10Volans: sre.dns.netbox: confirm on failure on the authdns [cookbooks] - 10https://gerrit.wikimedia.org/r/766818 [17:21:15] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:30] !log manual trigger of cirrus SaneitizeJobs for with 2hr refresh [17:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2022.codfw.wmnet with OS bullseye [17:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:25] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6013 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 387454 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:22:32] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:22:53] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: convert call to dns [cookbooks] - 10https://gerrit.wikimedia.org/r/766806 (owner: 10Volans) [17:22:59] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 387420 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:23:01] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir6001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 214618 seconds left:Certificate wikimedia.is valid until 2022-04-22 16:07:29 +0000 (expires 
in 52 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:23:12] (03CR) 10Volans: [C: 03+2] "As agreed on IRC, proceeding to unblock deploys" [cookbooks] - 10https://gerrit.wikimedia.org/r/766818 (owner: 10Volans) [17:23:14] (03CR) 10Dzahn: [C: 03+1] "compiler output looks good and since the variables are already renamed in the profile, the lookup should not be influenced." [puppet] - 10https://gerrit.wikimedia.org/r/766815 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:23:17] (03CR) 10David Caro: [C: 04-1] wmcs-cinder-backups: Increase timeout and decrease frequency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766816 (https://phabricator.wikimedia.org/T302720) (owner: 10David Caro) [17:23:41] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01437 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:25:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:01] (03Merged) 10jenkins-bot: sre.hosts.decommission: convert call to dns [cookbooks] - 10https://gerrit.wikimedia.org/r/766806 (owner: 10Volans) [17:26:55] (03Merged) 10jenkins-bot: sre.dns.netbox: confirm on failure on the authdns [cookbooks] - 10https://gerrit.wikimedia.org/r/766818 (owner: 10Volans) [17:27:49] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir6001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 235932 seconds left:Certificate *.wikispecies.net valid until 2022-04-15 10:01:02 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:28:16] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:07] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir6002 is OK: SSL OK - OCSP staple validity for 
www.wikispecies.net has 235793 seconds left:Certificate *.wikispecies.net valid until 2022-04-15 10:01:02 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:30:15] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir6002 is OK: SSL OK - OCSP staple validity for wikimedia.is has 214184 seconds left:Certificate wikimedia.is valid until 2022-04-22 16:07:29 +0000 (expires in 52 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:30:35] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6009 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386964 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:30:35] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6014 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386964 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:30:55] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6012 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386945 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:31:11] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6009 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386928 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:31:12] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6012 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386928 seconds left:Certificate 
wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:31:37] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:51] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6014 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386889 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:33:17] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:33:25] (ProbeHttpFailed) resolved: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [17:34:01] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6011 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386759 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:34:50] (03CR) 10AOkoth: [C: 03+2] otrs: rename module variables [puppet] - 10https://gerrit.wikimedia.org/r/766815 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:35:09] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001596 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet 
[17:35:20] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:39] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1002.eqiad.wmnet [17:35:41] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [17:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:25] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6011 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386615 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:38:37] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6015 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386483 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:38:51] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:40:15] !log rolling restart of anycast-hc.service on doh* hosts for security updates [17:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:47] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp6015 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386352 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 34 days) https://wikitech.wikimedia.org/wiki/HTTPS [17:40:57] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:46] (03CR) 10Herron: "Since https://gerrit.wikimedia.org/r/c/operations/puppet/+/766789/ was a dud, trying a different approach" [puppet] - 10https://gerrit.wikimedia.org/r/766814 (https://phabricator.wikimedia.org/T292175) (owner: 10Herron) [17:42:49] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:01] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:44:52] (03CR) 10Eevans: [C: 03+1] Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [17:45:33] !log rolling restart of anycast-hc.service on durum* hosts for security updates [17:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:01] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:14] !log lvs1017-20 (all eqiad lvs) - stopping puppet to attempt deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/765311 [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:21] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:44] ^ any ideas what is this about? 
[17:48:57] I thought this was because of the anycast-hc restarts I did but it doesn't seem to be [17:51:41] (03PS3) 10BBlack: eqiad lvs: add interfaces and IPs for rows E and F [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) [17:51:44] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1147.mgmt.eqiad.wmnet with reboot policy FORCED [17:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:33] (03CR) 10BBlack: [C: 03+2] eqiad lvs: add interfaces and IPs for rows E and F [puppet] - 10https://gerrit.wikimedia.org/r/765311 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [17:54:08] (03CR) 10Herron: "Updated to support site name in the public subdomain, and instance name at the root. LMKWYT" [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:54:25] sukhe: I'm considering you guilty until proven innocent :P [17:54:33] hahah [17:54:33] but give me a sec let me check what's up [17:54:41] * sukhe works on his case [17:55:10] on the cr3-ulsfo router I see a lot of logs related to durum nodes [17:55:15] What's "anycast-hc" ? [17:55:21] anycast-healthchecker [17:55:44] Would that have restarted Bird? [17:55:50] if anycast-hc dies, bird dies, yeah I think [17:55:55] Either way all seems to be ok right now. [17:55:58] yes but I don't think it should have died [17:56:03] hmm [17:56:07] that's a bit weird though [17:56:08] it's intentional [17:56:25] because if anycast-hc is dead (even temporarily), then we don't know if the service is alive, and therefore need to withdraw the route [17:56:40] But I can see all Anycast sessions apart from dns4001/dns4002 are less than 15 mins old [17:56:47] that seems about right [17:56:50] https://www.irccloud.com/pastebin/INsvwzsv/ [17:57:01] yeah so I'd say it just reset and triggered the alert, but status is ok now. 
[17:57:02] bblack: so the temporary part is probably the time between restarts? [17:57:44] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) @CDanis I had not, here is my attempt at a comparison between git submodules and subtrees == Git subtrees == With the `git-subtree(1)` command you merge another repository into yo... [17:57:45] topranks: I think that sukhe is proven guilty this time [17:57:57] :D [17:58:05] haha yep he is going down :D [17:58:09] * sukhe prepares for sentencing [17:58:15] elukey: thanks for jumping in too btw :) [17:58:20] let me just grab the delicious lunch I made [17:58:31] happy to go down after it :P [17:58:50] (03CR) 10Elukey: [C: 03+2] Enable overlayfs for kubernetes20[18-22] [puppet] - 10https://gerrit.wikimedia.org/r/766808 (https://phabricator.wikimedia.org/T302208) (owner: 10Elukey) [17:59:05] convicts last meal :) [17:59:10] !log phabricator/diffusion - disable http and ssh URIs for source repo "iltools" - T296022 - https://commons.wikimedia.org/wiki/User_talk:Inductiveload#c-Inductiveload-2022-02-25T22%3A26%3A00.000Z-Mutante-2022-02-25T20%3A37%3A00.000Z [17:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:16] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [17:59:17] maybe a better question is: why do those router alerts never show recoveries on IRC? [17:59:23] btw there was a warning in Icinga about that router / check already (external peering - separate) [17:59:44] My gut feeling is that a CRITICAL -> WARNING level status change doesn't fire any recovery here. [17:59:48] It would need to go OK for that. 
[17:59:49] another question would be that perhaps the time to alert can be after a one second failure after two tries or something, similar to what we do for bird [17:59:53] that's just a guess though [18:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T1800). [18:00:27] the WARNs are filtered, that seems the correct reason [18:00:36] ok yeah [18:00:52] And we often have a WARN status on those BGP due to IXP peers with routers down, maintenance etc. [18:01:25] sukhe: I'd agree on the checks, not 100% sure how they work but that makes sense to me. [18:02:31] topranks: I will see if we can tune it since the false alerts do leave me a bit scared :) [18:02:50] This was a real alert [18:03:07] I agree it maybe shouldn't have fired here though [18:03:25] I meant in the sense that if the service was restarted successfully, then it probably shouldn't alert [18:03:38] (03PS1) 10BBlack: LVS: add new eqiad private tagged_subnets [puppet] - 10https://gerrit.wikimedia.org/r/766824 (https://phabricator.wikimedia.org/T301419) [18:04:31] bblack: just noticed, and sorry should have said. [18:04:37] (03CR) 10BBlack: [C: 03+2] LVS: add new eqiad private tagged_subnets [puppet] - 10https://gerrit.wikimedia.org/r/766824 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [18:04:37] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host datahubsearch1002.eqiad.wmnet [18:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:52] bblack: private1-e4-eqiad and private1-f4-eqiad aren't being configured. [18:04:54] chatops thoughts: if one bot just talked about CRITs and another bot just about WARNs you could have both but each user could filter differently. 
maybe give them appropriate names, serious and paranoid bot [18:05:03] those racks have been repurposed for WMCS [18:05:14] oh [18:05:25] they're still in other puppet data, too [18:05:31] mutante: +1 for that would be good. [18:06:01] topranks: will do a cleanup patch on the lvs parts anyways [18:06:07] bblack: that's my bad I should have said. [18:07:29] (03PS1) 10Razzi: dhcpd: add datahubsearch1002 [puppet] - 10https://gerrit.wikimedia.org/r/766825 (https://phabricator.wikimedia.org/T301383) [18:07:55] hmmm [18:08:31] yeah so aside from LVS, the eqiad private1-[ef]4 networks also exist in: netbox, modules/network/data/data.yaml, and the modules/install_server dhcp bits and related [18:09:37] indeed yeah I'll tidy that up [18:09:47] I removed the lvs17-20 address reservations in netbox, but didn't touch the subnets themselves [18:09:51] was unsure exactly how much to remove given they may be used eventually. [18:10:04] But certainly for right now they don't exist, no point setting up on lvs [18:10:32] thanks, we need a further discussion on whether to remove completely (and from netbox), or keep for when/if we do use them [18:11:08] (03CR) 10Razzi: [C: 03+2] dhcpd: add datahubsearch1002 [puppet] - 10https://gerrit.wikimedia.org/r/766825 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [18:11:59] (03PS1) 10BBlack: Eqiad LVS: remove [ef]4 vlans from config [puppet] - 10https://gerrit.wikimedia.org/r/766826 (https://phabricator.wikimedia.org/T301419) [18:12:55] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Apologies on the confusion around E4/F4." 
[puppet] - 10https://gerrit.wikimedia.org/r/766826 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [18:15:15] (03CR) 10BBlack: [C: 03+2] Eqiad LVS: remove [ef]4 vlans from config [puppet] - 10https://gerrit.wikimedia.org/r/766826 (https://phabricator.wikimedia.org/T301419) (owner: 10BBlack) [18:19:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2007.codfw.wmnet with reason: Remove from Ganeti cluster for decom [18:19:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2007.codfw.wmnet with reason: Remove from Ganeti cluster for decom [18:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:17] !log phabricator/diffusion - disable IO and hide http and ssh URIs for source repo 'word2vec' - it's still possible to pull and push via https (operation/debs/word2vec) - https://phabricator.wikimedia.org/source/word2vec/ - https://en.wikipedia.org/wiki/Word2vec T296022 [18:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:23] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [18:21:34] (03PS2) 10Muehlenhoff: Remove ganeti2007 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/766562 [18:24:18] (03CR) 10Btullis: "I don't know why the helm-lint CI step is failing with a rake error." [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [18:24:38] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti2007 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/766562 (owner: 10Muehlenhoff) [18:25:40] (03CR) 10Btullis: "Still WIP but I don't know how to add comments unless it's marked as active." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [18:26:16] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti2007.codfw.wmnet [18:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [18:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:53] (03PS1) 10Razzi: analytics_cluster::datahub::opensearch: add datahubsearch1002 [puppet] - 10https://gerrit.wikimedia.org/r/766828 (https://phabricator.wikimedia.org/T301383) [18:38:07] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on datahubsearch1002.eqiad.wmnet with reason: Node is being set up for first time and puppet run failed [18:38:08] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 7 days, 0:00:00 on datahubsearch1002.eqiad.wmnet with reason: Node is being set up for first time and puppet run failed [18:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:34] (03CR) 10Razzi: "I added datahubsearch1002 and realized in the process I'd been using "rack" when I meant "row"." [puppet] - 10https://gerrit.wikimedia.org/r/766828 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [18:40:57] (03PS1) 10Urbanecm: Mentor dashboard: Mark mentor-tools as stable [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766794 (https://phabricator.wikimedia.org/T280307) [18:41:22] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) For the record, the logs from k8s-mwdebug pods do show up in Logstash but not on the `mwdebug` dashboard... 
[18:42:03] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Krinkle) >>! In T288164#7742387, @Krinkle wrote: > For the record, the logs from k8s-mwdebug pods do show up in Logstash but not on the mwdebug... [18:45:50] (03CR) 10Razzi: [C: 03+2] analytics_cluster::datahub::opensearch: add datahubsearch1002 [puppet] - 10https://gerrit.wikimedia.org/r/766828 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [18:47:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti2007.codfw.wmnet [18:52:04] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [18:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:46] !log bblack@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:00] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [18:55:14] (03PS1) 10BBlack: Remove netbox includes for eqiad [ef]4 [dns] - 10https://gerrit.wikimedia.org/r/766831 [18:55:52] PROBLEM - configured eth on datahubsearch1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.16.45: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:56:47] (03CR) 10jerkins-bot: [V: 04-1] Remove netbox includes for eqiad [ef]4 [dns] - 10https://gerrit.wikimedia.org/r/766831 (owner: 10BBlack) [18:57:49] 10ops-codfw, 10decommission-hardware: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577 (10MoritzMuehlenhoff) [18:58:41] (03PS1) 10Cathal Mooney: Remove authdns includes for reverse zones Eqiad rack E4/F4 subnets [dns] - 10https://gerrit.wikimedia.org/r/766832 (https://phabricator.wikimedia.org/T299758) [18:58:57] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/766831 (owner: 10BBlack) [18:59:00] (03CR) 10Majavah: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/766831 (owner: 10BBlack) [18:59:54] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.241e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [19:01:30] (03Abandoned) 10BBlack: Remove netbox includes for eqiad [ef]4 [dns] - 10https://gerrit.wikimedia.org/r/766831 (owner: 10BBlack) [19:01:52] (03CR) 10BBlack: [C: 03+1] Remove authdns includes for reverse zones Eqiad rack E4/F4 subnets [dns] - 10https://gerrit.wikimedia.org/r/766832 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:02:07] (03CR) 10Cathal Mooney: [C: 03+2] Remove authdns includes for reverse zones Eqiad rack E4/F4 subnets [dns] - 10https://gerrit.wikimedia.org/r/766832 
(https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:05:24] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [19:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:44] (03CR) 10RLazarus: [C: 03+2] miscweb: Update envoy to 1.15.5-1 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/766208 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:09:55] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:29] (03Merged) 10jenkins-bot: miscweb: Update envoy to 1.15.5-1 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/766208 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:13:35] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [19:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:02] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:20] (03PS1) 10Ebernhardson: wdqs/elastic: Remove icinga checks after moving to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) [19:18:02] (03CR) 10Ottomata: Add a set of charts for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [19:18:22] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1003.eqiad.wmnet [19:18:23] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [19:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) 
[19:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:55] topranks: your changes will likely be picked up by razzi's run above [19:18:59] you started a few seconds apart [19:19:04] only one will succeed :D [19:19:29] yes, it will be fixed when we add locking to spicerack, but not in the next month [19:19:49] haha, not much slips you by Riccardo! [19:19:57] thanks for the heads up :) [19:20:17] so you can probably hit ctrl+c [19:20:21] and in case re-run it after [19:20:22] ok will do [19:20:23] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [19:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] check with razzi if he got your changes too or not ;) [19:23:35] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [19:24:45] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05MoritzMuehlenhoff→03RobH Stealing this as this is blocked on me doing a test install of a host being racked as part of T297151 [19:27:12] (03CR) 10Ebernhardson: "Looking over PCC output i don't think anything here is going to need `ensure => absent`, but not 100% sure." 
[puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [19:27:34] RECOVERY - configured eth on datahubsearch1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:28:05] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:23] (03CR) 10RLazarus: [C: 03+2] miscweb: Update envoy to 1.15.5-1 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/766209 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:34:35] (03PS14) 10Btullis: Add a set of charts for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [19:34:50] (03CR) 10jerkins-bot: [V: 04-1] Add a set of charts for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [19:36:57] (03Merged) 10jenkins-bot: miscweb: Update envoy to 1.15.5-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766209 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [19:38:01] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:06] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10Damiendf) Done =) [19:50:13] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:55] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:50:59] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:28] (03PS1) 10CDanis: add link to status page [software/klaxon] - 10https://gerrit.wikimedia.org/r/766839 [19:51:43] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host datahubsearch1003.eqiad.wmnet [19:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:10] (03PS3) 10Huji: Increase AbuseFilter's emergency disable threshold for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763982 (https://phabricator.wikimedia.org/T302227) [19:55:17] (03PS1) 10RLazarus: kubernetes: Upgrade default envoy version to 1.15.5 [puppet] - 10https://gerrit.wikimedia.org/r/766840 (https://phabricator.wikimedia.org/T300324) [19:58:42] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766841 (https://phabricator.wikimedia.org/T302716) [19:58:51] (03PS1) 10RLazarus: miscweb: Restore envoy image_version to the inherited default [deployment-charts] - 10https://gerrit.wikimedia.org/r/766842 (https://phabricator.wikimedia.org/T300324) [19:59:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10Htriedman) 05In progress→03Declined Kiron is leaving Tumult Labs soon, so this task doesn't need to be completed. 
[19:59:53] (03PS1) 10Razzi: dhcpd: add and configure datahubsearch1003 [puppet] - 10https://gerrit.wikimedia.org/r/766843 (https://phabricator.wikimedia.org/T301383) [20:02:52] (03CR) 10Razzi: [C: 03+2] dhcpd: add and configure datahubsearch1003 [puppet] - 10https://gerrit.wikimedia.org/r/766843 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [20:03:07] !log creating ucoc_edits table on each wiki for elections voterlist (T302433) [20:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:17] T302433: Create voter list for UCoC ratification vote - https://phabricator.wikimedia.org/T302433 [20:08:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:16] (03PS2) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766841 (https://phabricator.wikimedia.org/T302716) [20:11:26] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766841 (https://phabricator.wikimedia.org/T302716) (owner: 10Kosta Harlan) [20:15:17] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/766841 (https://phabricator.wikimedia.org/T302716) (owner: 10Kosta Harlan) [20:16:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:56] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [20:20:59] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [20:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:22:45] (03PS1) 10Herron: standard_packages: install freeipmi-ipmiseld on metal by default [puppet] - 10https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) [20:26:30] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [20:32:26] 10SRE, 10SRE Observability, 10Traffic, 10User-ema: Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (10herron) [20:40:04] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.302e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [20:41:10] 10SRE, 10SRE Observability (FY2021/2022-Q3): Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (10herron) Something like https://github.com/pyrra-dev/pyrra seems worth exploring for this and possibly more [20:45:14] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:45:47] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [20:45:51] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [20:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:03] (03PS1) 10Dzahn: git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 [20:47:17] (03PS2) 10Dzahn: git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 [20:47:27] (03PS3) 10Dzahn: git::clone: allow 
gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 [20:47:59] (03CR) 10jerkins-bot: [V: 04-1] git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 (owner: 10Dzahn) [20:48:33] (03PS4) 10Dzahn: git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 [20:52:47] (Processor usage over 85%) firing: (2) Alert for device scs-eqsin.mgmt.eqsin.wmnet - Processor usage over 85% - https://alerts.wikimedia.org [20:53:36] (03CR) 10JHathaway: [C: 03+1] "Do we want to alter any initial settings, e.g. log priority is set to LOG_ERR by default, is that what we want?" [puppet] - 10https://gerrit.wikimedia.org/r/766848 (https://phabricator.wikimedia.org/T302639) (owner: 10Herron) [20:55:36] (03PS1) 10Dzahn: wikistats: move repo from operatiobs/debs on Gerrit to cloud on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/766852 [21:00:05] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T2100). Please do the needful. [21:00:05] Urbanecm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:26] i'll self-service [21:00:33] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:00:43] (03CR) 10Urbanecm: [C: 03+2] Mentor dashboard: Mark mentor-tools as stable [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766794 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [21:00:55] urbanecm: can we add two more patches to the window? [21:01:01] kostajh: sure thing! [21:01:06] can you add them to the calendar please? 
[21:02:03] (03PS1) 10Kosta Harlan: Don't let MobileFrontend show abandonedit after saveComplete [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766799 (https://phabricator.wikimedia.org/T302746) [21:02:23] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10razzi) 05Open→03Resolved [21:02:32] (03PS1) 10Kosta Harlan: Make sure postEdit hook doesn't fire until after saveComplete is done [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766800 (https://phabricator.wikimedia.org/T302746) [21:02:58] yea, adding them now [21:03:09] hi, i also have a late addition, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/766788 (will add to calendar in a moment) [21:03:10] thanks [21:03:21] MatmaRex: ack! [21:03:25] i forgot the late window is so early :) [21:03:35] heh [21:04:31] added them [21:04:48] (03CR) 10Urbanecm: [C: 03+2] Make sure postEdit hook doesn't fire until after saveComplete is done [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766800 (https://phabricator.wikimedia.org/T302746) (owner: 10Kosta Harlan) [21:04:55] (03CR) 10Urbanecm: [C: 03+2] Don't let MobileFrontend show abandonedit after saveComplete [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766799 (https://phabricator.wikimedia.org/T302746) (owner: 10Kosta Harlan) [21:04:58] and +2'ed [21:05:05] (03PS1) 10Bartosz Dziewoński: Revert "htmlform: Replace some uses of isHidden to isDisabled" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766801 (https://phabricator.wikimedia.org/T302512) [21:05:07] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:05:19] (03CR) 10Urbanecm: [C: 03+2] Revert "htmlform: Replace some uses of isHidden to isDisabled" [core] (wmf/1.38.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/766801 (https://phabricator.wikimedia.org/T302512) (owner: 10Bartosz Dziewoński) [21:06:41] (03CR) 10Brennen Bearnes: [C: 04-1] git::clone: allow gitlab as a source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766851 (owner: 10Dzahn) [21:07:30] (03PS1) 10Ayounsi: drmrs: Add GTT links to OSPF [homer/public] - 10https://gerrit.wikimedia.org/r/766857 [21:08:27] I'll ping you once it's at mwdebug [21:11:58] thx [21:16:50] FYI, my patch should have no user impact, it only undoes some refactoring to fix some log spam that it caused [21:17:05] (and i also can't figure out how to reproduce the log spam) [21:17:22] MatmaRex: so it needs to be synced blindly? [21:17:23] so i'm planning to just check that the warnings are no longer being logged [21:17:29] yeah [21:17:37] well, i can check that the API still works. but it's a revert [21:18:02] I'll pull it to mwdebug just in case (but yeah, revert shouldn't break stuff) [21:21:44] (03Merged) 10jenkins-bot: Mentor dashboard: Mark mentor-tools as stable [extensions/GrowthExperiments] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766794 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [21:21:54] finally [21:22:51] testing my patch at mwdebug1001... 
[21:22:55] ...and it works [21:23:45] syncing [21:24:32] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/GrowthExperiments/includes/Specials/SpecialMentorDashboard.php: 706c2bc7f86f9eadc1284c84cc6668a4e1bf5abc: Mentor dashboard: Mark mentor-tools as stable (T280307) (duration: 00m 49s) [21:24:36] and live [21:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:44] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [21:25:37] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:26:27] (03Merged) 10jenkins-bot: Make sure postEdit hook doesn't fire until after saveComplete is done [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766800 (https://phabricator.wikimedia.org/T302746) (owner: 10Kosta Harlan) [21:26:30] (03Merged) 10jenkins-bot: Don't let MobileFrontend show abandonedit after saveComplete [extensions/VisualEditor] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766799 (https://phabricator.wikimedia.org/T302746) (owner: 10Kosta Harlan) [21:26:36] (03Merged) 10jenkins-bot: Revert "htmlform: Replace some uses of isHidden to isDisabled" [core] (wmf/1.38.0-wmf.23) - 10https://gerrit.wikimedia.org/r/766801 (https://phabricator.wikimedia.org/T302512) (owner: 10Bartosz Dziewoński) [21:27:10] kostajh: MatmaRex: all backports are at mwdebug1001 now [21:27:13] can you have a look? 
[21:27:44] looking [21:28:38] the action=options API works… that's all i know [21:28:50] okay [21:28:53] syncing and let's hope [21:28:57] (and so does the normal preferences form) [21:29:44] i'll watch this logstash search: https://logstash.wikimedia.org/goto/9e786dfeb7ea731991c6472a28ae15cd [21:29:51] thanks [21:30:31] urbanecm: having a look [21:30:34] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/includes/htmlform/: 67831a3: Revert "htmlform: Replace some uses of isHidden to isDisabled" (T302512) (duration: 00m 48s) [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:41] T302512: PHP Notice: Undefined index: watchlistdays-local-exception - https://phabricator.wikimedia.org/T302512 [21:30:49] MatmaRex: should be live now. Please ping me if anything wrong's happening. [21:31:11] thanks [21:32:31] np [21:34:10] MatmaRex: unrelated to the change I'm testing, but I noticed a bunch of console errors when opening VE on https://test.wikipedia.org/w/index.php?title=Erica_Nockalls/edithistory&action=edit, fyi [21:35:05] kostajh: thanks, known issue: https://phabricator.wikimedia.org/T302362 will be fixed this week [21:35:07] aqu: please check your internet connection -- you're quitting/joining the channel every couple of mins. Thanks! [21:35:42] kostajh: let me know how it looks like :) [21:35:46] urbanecm: nearly done [21:35:49] ack [21:36:44] urbanecm: done. looks good [21:36:51] great! syncing [21:38:50] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.23/extensions/VisualEditor/modules/ve-mw/init/targets: e22e4d5: b4dd4c4: VisualEditor backports (T302746) (duration: 00m 51s) [21:38:55] kostajh: and should be live [21:38:57] anything else? [21:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:58] T302746: Post-edit dialog does not appear for newcomer task edits - https://phabricator.wikimedia.org/T302746 [21:39:14] urbanecm: no, that's it for me. thank you! 
[21:39:18] no problem [21:39:27] !log UTC late B&C window done [21:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:30] (03PS1) 10MewOphaswongse: GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) [21:44:49] RECOVERY - puppet last run on deneb is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:59:47] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:57] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [22:00:05] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220228T2200). 
[22:00:48] !log running extensions/SecurePoll/cli/wm-scripts/ucoc/populateEditCount.php on each wiki (s1 thru s8 simultaneously) (T302433) [22:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:55] T302433: Create voter list for UCoC ratification vote - https://phabricator.wikimedia.org/T302433 [22:07:43] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:15] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.358e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [22:27:07] (03PS1) 10JHathaway: Restrict filesystem_avail_bigger_than_size check to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/766871 (https://phabricator.wikimedia.org/T302687) [22:27:40] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/766871 (https://phabricator.wikimedia.org/T302687) (owner: 10JHathaway) [22:36:07] !log start in-place reindex of kmwiki kmwiktionary and kmwikibooks on cirrus cloudelastic cluster T299707 [22:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:16] T299707: Reindex khmer language wikis on cloudelastic - https://phabricator.wikimedia.org/T299707 [22:48:39] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:57] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [23:15:59] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.395e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [23:20:48] (03PS5) 10Dzahn: git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 [23:21:15] (03CR) 10Dzahn: git::clone: allow gitlab as a source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/766851 (owner: 10Dzahn) [23:23:20] (03PS2) 10Dzahn: wikistats: move repo from operations/debs on Gerrit to cloud on Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/766852 [23:32:15] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6016 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [23:33:28] (03CR) 10Brennen Bearnes: [C: 03+1] git::clone: allow gitlab as a source [puppet] - 10https://gerrit.wikimedia.org/r/766851 (owner: 10Dzahn) [23:35:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp6014 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [23:42:50] (03PS1) 10Dzahn: add 15.wikipedia.org to cert for miscweb behind istio ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) [23:43:43] 10SRE, 10ops-codfw, 10decommission-hardware: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577 (10Papaul) [23:46:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6016 is CRITICAL: 2.413e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6016 [23:48:57] PROBLEM - Time elapsed since the last kafka event processed by purged on cp6014 is CRITICAL: 2.415e+08 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=drmrs+prometheus/ops&var-instance=cp6014 [23:50:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10JBennett) Approved [23:51:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10JBennett) Approved [23:51:56] (03PS1) 10Bking: elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) [23:52:42] (03CR) 10jerkins-bot: [V: 04-1] elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198) (owner: 10Bking) [23:53:47] (03PS2) 10Bking: elastic: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/766876 (https://phabricator.wikimedia.org/T276198)