[00:00:37] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) Open→Resolved These have all been re-imaged with the correct raid configuration [00:29:37] (PS1) MusikAnimal: Add WikiEditor's Realtime Preview to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/781096 (https://phabricator.wikimedia.org/T304596) [00:34:42] (CR) MusikAnimal: "Submitting this patch now as we already had it enabled on testwiki, but it got removed once we added the check for the beta feature. From " [mediawiki-config] - https://gerrit.wikimedia.org/r/781096 (https://phabricator.wikimedia.org/T304596) (owner: MusikAnimal) [01:36:49] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:19] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:55] 
(NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:39:17] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:45:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 52.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:45:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 40.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:47:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 73.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:48:07] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:46:59] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:42:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:43:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:45:03] (PS2) Ayounsi: Add script to move devices attributes [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) [05:47:02] (CR) Ayounsi: "Thanks, it's satisfying to go from 160 lines to 122 :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi) [05:48:09] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:49:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:49:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47966 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:53:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:21:15] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:30:06] SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (ayounsi) @wiki_willy See T304849 (and its description history), or T306129 @cmooney we can query LibreNMS as it also have the data, but I'd prefer to not hav... [06:32:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki-history-drop-snapshot.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:35:10] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (ayounsi) [06:35:16] SRE-tools, Infrastructure-Foundations, netbox, Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (ayounsi) [06:35:22] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (ayounsi) Cables can't be moved through the Netbox UI, they need to be deleted and re-created, which is cumbersome and error-prone. The curre... [06:49:01] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Aklapper) Hi @maryyang (and welcome!), could you please also provide the purpose / the underlying reason why this is requested? 
Thanks a lot! [06:49:18] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Aklapper) a:dr0ptp4kt→None [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220415T0700) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:20:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:05] PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100% [07:52:49] PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100% [07:52:49] PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100% [07:53:05] PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100% [08:19:59] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01002 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:29:13] RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [08:30:13] RECOVERY - Host ms-be1069 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [08:32:13] RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [08:41:23] RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:07:36] !log cmooney@cumin1001 START - Cookbook 
sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS stretch [09:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch [09:21:18] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [09:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage [09:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:57] SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (cmooney) > @cmooney we can query LibreNMS as it also have the data, but I'd prefer to not have the source of truth driven by production (thus the alert only f... [10:23:13] SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (cmooney) Actually one really ugly thing you could do is to make the Jinja templates add "disabled" config for every _possible_ interface name. For example fo... 
[10:30:59] (PS1) Majavah: dynamicproxy: expose api on port 443 [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:34:24] (PS1) Majavah: hieradata: pcc: add project-proxy puppetmaster key [puppet] - https://gerrit.wikimedia.org/r/781956 [10:42:09] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) Thanks! Note that I don't know if there is enough total ports/capacity/diversity. So don't be surprised when you run the numbers :) If there is enough could you let us know how much free ports/c... [10:42:42] (PS1) Majavah: openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) [10:44:27] (CR) jerkins-bot: [V: -1] openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [10:44:39] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34847/console" [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [10:45:41] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) [10:46:20] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) [10:46:47] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) [10:46:54] (PS2) Majavah: 
openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) [10:48:26] (CR) jerkins-bot: [V: -1] openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [10:49:04] (PS3) Majavah: openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) [10:50:08] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34848/console" [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [10:56:59] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1068.eqiad.wmnet with OS stretch [10:57:00] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [10:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:03] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch executed with errors:... 
[10:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1070.eqiad.wmnet with OS stretch [10:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:05] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch executed with errors:... [11:00:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:00:41] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [11:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS stretch [11:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:49] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch executed with errors:... 
[11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:36] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS stretch [11:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:41] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch executed with errors:... 
[11:06:39] (PS1) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) [11:07:20] (CR) jerkins-bot: [V: -1] openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) (owner: Majavah) [11:08:41] (PS2) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) [11:09:20] (CR) jerkins-bot: [V: -1] openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) (owner: Majavah) [11:10:29] (PS3) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) [11:21:34] (CR) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney) [11:26:04] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:27:07] (PS1) Zabe: maintain-views: remove user_options column from user table [puppet] - https://gerrit.wikimedia.org/r/782017 [13:12:42] (CR) Cathal Mooney: [C: +1] "LGTM! 
Good little lesson on how this stuff works in NB thanks :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi) [13:29:02] SRE, Traffic: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (CDanis) Something else I'm wondering about is if we can do any rate-limiting of the generation of such messages within haproxy. I suspect it was spending a non-trivial amount of CPU time... [14:03:38] !log labweb1001:~$ mwscript resetUserEmail.php --wiki labswiki Fomafix [14:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] (PS1) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:14:46] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:24:50] (PS3) Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) [14:25:14] (CR) Roman Stolar: Create docker configuration for local development (4 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: Roman Stolar) [14:26:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:44] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:04:36] (PS2) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [15:05:10] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [15:14:00] (PS1) PipelineBot: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/782183 [15:15:38] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:54] (PS3) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [15:17:30] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [15:29:10] (PS4) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [15:29:42] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [15:55:39] (PS5) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [15:56:11] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [15:58:46] (PS6) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [15:59:20] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [16:11:36] !log depooling & disabling puppet on cp2027 
for some manual testing T303534 [16:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:22] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:06] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (maryyang) Hi Andre, Here is the purpose: Observability tools such as Logstash, as well as other monitoring and config viewing facilities (especially for Wikifunctions and connected architect... [16:42:22] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (maryyang) [16:43:48] SRE, LDAP-Access-Requests, SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (lmata) [17:06:06] (PS7) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [17:06:42] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [17:07:46] (PS8) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [17:08:19] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [17:40:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:40:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [17:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P24647 and previous config saved to /var/cache/conftool/dbconfig/20220415-174050-ladsgroup.json [17:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:43:33] !log reenabled puppet on cp2027 and repooled after some manual testing T303534 [17:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:11] (PS1) Zabe: analytics: migrate clean_jupyter_user_local_trash cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782339 (https://phabricator.wikimedia.org/T273673) [17:44:13] (PS1) Zabe: analytics: remove absented clean_jupyter_user_local_trash cron [puppet] - https://gerrit.wikimedia.org/r/782340 (https://phabricator.wikimedia.org/T273673) [17:48:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24648 and previous config saved to /var/cache/conftool/dbconfig/20220415-174824-ladsgroup.json [17:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:48:35] (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34859/" [puppet] - https://gerrit.wikimedia.org/r/782339 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [17:54:18] (PS1) Dzahn: replace deploy-1002 with 
deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069) [17:55:35] (PS2) Dzahn: replace deploy-1002 with deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069) [17:56:30] (PS1) Zabe: prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) [17:56:33] (PS1) Zabe: prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673) [17:57:25] (CR) Dzahn: [C: +2] replace deploy-1002 with deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069) (owner: Dzahn) [17:57:28] (PS2) Zabe: prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) [17:59:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:59:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [17:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:25] (CR) Thcipriani: [C: +1] ci: migrate gitcache crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [18:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24649 and previous config saved to /var/cache/conftool/dbconfig/20220415-180329-ladsgroup.json [18:03:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:27] (CR) Dzahn: [C: +2] ci: migrate gitcache crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [18:16:51] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) Thanks, from what I understand moving those hosts to private IPs are much shorter term goals than the ones you mentioned? (I even see... [18:18:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24650 and previous config saved to /var/cache/conftool/dbconfig/20220415-181834-ladsgroup.json [18:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:33:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24651 and previous config saved to /var/cache/conftool/dbconfig/20220415-183339-ladsgroup.json [18:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:34:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [18:34:07] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [18:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24652 and previous config saved to /var/cache/conftool/dbconfig/20220415-183412-ladsgroup.json [18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24653 and previous config saved to /var/cache/conftool/dbconfig/20220415-184332-ladsgroup.json [18:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:45:34] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Majavah) >>! In T305414#7858338, @ayounsi wrote: > Thanks, from what I understand moving those hosts to private IPs are much shorter term goals... 
[18:49:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:49:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [18:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24654 and previous config saved to /var/cache/conftool/dbconfig/20220415-185837-ladsgroup.json [18:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:13:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24655 and previous config saved to /var/cache/conftool/dbconfig/20220415-191343-ladsgroup.json [19:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24656 and previous config saved to /var/cache/conftool/dbconfig/20220415-192848-ladsgroup.json [19:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis 
- https://phabricator.wikimedia.org/T298565 [19:29:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [19:29:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [19:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24657 and previous config saved to /var/cache/conftool/dbconfig/20220415-192920-ladsgroup.json [19:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:48] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24658 and previous config saved to /var/cache/conftool/dbconfig/20220415-193638-ladsgroup.json [19:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24659 and previous config saved to /var/cache/conftool/dbconfig/20220415-195143-ladsgroup.json [19:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:18] (ProbeDown) firing: 
Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24660 and previous config saved to /var/cache/conftool/dbconfig/20220415-200648-ladsgroup.json [20:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:09:48] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:11:18] (ProbeDown) resolved: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24661 and previous config saved to /var/cache/conftool/dbconfig/20220415-202153-ladsgroup.json [20:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:22:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:22:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24662 and previous config saved to /var/cache/conftool/dbconfig/20220415-202227-ladsgroup.json [20:22:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:37:00] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:37:16] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 45.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:37:30] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 33.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:38:40] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 12.39 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:39:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 84.61 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:39:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 90.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:39:54] (PS9) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [20:40:32] (CR) jerkins-bot: [V: -1] pcc commit do not
merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [20:40:54] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [20:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24663 and previous config saved to /var/cache/conftool/dbconfig/20220415-204449-ladsgroup.json [20:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:59:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24664 and previous config saved to /var/cache/conftool/dbconfig/20220415-205954-ladsgroup.json [20:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:20] SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (Dzahn) Hi @jmads @MNovotny_WMF is there a date we can use for the "contract_expiry" field? It can be updated later but when we upload the needed code change and it's for contractors we'll have to put som...
[21:14:46] PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100% [21:15:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24665 and previous config saved to /var/cache/conftool/dbconfig/20220415-211500-ladsgroup.json [21:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24666 and previous config saved to /var/cache/conftool/dbconfig/20220415-213005-ladsgroup.json [21:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:30:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:30:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [21:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24667 and previous config saved to /var/cache/conftool/dbconfig/20220415-213038-ladsgroup.json [21:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:59] (PS10) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [21:41:32] (CR) jerkins-bot: [V: -1] pcc commit do not
merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [21:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24668 and previous config saved to /var/cache/conftool/dbconfig/20220415-214757-ladsgroup.json [21:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:55:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24669 and previous config saved to /var/cache/conftool/dbconfig/20220415-220302-ladsgroup.json [22:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24670 and previous config saved to /var/cache/conftool/dbconfig/20220415-221807-ladsgroup.json [22:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to
https://phabricator.wikimedia.org/P24671 and previous config saved to /var/cache/conftool/dbconfig/20220415-223312-ladsgroup.json [22:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:33:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [22:33:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24672 and previous config saved to /var/cache/conftool/dbconfig/20220415-223345-ladsgroup.json [22:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24673 and previous config saved to /var/cache/conftool/dbconfig/20220415-225719-ladsgroup.json [22:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:00:54] RECOVERY - Host ms-be1071 is UP: PING OK - 
Packet loss = 0%, RTA = 0.18 ms [23:02:06] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:03:19] FWIW the above issue with ms-be1071 was our old ARP problem on the Juniper QFX series. [23:04:14] I reset the l2 forwarding cache to resolve. I did manage to capture some debug info I'd been after so hopefully that will allow Juniper to progress the case. [23:12:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24674 and previous config saved to /var/cache/conftool/dbconfig/20220415-231224-ladsgroup.json [23:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:02] (PS5) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 [23:23:24] (CR) jerkins-bot: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (owner: Ebernhardson) [23:27:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24675 and previous config saved to /var/cache/conftool/dbconfig/20220415-232729-ladsgroup.json [23:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:18] (PS6) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 [23:42:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24676 and previous config saved to /var/cache/conftool/dbconfig/20220415-234234-ladsgroup.json [23:42:38]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:43:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:43:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24677 and previous config saved to /var/cache/conftool/dbconfig/20220415-234306-ladsgroup.json [23:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:08] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24678 and previous config saved to /var/cache/conftool/dbconfig/20220415-235023-ladsgroup.json [23:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565