[00:00:37] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson) Open→Resolved These have all been re-imaged with the correct raid configuration
[00:29:37] <wikibugs>	 (PS1) MusikAnimal: Add WikiEditor's Realtime Preview to BetaFeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/781096 (https://phabricator.wikimedia.org/T304596)
[00:34:42] <wikibugs>	 (CR) MusikAnimal: "Submitting this patch now as we already had it enabled on testwiki, but it got removed once we added the check for the beta feature. From " [mediawiki-config] - https://gerrit.wikimedia.org/r/781096 (https://phabricator.wikimedia.org/T304596) (owner: MusikAnimal)
[01:36:49] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:16:19] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:39:17] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:45:27] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 52.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:45:49] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 40.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:47:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 73.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[03:48:07] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[04:46:59] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:42:35] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:45:03] <wikibugs>	 (PS2) Ayounsi: Add script to move devices attributes [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166)
[05:47:02] <wikibugs>	 (CR) Ayounsi: "Thanks, it's satisfying to go from 160 lines to 122 :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[05:48:09] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:49:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:49:35] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47966 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:53:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:15] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:30:06] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (ayounsi) @wiki_willy See T304849 (and its description history), or T306129  @cmooney we can query LibreNMS as it also have the data, but I'd prefer to not hav...
[06:32:13] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki-history-drop-snapshot.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:35:10] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (ayounsi)
[06:35:16] <wikibugs>	 SRE-tools, Infrastructure-Foundations, netbox, Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (ayounsi)
[06:35:22] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (ayounsi) Cables can't be moved through the Netbox UI, they need to be deleted and re-created, which is cumbersome and error-prone. The curre...
[06:49:01] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Aklapper) Hi @maryyang (and welcome!), could you please also provide the purpose / the underlying reason why this is requested? Thanks a lot!
[06:49:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Aklapper) a:dr0ptp4kt→None
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220415T0700)
[07:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:20:31] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:23] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:05] <icinga-wm>	 PROBLEM - Host ms-be1068 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:49] <icinga-wm>	 PROBLEM - Host ms-be1070 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:49] <icinga-wm>	 PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100%
[07:53:05] <icinga-wm>	 PROBLEM - Host ms-be1069 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:59] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01002 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:29:13] <icinga-wm>	 RECOVERY - Host ms-be1068 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[08:30:13] <icinga-wm>	 RECOVERY - Host ms-be1069 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[08:32:13] <icinga-wm>	 RECOVERY - Host ms-be1070 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[08:41:23] <icinga-wm>	 RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[09:07:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1071.eqiad.wmnet with OS stretch
[09:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:41] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch
[09:21:18] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage
[09:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:53] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1071.eqiad.wmnet with reason: host reimage
[09:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:57] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (cmooney) > @cmooney we can query LibreNMS as it also have the data, but I'd prefer to not have the source of truth driven by production (thus the alert only f...
[10:23:13] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netbox, netops: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (cmooney) Actually one really ugly thing you could do is to make the Jinja templates add "disabled" config for every _possible_ interface name.  For example fo...
[10:30:59] <wikibugs>	 (PS1) Majavah: dynamicproxy: expose api on port 443 [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453)
[10:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:34:24] <wikibugs>	 (PS1) Majavah: hieradata: pcc: add project-proxy puppetmaster key [puppet] - https://gerrit.wikimedia.org/r/781956
[10:42:09] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) Thanks!  Note that I don't know if there is enough total ports/capacity/diversity. So don't be surprised when you run the numbers :) If there is enough could you let us know how much free ports/c...
[10:42:42] <wikibugs>	 (PS1) Majavah: openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247)
[10:44:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[10:44:39] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34847/console" [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[10:45:41] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[10:46:20] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[10:46:47] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi)
[10:46:54] <wikibugs>	 (PS2) Majavah: openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247)
[10:48:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[10:49:04] <wikibugs>	 (PS3) Majavah: openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247)
[10:50:08] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34848/console" [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah)
[10:56:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1068.eqiad.wmnet with OS stretch
[10:57:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[10:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:03] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1068.eqiad.wmnet with OS stretch executed with errors:...
[10:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1070.eqiad.wmnet with OS stretch
[10:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:05] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1070.eqiad.wmnet with OS stretch executed with errors:...
[11:00:40] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:00:41] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[11:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:44] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS stretch
[11:01:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:49] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch executed with errors:...
[11:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:02:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:36] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1069.eqiad.wmnet with OS stretch
[11:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:41] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-be1069.eqiad.wmnet with OS stretch executed with errors:...
[11:06:39] <wikibugs>	 (PS1) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666)
[11:07:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[11:08:41] <wikibugs>	 (PS2) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666)
[11:09:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[11:10:29] <wikibugs>	 (PS3) Majavah: openstack: make wmf_sink authenticate to enc api via keystone [puppet] - https://gerrit.wikimedia.org/r/781977 (https://phabricator.wikimedia.org/T274666)
[11:21:34] <wikibugs>	 (CR) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[11:26:04] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:27:07] <wikibugs>	 (PS1) Zabe: maintain-views: remove user_options column from user table [puppet] - https://gerrit.wikimedia.org/r/782017
[13:12:42] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!  Good little lesson on how this stuff works in NB thanks :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[13:29:02] <wikibugs>	 SRE, Traffic: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (CDanis) Something else I'm wondering about is if we can do any rate-limiting of the generation of such messages within haproxy.  I suspect it was spending a non-trivial amount of CPU time...
[14:03:38] <Krinkle>	 !log labweb1001:~$ mwscript resetUserEmail.php --wiki labswiki Fomafix
[14:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:13] <wikibugs>	 (PS1) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[14:14:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[14:24:50] <wikibugs>	 (PS3) Roman Stolar: Create docker configuration for local development [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249)
[14:25:14] <wikibugs>	 (CR) Roman Stolar: Create docker configuration for local development (4 comments) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: Roman Stolar)
[14:26:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:44] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:04:36] <wikibugs>	 (PS2) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[15:05:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[15:14:00] <wikibugs>	 (PS1) PipelineBot: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/782183
[15:15:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:54] <wikibugs>	 (PS3) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[15:17:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[15:29:10] <wikibugs>	 (PS4) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[15:29:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[15:55:39] <wikibugs>	 (PS5) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[15:56:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[15:58:46] <wikibugs>	 (PS6) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[15:59:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[16:11:36] <cdanis>	 !log depooling & disabling puppet on cp2027 for some manual testing T303534
[16:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:22] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:42:06] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (maryyang) Hi Andre,   Here is the purpose: Observability tools such as Logstash, as well as other monitoring and config viewing facilities (especially for Wikifunctions and connected architect...
[16:42:22] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (maryyang)
[16:43:48] <wikibugs>	 SRE, LDAP-Access-Requests, SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (lmata)
[17:06:06] <wikibugs>	 (PS7) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[17:06:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[17:07:46] <wikibugs>	 (PS8) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[17:08:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[17:40:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[17:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[17:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24647 and previous config saved to /var/cache/conftool/dbconfig/20220415-174050-ladsgroup.json
[17:40:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:54] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[17:43:33] <cdanis>	 !log reenabled puppet on cp2027 and repooled after some manual testing T303534
[17:43:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:11] <wikibugs>	 (PS1) Zabe: analytics: migrate clean_jupyter_user_local_trash cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782339 (https://phabricator.wikimedia.org/T273673)
[17:44:13] <wikibugs>	 (PS1) Zabe: analytics: remove absented clean_jupyter_user_local_trash cron [puppet] - https://gerrit.wikimedia.org/r/782340 (https://phabricator.wikimedia.org/T273673)
[17:48:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24648 and previous config saved to /var/cache/conftool/dbconfig/20220415-174824-ladsgroup.json
[17:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:28] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[17:48:35] <wikibugs>	 (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34859/" [puppet] - https://gerrit.wikimedia.org/r/782339 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:54:18] <wikibugs>	 (PS1) Dzahn: replace deploy-1002 with deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069)
[17:55:35] <wikibugs>	 (PS2) Dzahn: replace deploy-1002 with deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069)
[17:56:30] <wikibugs>	 (PS1) Zabe: prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673)
[17:56:33] <wikibugs>	 (PS1) Zabe: prometheus: remove absented prometheus_directorysize cron [puppet] - https://gerrit.wikimedia.org/r/782360 (https://phabricator.wikimedia.org/T273673)
[17:57:25] <wikibugs>	 (CR) Dzahn: [C: +2] replace deploy-1002 with deploy-1004 [puppet] - https://gerrit.wikimedia.org/r/782353 (https://phabricator.wikimedia.org/T306069) (owner: Dzahn)
[17:57:28] <wikibugs>	 (PS2) Zabe: prometheus: migrate prometheus_directorysize cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673)
[17:59:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:59:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:25] <wikibugs>	 (CR) Thcipriani: [C: +1] ci: migrate gitcache crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[18:03:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24649 and previous config saved to /var/cache/conftool/dbconfig/20220415-180329-ladsgroup.json
[18:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:27] <wikibugs>	 (CR) Dzahn: [C: +2] ci: migrate gitcache crons to systemd timer jobs [puppet] - https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[18:16:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) Thanks, from what I understand moving those hosts to private IPs are much shorter term goals than the ones you mentioned? (I even see...
[18:18:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24650 and previous config saved to /var/cache/conftool/dbconfig/20220415-181834-ladsgroup.json
[18:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:33:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24651 and previous config saved to /var/cache/conftool/dbconfig/20220415-183339-ladsgroup.json
[18:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:44] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[18:34:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:34:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[18:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24652 and previous config saved to /var/cache/conftool/dbconfig/20220415-183412-ladsgroup.json
[18:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24653 and previous config saved to /var/cache/conftool/dbconfig/20220415-184332-ladsgroup.json
[18:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:38] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[18:45:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (Majavah) >>! In T305414#7858338, @ayounsi wrote: > Thanks, from what I understand moving those hosts to private IPs are much shorter term goals...
[18:49:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:49:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24654 and previous config saved to /var/cache/conftool/dbconfig/20220415-185837-ladsgroup.json
[18:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:13:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24655 and previous config saved to /var/cache/conftool/dbconfig/20220415-191343-ladsgroup.json
[19:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24656 and previous config saved to /var/cache/conftool/dbconfig/20220415-192848-ladsgroup.json
[19:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:53] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[19:29:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[19:29:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[19:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24657 and previous config saved to /var/cache/conftool/dbconfig/20220415-192920-ladsgroup.json
[19:29:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:48] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:36:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24658 and previous config saved to /var/cache/conftool/dbconfig/20220415-193638-ladsgroup.json
[19:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:44] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[19:51:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24659 and previous config saved to /var/cache/conftool/dbconfig/20220415-195143-ladsgroup.json
[19:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:18] <jinxer-wm>	 (ProbeDown) firing: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:06:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24660 and previous config saved to /var/cache/conftool/dbconfig/20220415-200648-ladsgroup.json
[20:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:32] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:09:48] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:11:18] <jinxer-wm>	 (ProbeDown) resolved: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:21:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24661 and previous config saved to /var/cache/conftool/dbconfig/20220415-202153-ladsgroup.json
[20:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:01] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[20:22:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[20:22:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[20:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24662 and previous config saved to /var/cache/conftool/dbconfig/20220415-202227-ladsgroup.json
[20:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:37:00] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:37:16] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 45.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:37:30] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 33.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:38:40] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 12.39 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:39:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 84.61 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:39:44] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 90.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:39:54] <wikibugs>	 (PS9) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[20:40:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[20:40:54] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[20:44:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24663 and previous config saved to /var/cache/conftool/dbconfig/20220415-204449-ladsgroup.json
[20:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:55] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[20:59:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24664 and previous config saved to /var/cache/conftool/dbconfig/20220415-205954-ladsgroup.json
[20:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:20] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (Dzahn) Hi @jmads @MNovotny_WMF is there a date we can use for the "contract_expiry" field? It can be updated later but when we upload the needed code change and it's for contractors we'll have to put som...
[21:14:46] <icinga-wm>	 PROBLEM - Host ms-be1071 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24665 and previous config saved to /var/cache/conftool/dbconfig/20220415-211500-ladsgroup.json
[21:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24666 and previous config saved to /var/cache/conftool/dbconfig/20220415-213005-ladsgroup.json
[21:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:11] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:30:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[21:30:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[21:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24667 and previous config saved to /var/cache/conftool/dbconfig/20220415-213038-ladsgroup.json
[21:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:59] <wikibugs>	 (PS10) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107
[21:41:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook)
[21:47:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24668 and previous config saved to /var/cache/conftool/dbconfig/20220415-214757-ladsgroup.json
[21:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:02] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[21:55:36] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:03:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24669 and previous config saved to /var/cache/conftool/dbconfig/20220415-220302-ladsgroup.json
[22:03:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24670 and previous config saved to /var/cache/conftool/dbconfig/20220415-221807-ladsgroup.json
[22:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24671 and previous config saved to /var/cache/conftool/dbconfig/20220415-223312-ladsgroup.json
[22:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:17] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[22:33:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[22:33:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[22:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24672 and previous config saved to /var/cache/conftool/dbconfig/20220415-223345-ladsgroup.json
[22:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24673 and previous config saved to /var/cache/conftool/dbconfig/20220415-225719-ladsgroup.json
[22:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:24] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[23:00:54] <icinga-wm>	 RECOVERY - Host ms-be1071 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[23:02:06] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:03:19] <topranks>	 FWIW the above issue with ms-be1071 was our old ARP problem on the Juniper QFX series.
[23:04:14] <topranks>	 I reset the l2 forwarding cache to resolve.  I did manage to capture some debug info I'd been after so hopefully that will allow Juniper to progress the case.
[23:12:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24674 and previous config saved to /var/cache/conftool/dbconfig/20220415-231224-ladsgroup.json
[23:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:02] <wikibugs>	 (PS5) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009
[23:23:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (owner: Ebernhardson)
[23:27:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24675 and previous config saved to /var/cache/conftool/dbconfig/20220415-232729-ladsgroup.json
[23:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:18] <wikibugs>	 (PS6) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009
[23:42:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24676 and previous config saved to /var/cache/conftool/dbconfig/20220415-234234-ladsgroup.json
[23:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:40] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[23:43:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:43:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24677 and previous config saved to /var/cache/conftool/dbconfig/20220415-234306-ladsgroup.json
[23:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:08] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24678 and previous config saved to /var/cache/conftool/dbconfig/20220415-235023-ladsgroup.json
[23:50:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:27] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565