[00:04:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 55.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:05:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:07:01] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:07:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:13:19] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 108 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [00:46:30] !log rook@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1016.eqiad.wmnet [00:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:41] !log rook@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudvirt1016.eqiad.wmnet [00:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:47] PROBLEM - Host mw1415 is DOWN: PING CRITICAL - 
Packet loss = 100% [01:43:49] PROBLEM - Host mw1415.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:49] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw1415.eqiad.wmnet [01:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:02] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet [01:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:36] ACKNOWLEDGEMENT - SSH on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T307755 https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:53:36] ACKNOWLEDGEMENT - PHP7 rendering on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T307755 https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:53:36] ACKNOWLEDGEMENT - Apache HTTP on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T307755 https://wikitech.wikimedia.org/wiki/Application_servers [01:53:36] ACKNOWLEDGEMENT - Host mw1415 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T307755 [01:53:36] ACKNOWLEDGEMENT - Host mw1415.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T307755 [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:27:59] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:28:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:13:47] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:22:02] good morning. 
I am going to do an instance migration for CI which might cause some builds here and there to fail [06:30:19] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:43:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [06:56:49] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 201 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220506T0700) [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 58 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:03:38] turns out I forgot to copy data bah [07:06:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [07:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [07:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [07:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:25] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:14:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [07:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
puppetboard1002.eqiad.wmnet [07:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2057.codfw.wmnet with OS bullseye [07:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:44] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2057.codfw.wmnet with OS bullseye [07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:53] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2057.codfw.wmnet with OS bullseye [07:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:49:15] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2057.codfw.wmnet with OS bullseye [07:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:01:06] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 32.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:05:10] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:16:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2057.codfw.wmnet with OS bullseye [08:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:17:46] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:17:56] looking [08:18:19] <_joe_> uhhh [08:18:22] <_joe_> mgmt? [08:18:27] https://librenms.wikimedia.org/graphs/to=1651824900/id=15269/type=port_bits/from=1651738500/ [08:18:52] nah one or more device in row A is saturating one of the links between A and the router [08:19:15] actually inbound [08:19:18] looks like analytics [08:19:32] <_joe_> so not management? [08:20:06] mgmt is just the hostname, the saturation is on the prod links [08:20:31] <_joe_> yeah i saw now :P [08:20:31] how to know which devices are connected there? 
[08:21:56] I am around [08:22:46] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:22:46] (Primary outbound port utilisation over 80% #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:22:47] I used the vlans view [08:23:09] I'm as well, trying to make my way from alert to the librenms link above [08:23:19] jynus: icinga has it [08:23:40] * jbond here reading back [08:23:48] But getting to the actual device... [08:24:02] Only librenms comes to mind [08:24:11] or netbox for the list of devices [08:24:15] <_joe_> asking in #analytics [08:24:48] I see some an-worker with spikes [08:24:56] but I am not sure it is that [08:25:32] <_joe_> the surge started at the hour [08:25:34] (e.g. my backups also cause spikes but no pages/disruption) [08:25:48] the spike is visible across all the rows, row A happened to be the most impacted [08:26:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:26:04] <_joe_> joal is killing an analytics monthly report right now [08:26:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:27:25] <_joe_> got a recovery in VO?
[08:27:30] Hi folks- sorry for the mess :S I just killed the big job that was currently running [08:27:57] <_joe_> joal: we just got a recovery, it seems [08:28:03] I see traffic going down as well [08:28:10] I didn't get a recovery yet [08:28:32] <_joe_> XioNoX: it was going down for some time though [08:28:39] XioNoX: could you give us a quick rundown of how you figured out which vlan it was? [08:28:40] <_joe_> it should drop off if it was the job [08:28:40] the jinxer bot above showed a recovery, before the job was killed I think [08:28:50] or which set of devices? [08:29:01] yeah, or maybe the job was just one big spike and then decreased on its own [08:29:02] I saw lots of graphs but nothing clear to me [08:29:25] jynus: yep [08:29:52] reaching the router was easy- it is on icinga, as Alex said [08:29:55] but from there? [08:30:24] <_joe_> joal: looks like the job was done with the heavy part of it, sigh [08:30:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:41] I went here: https://librenms.wikimedia.org/alerts and clicked on the little "+" icons only appearing on mouseover unfortunately [08:30:52] PROBLEM - Host ml-serve-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:55] <_joe_> XioNoX: I would suggest we re-start the job and see if it saturates again [08:31:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:31:16] <_joe_> elukey, klausman ^^ quite a few alerts for ml-serve [08:31:23] AIUI that shows the links, which in turn shows the host connected there (like here https://librenms.wikimedia.org/device/device=160/tab=port/port=14244/) [08:31:37] _joe_: that's my bad.
[08:31:38] RECOVERY - Host ml-serve-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [08:31:40] <_joe_> jayme, jynus please let's focus on the issue we saw for now [08:31:40] _joe_: We have ways to throttle that job - it never happened yet :S [08:31:54] Could the recent change of vlan be related? [08:31:55] _joe_: doing reboots for the kernel update and forgot to downtime things [08:31:58] I can't see why [08:32:00] yeah ml-serve should be unrelated to the saturation [08:32:03] <_joe_> klausman: np! [08:32:06] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:32:22] <_joe_> so, regarding the saturation, we're pretty sure it was caused by this job [08:32:40] <_joe_> should we try to run it again to confirm, if it happens again add throttling? [08:33:57] <_joe_> joal, XioNoX ^^ any opinions? [08:34:00] PROBLEM - Host ml-serve-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:17] joal: if by new vlan you mean the new eqiad racks, it doesn't seem related as there was no spike over there [08:34:30] I was thinking of the analytics-vlan removal [08:34:54] joal: nah, unrelated [08:34:59] ack [08:35:04] RECOVERY - Host ml-serve-ctrl1002 is UP: PING WARNING - Packet loss = 50%, RTA = 0.37 ms [08:36:31] _joe_: sure [08:36:33] From a scheduling perspective: that big job usually launches on the 1st of the month, in conjunction with many other big jobs - this makes those "big jobs" all share resources, so that no specific one is too greedy [08:37:07] We had an issue running that job at its regular time, and now it runs almost alone, which makes it put a lot of pressure at once [08:37:09] creating a group of analytics interfaces to monitor them better in librenms, 5min [08:37:19] sure XioNoX [08:38:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:39:17] I do plan to look at QoS on the network (hopefully next quarter). Doesn't solve these kind of problems, but should hopefully be able to mitigate impact in such a circumstance. [08:39:49] https://librenms.wikimedia.org/bill/bill_id=28/ (cr1-eqiad only so far) [08:40:04] "analytics" in general being a perfect candidate for scavenger class, as it's not production traffic, I assume not real-time, and I'm sure shifts around large datasets at times [08:40:39] correct topranks - thank you for investing in that [08:41:28] PROBLEM - Check systemd state on ml-serve-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:01] joal, _joe_, it's ok for me if you want to re-run it [08:43:11] <_joe_> +1 [08:43:17] I'm curious though why this time the rate limiting didn't work out [08:43:18] as is, not throttling? [08:44:14] yeah [08:45:58] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet [08:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:46:54] Job just restarted [08:47:00] error already? [08:47:12] monitoring! [08:47:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:48:05] XioNoX: it'll take time to get to the point of network-usage - I'd say about 15 minutes (maybe a bit less) [08:48:16] noted, thx! 
[08:52:31] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet [08:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:26] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet [08:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:01] details: Currently 2 jobs running - they are starting to move data across the network [08:57:51] PROBLEM - Host ms-be2057 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:30] the network load should continue to grow in the next minutes [08:58:36] ok [08:58:51] RECOVERY - Host ms-be2057 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [09:00:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2057.codfw.wmnet with reason: host reimage [09:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:45] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet [09:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:57] 1 job is fully at the network transfer stage, the other still grows [09:02:38] I see the spikes [09:02:55] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1003.eqiad.wmnet [09:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:00] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2057.codfw.wmnet with reason: host reimage [09:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:12] more to come XioNoX :S [09:03:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:05:48] we're almost at peak I think [09:06:17] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:08:18] joal: we're about to page [09:08:28] joal: one link is saturating [09:08:32] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet [09:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:14] XioNoX: let me know when you wish me to kill the thing [09:09:45] RECOVERY - Hadoop NodeManager on analytics1076 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:10:45] joal: yeah it would be better [09:10:50] ack doing so [09:10:55] to make sure it doesn't impact prod traffic [09:10:58] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:10:58] (Primary outbound port utilisation over 80% #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:10:58] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:11:02] (Primary inbound port utilisation over 80% #page) firing: Alert for 
device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:11:10] Jobs killed [09:11:54] poor Steve [09:12:16] pages acked on VO [09:13:01] ok - I'm gonna add throttling for the jobs of this type for which I know there is a probability of saturating the links [09:13:09] sorry for the mess again folks [09:13:43] thanks! no pb! [09:17:50] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet [09:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:20:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:23:56] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet [09:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:46] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1005.eqiad.wmnet [09:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2057.codfw.wmnet with OS bullseye [09:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast4003.wikimedia.org [09:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:58] (Primary outbound port utilisation over 80% 
#page) resolved: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:30:58] (Primary outbound port utilisation over 80% #page) resolved: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:30:58] (Primary inbound port utilisation over 80% #page) resolved: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:31:02] (Primary inbound port utilisation over 80% #page) resolved: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:31:47] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1005.eqiad.wmnet [09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast4003.wikimedia.org [09:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:41] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1006.eqiad.wmnet [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install4001.wikimedia.org [09:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:37] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - 
kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install4001.wikimedia.org [09:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:25] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:02] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1006.eqiad.wmnet [09:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow4002.ulsfo.wmnet [09:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:03] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1007.eqiad.wmnet [09:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:54:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow4002.ulsfo.wmnet [09:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:43] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2058.codfw.wmnet with OS bullseye [09:56:45] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1007.eqiad.wmnet [09:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:39] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1008.eqiad.wmnet [09:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir4001.ulsfo.wmnet [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:20] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1008.eqiad.wmnet [10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir4001.ulsfo.wmnet [10:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir4002.ulsfo.wmnet [10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:48] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[10:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir4002.ulsfo.wmnet [10:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:33] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:32] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [10:29:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast5002.wikimedia.org [10:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:39] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2058.codfw.wmnet with OS bullseye [10:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2058.codfw.wmnet with OS bullseye [10:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:32] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100% [10:35:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast5002.wikimedia.org [10:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:56] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [10:37:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install5001.wikimedia.org [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm 
(exit_code=0) for VM install5001.wikimedia.org [10:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow5002.eqsin.wmnet [10:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [10:48:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow5002.eqsin.wmnet [10:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2058.codfw.wmnet with reason: host reimage [10:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2058.codfw.wmnet with reason: host reimage [10:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir5001.eqsin.wmnet [10:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir5001.eqsin.wmnet [11:02:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir5002.eqsin.wmnet [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir5002.eqsin.wmnet [11:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2058.codfw.wmnet with OS bullseye [11:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:35] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:13:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install3001.wikimedia.org [11:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install3001.wikimedia.org [11:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:09] XioNoX: heya - I'm going to do a test run of this morning's problematic job with throttling - ok for you? [11:19:21] joal: yep!
[11:20:08] launching [11:23:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM netflow3002.esams.wmnet [11:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:25:05] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow3002.esams.wmnet [11:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:16] hnowlan: happy for me to merge your change [11:34:44] jbond: please do! 
[11:35:07] hnowlan: merged [11:36:10] thanks [11:38:55] !log enabling postgres slow query log on maps replicas T307671 [11:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:01] T307671: High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20 - https://phabricator.wikimedia.org/T307671 [11:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:03:50] XioNoX: we're at max network pull - it'll last longer but not harder on links [12:14:45] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:14:54] joal: yeah it almost went to paging levels: https://librenms.wikimedia.org/graphs/to=1651839000/id=15269/type=port_bits/from=1651817400/ [12:15:28] well, it did, but not long enough to page :) [12:17:06] XioNoX: more throttling, uh?
[12:18:00] joal: if possible, a bit more would be better [12:18:12] XioNoX: sounds good - that's why I wanted a test run :) [12:18:23] Thanks for checking :) [12:27:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev1004.eqiad.wmnet [12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev1004.eqiad.wmnet [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:49:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:13] sukhe ^^ already awake? [12:51:19] * vgutierrez taking a look anyways [12:52:31] sigh.. 
a timeout fetching a log shouldn't trigger a service "crash" [12:59:13] !log ayounsi@deploy1002 Started deploy [netbox/deploy@87a36a7]: Netbox bullseye on netbox-dev2002 [12:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:19] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@87a36a7]: Netbox bullseye on netbox-dev2002 (duration: 00m 05s) [12:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:03] 10SRE-OnFire, 10DBA, 10Blocked-on-schema-change, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Marostegui) p:05Low→03Medium [13:11:19] !log ayounsi@deploy1002 Started deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 [13:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:09] (03CR) 10Jbond: [C: 03+2] P:conftool::requestctl_client: add simple script to check for block ips [puppet] - 10https://gerrit.wikimedia.org/r/789162 (owner: 10Jbond) [13:15:44] (03CR) 10Jbond: [V: 03+1] P:varnish::common: Add support for passing wikimedia_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [13:15:50] (03CR) 10Kormat: [C: 03+2] auto_schema: Supply -N to db-mysql. [software] - 10https://gerrit.wikimedia.org/r/788709 (owner: 10Kormat) [13:16:25] (03Merged) 10jenkins-bot: auto_schema: Supply -N to db-mysql. 
[software] - 10https://gerrit.wikimedia.org/r/788709 (owner: 10Kormat) [13:16:31] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:51] vgutierrez: :] [13:17:57] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10faidon) So first of all, when I look into [[ https://atlas.ripe.net/probes/6671 | this (#6671) Atlas' probe page ]], I see this: > The LIR us.wmf has shared admi... [13:18:25] sukhe: does certspotter return different status codes based on the error? [13:18:38] sukhe: maybe we could add some of them to the list of acceptable status codes that aren't 0 [13:19:37] I have to look, but that's just one of the many things that are broken with it and require fixing [13:19:42] I have a list [13:20:00] !log depool Wikidough and durum in ulsfo for T307425 [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:05] T307425: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 [13:21:29] !log ayounsi@deploy1002 Finished deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 10m 10s) [13:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:48] (03PS1) 10Jbond: scap/virtualenv.sh: use distro specific artifacts [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789804 [13:23:24] (03CR) 10Ayounsi: [C: 03+1] scap/virtualenv.sh: use distro specific artifacts [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789804 (owner: 10Jbond) [13:23:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] scap/virtualenv.sh: use distro specific artifacts [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789804 (owner: 10Jbond) [13:24:02] !log sukhe@cumin2002
START - Cookbook sre.ganeti.reboot-vm for VM doh4001.wikimedia.org [13:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:09] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10ops-monitoring-bot) VM doh4001.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [13:25:41] PROBLEM - Bird Internet Routing Daemon on doh4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:49] PROBLEM - Bird Internet Routing Daemon on durum4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:25:57] ^ expected [13:26:13] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:15] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:25] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh4002.wikimedia.org [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:32] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10ops-monitoring-bot) VM doh4002.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [13:26:51] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum4001.ulsfo.wmnet [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:58] 
10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10ops-monitoring-bot) VM durum4001.ulsfo.wmnet rebooted by sukhe@cumin2002 with reason: None [13:27:05] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:27:08] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum4002.ulsfo.wmnet [13:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10ops-monitoring-bot) VM durum4002.ulsfo.wmnet rebooted by sukhe@cumin2002 with reason: None [13:27:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh4001.wikimedia.org [13:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] RECOVERY - Bird Internet Routing Daemon on doh4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:28] (03PS1) 10Jbond: scap/checks/virtualenv.sh: fix directories [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789807 [13:30:41] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:41] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:43] RECOVERY - Bird Internet Routing Daemon on durum4002 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:53] (03CR) 10Ayounsi: [C: 03+1] scap/checks/virtualenv.sh: fix directories [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789807 (owner: 10Jbond) [13:30:57] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum4002.ulsfo.wmnet [13:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:35] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh4002.wikimedia.org [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] scap/checks/virtualenv.sh: fix directories [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789807 (owner: 10Jbond) [13:34:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum4001.ulsfo.wmnet [13:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] 10SRE, 10ops-eqiad, 10DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (10Cmjohnson) Dell has shipped the part [13:35:51] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (10ssingh) [13:36:26] (03PS1) 10JMeybohm: Add kubernetes admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 [13:38:22] !log ayounsi@deploy1002 Started deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 [13:38:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:16] !log depool Wikidough and durum in esams for T307425 [13:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] T307425: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 [13:39:29] er, hmm [13:39:31] !log depool Wikidough and durum in esams for T307424 [13:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:35] T307424: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 [13:39:45] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:54] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum3002.esams.wmnet [13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] (03CR) 10Jbond: scap: add new `scap` user to deployment hosts and scap targets (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:40:01] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10ops-monitoring-bot) VM durum3002.esams.wmnet rebooted by sukhe@cumin2002 with reason: None [13:40:10] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum3001.esams.wmnet [13:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:14] !log ayounsi@deploy1002 Finished deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 
01m 52s) [13:40:17] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10ops-monitoring-bot) VM durum3001.esams.wmnet rebooted by sukhe@cumin2002 with reason: None [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh3001.wikimedia.org [13:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:38] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10ops-monitoring-bot) VM doh3001.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [13:40:56] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh3002.wikimedia.org [13:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:02] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10ops-monitoring-bot) VM doh3002.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [13:41:22] (03PS2) 10JMeybohm: Add kubernetes admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 [13:42:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35122/console" [puppet] - 10https://gerrit.wikimedia.org/r/789808 (owner: 10JMeybohm) [13:43:17] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:43:43] PROBLEM - BFD 
status on cr3-esams is CRITICAL: CRIT: Down: 8 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:55] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 8 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:44:04] ^ expected [13:46:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:51:49] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:15] (03CR) 10Muehlenhoff: scap: add new `scap` user to deployment hosts and scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:54:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:29] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:54:57] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/789794 (https://phabricator.wikimedia.org/T273673) 
(owner: 10Slyngshede) [13:55:09] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:56:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum3002.esams.wmnet [13:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum3001.esams.wmnet [13:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh3002.wikimedia.org [13:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:52] (03PS1) 10Ayounsi: Revert "Add netbox-dev directory on deploy servers" [puppet] - 10https://gerrit.wikimedia.org/r/789343 [13:58:55] (03CR) 10Jbond: [C: 03+1] Revert "Add netbox-dev directory on deploy servers" [puppet] - 10https://gerrit.wikimedia.org/r/789343 (owner: 10Ayounsi) [13:59:12] (03PS1) 10Hashar: gerrit: replicate to codfw with 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/789810 (https://phabricator.wikimedia.org/T307137) [13:59:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh3001.wikimedia.org [13:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:55] (03CR) 10Ayounsi: [C: 03+2] Revert "Add netbox-dev directory on deploy servers" [puppet] - 10https://gerrit.wikimedia.org/r/789343 (owner: 
10Ayounsi) [14:00:49] (03PS1) 10Jbond: Revert "scap/checks/virtualenv.sh: fix directories" [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789344 [14:01:10] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (10ssingh) [14:01:34] !log depool Wikidough and durum in eqsin for T307426 [14:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:38] T307426: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 [14:01:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "scap/checks/virtualenv.sh: fix directories" [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789344 (owner: 10Jbond) [14:02:40] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum5002.eqsin.wmnet [14:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:47] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ops-monitoring-bot) VM durum5002.eqsin.wmnet rebooted by sukhe@cumin2002 with reason: None [14:02:49] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:50] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum5002.eqsin.wmnet [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh5002.wikimedia.org [14:02:57] 10SRE, 10Infrastructure-Foundations: Migrate 
Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ops-monitoring-bot) VM durum5002.eqsin.wmnet rebooted by sukhe@cumin2002 with reason: None [14:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:02] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ops-monitoring-bot) VM doh5002.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [14:03:02] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh5001.wikimedia.org [14:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:09] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ops-monitoring-bot) VM doh5001.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [14:03:36] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM durum5002.eqsin.wmnet [14:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum5001.eqsin.wmnet [14:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ops-monitoring-bot) VM durum5001.eqsin.wmnet rebooted by sukhe@cumin2002 with 
reason: None [14:04:16] !log ayounsi@deploy1002 Started deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 [14:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:43] (03CR) 10Cathal Mooney: [C: 03+2] Minor fixes to ASW EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/789597 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:04:50] !log ayounsi@deploy1002 Finished deploy [netbox-dev/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 00m 34s) [14:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] (03Merged) 10jenkins-bot: Minor fixes to ASW EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/789597 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [14:06:55] PROBLEM - BFD status on cr3-eqsin is CRITICAL: CRIT: Down: 8 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:05] (03PS1) 10Jbond: scap.cfg: use correct deploy dir [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789811 [14:07:36] (03CR) 10Ayounsi: [C: 03+1] scap.cfg: use correct deploy dir [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789811 (owner: 10Jbond) [14:07:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] scap.cfg: use correct deploy dir [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789811 (owner: 10Jbond) [14:07:56] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 [14:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:05] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:08:55] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 8 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:11] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:11:27] RECOVERY - BFD status on cr3-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh5001.wikimedia.org [14:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum5001.eqsin.wmnet [14:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh5002.wikimedia.org [14:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum5002.eqsin.wmnet [14:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:24] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: sre.ganeti.reboot-vm cookbook should re-enable Puppet if it was disabled - https://phabricator.wikimedia.org/T307792 (10ssingh) [14:16:53] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (10ssingh) [14:18:42] (03PS1) 10Ayounsi: DHCP: netbox-dev2002 set to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/789814 [14:19:25] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 11m 29s) [14:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] (03CR) 10Jbond: [C: 03+1] DHCP: netbox-dev2002 set to bullseye [puppet] - 
10https://gerrit.wikimedia.org/r/789814 (owner: 10Ayounsi) [14:21:53] (03CR) 10Ayounsi: [C: 03+2] DHCP: netbox-dev2002 set to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/789814 (owner: 10Ayounsi) [14:36:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 (owner: 10JMeybohm) [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [14:45:12] (03PS1) 10Ayounsi: Revert "hieradata: use netbox-next/deploy scap repo" [puppet] - 10https://gerrit.wikimedia.org/r/789345 [14:45:50] (03CR) 10Jbond: [C: 03+1] Revert "hieradata: use netbox-next/deploy scap repo" [puppet] - 10https://gerrit.wikimedia.org/r/789345 (owner: 10Ayounsi) [14:46:07] (03CR) 10Ayounsi: [C: 03+2] Revert "hieradata: use netbox-next/deploy scap repo" [puppet] - 10https://gerrit.wikimedia.org/r/789345 (owner: 10Ayounsi) [14:50:24] (03PS1) 10Jbond: cas: only install groovy file if u2f enabled [puppet] - 10https://gerrit.wikimedia.org/r/789818 [14:51:42] (03CR) 10Jbond: [C: 03+2] cas: only install groovy file if u2f enabled [puppet] - 10https://gerrit.wikimedia.org/r/789818 (owner: 10Jbond) [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:04:04] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) >>! In T303049#7903184, @Ottomata wrote: > FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it is... [15:04:10] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:15] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 00m 04s) [15:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:35] (03PS5) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:08:08] (03CR) 10jerkins-bot: [V: 04-1] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:09:06] (03PS2) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [15:09:56] (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [15:10:56] (03PS1) 10Jcrespo: Revert "install_server: Wipe backup1002 completely" [puppet] - 10https://gerrit.wikimedia.org/r/789826 [15:11:16] (03PS2) 10Jcrespo: Revert "install_server: Wipe backup1002 completely" [puppet] - 10https://gerrit.wikimedia.org/r/789826 [15:11:29] (03PS6) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:11:43] (03PS3) 10Jbond: rake: 
Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [15:12:16] (03CR) 10jerkins-bot: [V: 04-1] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:12:45] (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [15:14:15] (03PS7) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:15:03] (03CR) 10jerkins-bot: [V: 04-1] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1019.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host aqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:45] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:46] (03PS2) 10JHathaway: admin: add Fabian Kaelin to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/789681 (https://phabricator.wikimedia.org/T307573) [15:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:48] (03CR) 10Hokwelum: [C: 04-1] "Ariel and I tested this using the puppet compiler and it failed because systemd::timer::job needs a description parameter (console output " [puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [15:22:38] (03CR) 10JHathaway: [C: 03+2] admin: add Fabian Kaelin to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/789681 (https://phabricator.wikimedia.org/T307573) (owner: 10JHathaway) [15:24:19] 10SRE, 10SRE-Access-Requests, 10Generated Data Platform, 10Patch-For-Review: Request to add user fkaelin to analytics-platform-eng-admins group - https://phabricator.wikimedia.org/T307573 (10jhathaway) 05Open→03Resolved Commit has been merged in, please reopen if there are any problems, thanks! 
[15:26:50] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs1018.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] (03PS1) 10Jbond: P:netbox: Add libapache2-mod-wsgi-py3 [puppet] - 10https://gerrit.wikimedia.org/r/789821 [15:28:14] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:29:57] (03PS4) 10Ahmon Dancy: mediawiki 0.2.0: Add mw.localmemcached.enabled value [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 [15:30:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1019.mgmt.eqiad.wmnet with reboot policy FORCED [15:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1021.mgmt.eqiad.wmnet with reboot policy FORCED [15:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1017.mgmt.eqiad.wmnet with reboot policy FORCED [15:33:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1020.mgmt.eqiad.wmnet with reboot policy FORCED [15:33:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs1016.mgmt.eqiad.wmnet with reboot policy FORCED [15:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:48] (03CR) 10Jcrespo: [C: 03+2] Revert "install_server: Wipe backup1002 completely" [puppet] - 10https://gerrit.wikimedia.org/r/789826 (owner: 10Jcrespo) [15:34:31] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) [15:34:57] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) aqs1018 mgmt did not setup, I will need to check the cable [15:35:49] (03PS1) 10Jcrespo: Revert "install_server: Update backup-format recipe to install on sdb/sdc" [puppet] - 10https://gerrit.wikimedia.org/r/789828 [15:36:43] (03PS8) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:37:03] (03CR) 10Jcrespo: [C: 03+2] Revert "install_server: Update backup-format recipe to install on sdb/sdc" [puppet] - 10https://gerrit.wikimedia.org/r/789828 (owner: 10Jcrespo) [15:37:43] (03PS4) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [15:38:06] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:38:35] (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [15:45:09] (03PS2) 10BCornwall: admin: Add user "brett" to ops group [puppet] - 10https://gerrit.wikimedia.org/r/789727 [15:45:59] (03CR) 10BCornwall: admin: Add user "brett" to ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789727 (owner: 10BCornwall) [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:52:52] (03CR) 10Ssingh: [C: 03+1] "Thanks, looks good! We will merge this later. (After https://office.wikimedia.org/wiki/SRE/Training_Checklists#Goal:_Root_access)" [puppet] - 10https://gerrit.wikimedia.org/r/789727 (owner: 10BCornwall) [15:55:01] (03PS1) 10Giuseppe Lavagetto: service::docker: allow use of 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/789846 [15:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:57:02] (03PS2) 10Majavah: P:wmcs: unify toolsdb profiles [puppet] - 10https://gerrit.wikimedia.org/r/789611 [15:57:42] (03PS1) 10Cmjohnson: update site.pp and netboo.cfg with aqs1016-1021 [puppet] - 10https://gerrit.wikimedia.org/r/789847 (https://phabricator.wikimedia.org/T305570) [15:58:37] (03CR) 10Cmjohnson: [C: 03+2] update site.pp and netboo.cfg with aqs1016-1021 [puppet] - 10https://gerrit.wikimedia.org/r/789847 (https://phabricator.wikimedia.org/T305570) (owner: 10Cmjohnson) [15:58:40] (03Abandoned) 10Andrew Bogott: raid::hpsa: symlink hpssacli to ssacli [puppet] - 10https://gerrit.wikimedia.org/r/783866 (https://phabricator.wikimedia.org/T306354) (owner: 10Andrew Bogott) [15:58:41] (03CR) 10Majavah: P:wmcs: unify toolsdb profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789611 (owner: 10Majavah) [16:04:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye [16:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:00] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 
10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye [16:06:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1021.eqiad.wmnet with OS bullseye [16:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:22] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1021.eqiad.wmnet with OS bullseye [16:07:46] (03PS1) 10Ayounsi: Force MarkupSafe==2.0.1 [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789849 (https://phabricator.wikimedia.org/T296452) [16:08:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789849 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [16:09:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1017.eqiad.wmnet with OS bullseye [16:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:12] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye [16:09:20] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Force MarkupSafe==2.0.1 [software/netbox-deploy] (2-10-4-bullseye) - 10https://gerrit.wikimedia.org/r/789849 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [16:09:43] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 
[16:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:51] (03PS5) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [16:12:01] (03PS1) 10Hnowlan: scaffold: fix issue where volumes will be folded into comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/789850 [16:13:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:09] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye [16:13:19] (03CR) 10Jforrester: "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/789846 (owner: 10Giuseppe Lavagetto) [16:15:22] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Netbox bullseye on netbox-dev2002 (duration: 05m 39s) [16:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1019.eqiad.wmnet with OS bullseye [16:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:24] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1019.eqiad.wmnet with OS bullseye [16:29:20] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:34:26] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) 
for host aqs1021.eqiad.wmnet with OS bullseye [16:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:31] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1021.eqiad.wmnet with OS bullseye executed with errors: - aqs102... [16:37:21] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1017.eqiad.wmnet with OS bullseye [16:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:29] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1017.eqiad.wmnet with OS bullseye executed with errors: - aqs101... [16:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:39:20] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:41:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1016.eqiad.wmnet with OS bullseye [16:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:48] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye executed with errors: - aqs101... 
[16:52:38] (03PS1) 10Ayounsi: provision_server: validate port number for non VC switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789857 [17:07:36] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:29] (03CR) 10JHathaway: "Thanks for working on this, a couple of questions." [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [17:15:20] (03CR) 10JHathaway: [C: 03+2] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [17:15:44] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [17:17:59] (03PS1) 10Krinkle: static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) [17:19:02] (03CR) 10jerkins-bot: [V: 04-1] static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [17:20:50] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:24:23] 10SRE, 10serviceops: Provide node14 and node16 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10bd808) [17:30:44] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet [17:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:28] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) @dcaro (sorry to pick on you!) 
Is it possible we can allocate these IP addresses on the cloud switches, from the existing 192.168... [17:36:42] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet [17:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:58] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good to me! Obviously we should test before merging though." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789857 (owner: 10Ayounsi) [17:41:11] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1006.eqiad.wmnet [17:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:14] (03PS1) 10KartikMistry: ULS entrypoint: Do not show current language, fix domain redirects [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) [17:49:31] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1006.eqiad.wmnet [17:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:54:23] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1007.eqiad.wmnet [17:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:55] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [17:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:44] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35126/" [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [17:58:36] !log 
razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [17:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:49] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Dzahn) on releases1002: ` Error: Could not find group deployment Error: /Stage[main]/Helm/File[/var/cache/helm]/group: change from 'wikidev' to 'dep... [17:59:56] (03CR) 10Ori: [C: 03+1] service::docker: allow use of 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/789846 (owner: 10Giuseppe Lavagetto) [18:00:05] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint/releases/etc: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Dzahn) [18:00:56] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Dzahn) [18:02:10] 10SRE, 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10RobH) 05Open→03Resolved I fixed the netbox info on 2022-04-21 and neglected to resolve this task. [18:02:19] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1007.eqiad.wmnet [18:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:29] (03CR) 10Dzahn: [C: 03+2] "this installed apparmor on some hosts like build2001 and releases1002. 
they already had the 'libapparmor1' package but not the userland to" [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [18:06:48] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:23] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet [18:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:04] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet [18:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:56] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1009.eqiad.wmnet [18:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:38] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1009.eqiad.wmnet [18:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:52] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet [18:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:53] (03PS2) 10Krinkle: static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) [18:21:47] (03CR) 10jerkins-bot: [V: 04-1] static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [18:23:43] (03PS3) 10Krinkle: static.php: Remove unused handling of /static/current/ routes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789863 (https://phabricator.wikimedia.org/T302465) [18:24:36] !log razzi@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet [18:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:54] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [18:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:39:38] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [18:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [18:56:57] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1001.eqiad.wmnet [18:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:14] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:10] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host an-airflow1001.eqiad.wmnet [19:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:34] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet [19:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:36] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet [19:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:30] (03PS1) 10BryanDavis: toolforge: Add liblocale-codes-perl to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) [19:29:00] (03PS1) 10BryanDavis: perl: add liblocale-codes-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/789875 (https://phabricator.wikimedia.org/T307812) [19:31:33] (03CR) 10BryanDavis: "The package is available in stretch (where it is virtual), buster, and bullseye so no need for OS version guards." 
[puppet] - 10https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: 10BryanDavis) [19:33:10] (03CR) 10BryanDavis: [C: 03+2] perl: add liblocale-codes-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/789875 (https://phabricator.wikimedia.org/T307812) (owner: 10BryanDavis) [19:33:50] (03Merged) 10jenkins-bot: perl: add liblocale-codes-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/789875 (https://phabricator.wikimedia.org/T307812) (owner: 10BryanDavis) [19:42:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:47:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:03] (03CR) 10Majavah: [C: 04-1] "not available in stretch per https://packages.debian.org/source/oldstable/liblocale-codes-perl" [puppet] - 10https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: 10BryanDavis) [19:50:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:14] * jhathaway here [19:54:33] * cdanis here [19:54:44] I do see that thumbor has greatly increased latency and 429 rate in eqiad [19:54:58] (CR) RhinosF1: toolforge: Add liblocale-codes-perl to exec_environ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: BryanDavis) [19:55:00] <_joe_> so, a scraper? [19:55:08] (PS1) Hnowlan: New service: image-suggestion [deployment-charts] - https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) [19:55:13] * jbond here but anniversary today so would prefer to bow out if i can [19:55:17] <_joe_> if so, if we can identify it by UA, we can ratelimit it [19:55:18] jbond: shoo [19:55:22] thx [19:55:23] <_joe_> jbond: go away now [19:55:29] * jbond gone [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:30] (CR) BryanDavis: toolforge: Add liblocale-codes-perl to exec_environ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: BryanDavis) [19:57:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:55] need help? [19:58:02] marostegui: nah [19:58:14] it looks like the prober has been flapping (but below paging threshold) for a while now [19:58:17] happy to help if needed [19:58:59] here as well and it's the prometheus based monitoring again. this seems to happen every couple hours. just that this time it went slightly over threshold [19:59:07] I found this in logstash: Sampurna_Gandhi%2C_vol._90.pdf/page105-1280px- [19:59:09] cdanis: ooook, I will go to need soonish then [19:59:15] like it was trying to resize a pretty large PDF [19:59:18] not sure how to find UA [19:59:19] (CR) Majavah: toolforge: Add liblocale-codes-perl to exec_environ [puppet] - https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: BryanDavis) [19:59:19] here [20:00:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:32] <_joe_> mutante: yes there is a bunch of large images being requested [20:02:50] seems like that type of latency jump is pretty unusual, at least looking at the graphs [20:05:11] _joe_: what logs are you looking at? varnish 5xx? [20:05:13] (PS1) Bking: elastic: update deployment-prep hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) [20:05:18] _joe_: well, it's under "slow but successful probes", just that the probe took like 2.2 seconds. do we really have to act on it?
[20:05:20] <_joe_> cdanis: yes [20:06:24] I need to go, baby is paging me [20:06:50] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-iflorez-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:05] is it OK if I merge a Toolforge-only puppet change? or should I hold off? [20:07:40] _joe_: what dashboard are you looking at for varnish 5xx? [20:09:30] <_joe_> legoktm: go on [20:09:47] <_joe_> jhathaway: the "webrequest 5xx" dashboard on logstash [20:09:57] _joe_: thanks [20:10:08] <_joe_> or, the file on centrallog1001, /srv/log/webrequest/5xx.json [20:10:46] (CR) Legoktm: [C: +2] toolforge: Add liblocale-codes-perl to exec_environ [puppet] - https://gerrit.wikimedia.org/r/789874 (https://phabricator.wikimedia.org/T307812) (owner: BryanDavis) [20:11:36] "user_agent":"MyBib/1.0 (https://www.mybib.com/; mailto:support@mybib.com)", this one? [20:14:05] <_joe_> and yes, that looks probably as a candidate [20:14:18] <_joe_> mutante: most 429s were indeed for pdfs [20:14:36] <_joe_> I'm looking at thumbor's haproxy logs for 429s during the slowdown [20:15:41] <_joe_> looks like someone asked for all the pages in that pdf at once [20:16:06] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:12] and it has over 100 pages or so [20:16:21] <_joe_> over 1200 [20:16:44] there is a possible correlation between mybib (a bibliography generator) and Sampurna Gandhi, which are the writings of MK Gandhi [20:16:56] essentially on mybib you can enter a URL which it uses to fetch the PDF [20:16:59] which is what I think is happening here [20:17:08] and to generate that bibliography, it has to scan the entire PDF [20:17:17] <_joe_> well this is fetching thumbnails of all pages heh [20:17:28] <_joe_> if they just downloaded the pdf, we wouldn't have noticed
[20:17:47] hm right thumbor, wonder why the thumbnails though [20:17:52] so the part that they are getting 429s for doing this.. is "works as designed" ? [20:17:58] <_joe_> ok, it's quite late here [20:18:14] <_joe_> I'll go afk given it's not an immediate emergency [20:19:02] would you say anything even needs to be done? I am unclear about that now. It seems over and they got limited for sending 1000 pages at once seems right [20:19:40] we wouldn't have been paged before the prometheus based paging was activated, spike was legit.. but should it page [20:21:12] (Abandoned) Samtar: InitialiseSettings: Set wgRestrictDisplayTitle = false for specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/770120 (https://phabricator.wikimedia.org/T303665) (owner: Samtar) [20:24:38] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:28:28] well, I will make a call then and say it's not an immediate emergency / action item. thing was a one-time spike, got 429 as it should, resolved since 30 min and the part that it's paging is still the experimental paging.
if it keeps happening we will look into blocking mybib [20:30:07] Block them and say Zotero is way better :P [20:30:56] haha [20:34:32] (CR) Ebernhardson: [C: +1] elastic: update deployment-prep hostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [20:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:02:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:12:40] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:45] ^^ above was doh1001, BGP went down didn't come back for 30 seconds. [21:24:00] Nothing to worry about, state has returned to "warning" due to some down external peers, hence no recovery showing here. [21:26:15] thanks topranks! [21:26:31] I will check what happened after dinner, just to be sure [21:26:43] everything looks good though [21:27:43] Yeah, they've flapped once or twice before too. We might want to look at the BFD timers, esp. with a VM, and the hypervisor scheduling of same, possibly they are slightly too aggressive.
[21:27:56] I don't think any major worry anyway [21:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:02] (PS2) Jforrester: [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [22:21:38] (CR) Jforrester: [C: +1] "Tweaked the commit message based on how I've been historically marking these changes; OK for me to merge now?" [mediawiki-config] - https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [22:24:55] (PS1) BCornwall: icinga: Grant BCornwall host service command privs [puppet] - https://gerrit.wikimedia.org/r/789881 [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [22:58:16] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile -
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:23:20] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:23:44] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:23:56] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:25:19] (CR) BryanDavis: [C: +1] service::docker: allow use of 'latest' [puppet] - https://gerrit.wikimedia.org/r/789846 (owner: Giuseppe Lavagetto) [23:25:40] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:26:02] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:26:16] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:28:32] PROBLEM - SSH
on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:59:28] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook