[07:49:21] netops, Infrastructure-Foundations: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi) p:Triage→Low
[07:49:55] netops, Infrastructure-Foundations: Junos: resolve DNS through mgmt_junos - https://phabricator.wikimedia.org/T317175 (ayounsi)
[07:50:00] netops, Infrastructure-Foundations, SRE: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[07:50:08] netops, SRE: Junos changes for management-instance support on QFX - https://phabricator.wikimedia.org/T269340 (ayounsi)
[07:54:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 67.1528859072229% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[07:54:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:59:16] (VarnishTrafficDrop) firing: Varnish traffic in eqsin has dropped 23.48906404796916% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[08:24:16] (VarnishTrafficDrop) resolved: (2) Varnish traffic in eqsin has dropped 1.3243684288937405% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[08:36:02] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7af287ca-21ab-4f9d-adb3-478641fdd465) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reas...
[08:42:47] Traffic, SRE, Platform Team Initiatives (API Gateway), Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (Vgutierrez)
[08:42:51] Traffic, SRE: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (Vgutierrez) Open→Resolved ` $ curl -v -o /dev/null -s https://api.wikimedia.org/feed/v1/wikipedia/en/onthisday/all/09/07 2>&1 | egrep -i "geoip|wmf-last-access"; echo $? 1 ` cl...
[08:43:56] (PyBalBGPUnstable) firing: (2) PyBal BGP sessions on instance lvs5001 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[08:48:56] (PyBalBGPUnstable) firing: (3) PyBal BGP sessions on instance lvs5001 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[08:58:53] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (ayounsi)
[09:01:25] netops, Cloud-VPS, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (taavi)
[09:13:26] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:14:17] netops, Infrastructure-Foundations, SRE, Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (Peachey88)
[09:14:44] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (hashar) I apologize for my unclear comment, I was referring to the notes taking document at https://docs.google.com/document/d/1Ka9MQB8OwdzAzJVfZua...
[09:19:27] (PyBalBGPUnstable) resolved: PyBal BGP sessions on instance lvs5001 are failing - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqsin%20prometheus/ops&var-server=lvs5001 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:20:14] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=826e80d5-55a6-4bb6-ab1c-e094eba7f6cd) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) and their services with reas...
[09:24:46] Traffic, Phabricator, SRE, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) > Does it have to be converted to an incident report on Wikitech? It does. > I could do it but could use pairing with someone familiar w...
[09:26:56] (PyBalBGPUnstable) firing: (3) PyBal BGP sessions on instance lvs5001 are failing - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable
[09:33:11] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (Vgutierrez) p:Triage→Medium
[09:36:33] netops, Infrastructure-Foundations, SRE, Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (ayounsi)
[09:57:56] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:12:56] (HAProxyEdgeTrafficDrop) firing: (2) 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:13:16] (VarnishTrafficDrop) firing: Varnish traffic in drmrs has dropped 68.3476249073215% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[10:17:56] (HAProxyEdgeTrafficDrop) firing: (2) 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:23:16] (VarnishTrafficDrop) resolved: Varnish traffic in drmrs has dropped 68.54074239128748% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[10:35:37] netops, Infrastructure-Foundations, SRE, Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (ayounsi)
[10:35:59] netops, Cloud-Services, Infrastructure-Foundations, SRE, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Just to confirm the change from CNAME back to A records has worked, my BIND server at home is able to resolve WMCS names again. In...
[10:37:56] (HAProxyEdgeTrafficDrop) resolved: (2) 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[11:12:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney) Upgrade completed ok for cr2-eqsin and cr3-eqsin. Went straight to 21.2R3-S2.9 based on experience in ulsfo, all went ok. Used no-validate when addi...
[11:13:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (cmooney)
[11:26:23] netops, Infrastructure-Foundations, SRE, Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (ayounsi)
[13:10:38] (LVSHighCPU) firing: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:15:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:38:26] Traffic, SRE: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (Aklapper) a:MMandere→None Removing inactive assignee (please do so as part of offboarding - thanks!)
[14:38:33] Traffic, SRE: Clean up Traffic Grafana dashboards to reflect HA-Proxy metrics - https://phabricator.wikimedia.org/T304153 (Aklapper) a:MMandere→None Removing inactive assignee (please do so as part of offboarding - thanks!)
[14:39:19] Traffic, SRE, Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (Aklapper) a:MMandere→None Removing inactive assignee (please do so as part of offboarding - thanks!)
[14:50:52] netops, Infrastructure-Foundations, SRE, Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (Aklapper) Assuming this task is not literally neverending (if it was, it should be a project tag instead)
[15:20:52] netops, Infrastructure-Foundations, SRE, Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (ayounsi) It is not never-ending as it's about converting existing hosts. Either they will or won't. New services are out of scope.
[16:15:53] vgutierrez: are you able to deploy _joe_'s change? do you need me for that?
[16:16:27] vgutierrez: err, I didn't notice the time. tomorrow perhaps :)
[16:18:01] is ats going to replace varnish?
[16:20:54] hard question
[16:21:17] It was the initial idea. It replaced successfully varnish-be
[16:21:31] But it's still far from being ready to replace varnish-fe
[18:20:30] Traffic, SRE, Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (BCornwall) Stalled→In progress
[21:41:26] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH)
[21:41:40] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH)
[21:47:42] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH) a:BBlack @bblack, I have all of these staged for racking onsite (basically stacked in the racks but not on rails.) I have a few pending questions for you on these: 1) Half of these...
[21:54:15] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[21:56:23] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[22:03:35] Traffic, DC-Ops, SRE, ops-ulsfo: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) p:Triage→Medium
[22:07:37] Traffic, DC-Ops, SRE, ops-ulsfo: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) a:RobH→BBlack @bblack, The current question on T317244, is can I decom cp4021 and replace it with new cp4037 for testing? If so, is cp4037 to be a single or dual NVMe host? Once...
[22:07:48] Traffic, DC-Ops, SRE, ops-ulsfo: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH)