[03:02:03] Traffic, API Platform, SRE, VisualEditor, and 2 others: Find out if Varnish is messing with ETags, and what to do about it. - https://phabricator.wikimedia.org/T310904 (ssastry)
[05:10:44] (PurgedHighEventLag) firing: (14) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[05:12:56] (HAProxyEdgeTrafficDrop) firing: 30% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:15:52] (PurgedHighEventLag) firing: (16) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[05:27:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:30:35] (PurgedHighEventLag) resolved: (32) High event process lag with purged on cp5001:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[05:36:56] (HAProxyEdgeTrafficDrop) firing: (5) 20% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:41:56] (HAProxyEdgeTrafficDrop) resolved: (6) 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:08:24] Traffic, SRE, Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (TheresNoTime) Many thanks for the report @AlexisJazz — looks like it's recovered now. Think the only publicly actionable item here is going to be adding an incident to https://www.wik...
[06:15:01] Traffic, SRE, Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (Marostegui) This should be, indeed, fixed by now.
[10:32:18] netops, Infrastructure-Foundations, netbox: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi) p:Triage→Low
[10:32:29] netops, Infrastructure-Foundations, netbox: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi)
[10:40:56] (HAProxyEdgeTrafficDrop) firing: (4) 43% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:45:56] (HAProxyEdgeTrafficDrop) resolved: (6) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:48:05] netops, Infrastructure-Foundations, netbox: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi) https://netbox-next.wikimedia.org/ipam/fhrp-groups/1/ {F35267517} Some quick thoughts: * It help get rid of duplicate IPs as a VIP is assigned to the group, which is the...
[14:37:19] sukhe: If you have the time today I'd love to get info on why we don't want DNS records for lvs (https://phabricator.wikimedia.org/T271144). No rush
[14:38:16] brett: yes happy to discuss it some time later (fixing another issue)
[14:39:39] brett: though I should have been more clear since the task seems to be just about LVS that vgutierrez is a better person to ask. I am happy to share re: authdns and others
[14:41:08] sukhe: Oh, then that's fine! Since the ticket is just for lvs I'll just turn to vgutierrez.
[15:02:40] brett: AFAIK there is no reason to not have AAAA records for the lvs cluster
[15:03:06] we've been doing that on purpose for ns[012].wikimedia.org. but that's it
[15:03:45] also, it's worth mentioning that the lvs cluster in esams, eqsin and drmrs have both A and AAAA records
[15:07:17] vgutierrez, brett: fyi volans is working on an improved netbox report for those
[15:10:56] vgutierrez: Is that to say that I'm safe to add the DNS name to these IP addresses?
[15:11:13] you won't be safe, the LVS servers will ;P
[15:12:04] the tragedy of life. Thanks for the reassurance :D
[15:12:28] if you wanna play it extra safe, hit first ulsfo
[15:13:24] aye aye. Do I need to "!log" this?
[15:13:58] nope, we don't usually log DNS changes
[15:14:47] * brett proceeds to set ulsfo
[15:25:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:30:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:31:27] Er, I was just going to set the IPs to be the hostname (in fqdn form, ofc) but I see that one of the lvs' has upload-lb.ulsfo.wikimedia.org set to the lo device. Should all the other device IPs be set to the same thing?
[16:07:36] Traffic, MediaWiki-General, SRE: Advance declaration of query parameters - https://phabricator.wikimedia.org/T310087 (ori) @Krinkle AIUI the OAuth 1 spec stipulates that parameters be normalized prior to computing a signature, so that should be OK. Not sure about 2.0.
[16:43:05] vgutierrez: https://netbox.wikimedia.org/search/?q=lvs4007.ulsfo.wmnet&obj_type= is an example of what I've done to all the lvs instances in ulsfo. Does that seem appropriate to you?
[17:18:26] is the wmf using any custom-built vmods currently?
[17:23:49] I guess cc sukhe since it's late for vgutierrez. If you are able to confirm that my netbox changes are appropriate I can move on to the other regions
[17:26:35] Compare against lvs3005
[17:31:47] Traffic, netops, Infrastructure-Foundations, SRE: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (ssingh) During the upgrade to bird2 today, the bird side of things seems to have caused no issues. The bird2 service started successfully and the configuration file was correct. Howeve...
[18:16:47] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh)
[18:17:32] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh) p:Triage→Low
[18:21:24] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (Dzahn) We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs. So I would suggest to check if you can get...
[18:22:48] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (ssingh) >>! In T311264#8023977, @Dzahn wrote: > We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs. >...
[18:23:40] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (Dzahn) See T283582
[20:29:26] Traffic, SRE, ops-eqsin: SSH on cp5012.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311264 (RobH) So if the idrac is accessible, the firmware update isn't OS impacting. However, I cannot login to this idrac interface via HTTPS or SSH, so it appears it'll have to be fully power dr...
[21:57:40] Traffic, SRE, Wikimedia-Incident: 503 Service Unavailable (June 23 2022) - https://phabricator.wikimedia.org/T311197 (colewhite) Open→Resolved a:colewhite Thank you for the report. Users experienced connectivity issues to the projects starting at 5:05 UTC. Service was restored at 05:11...
[22:32:57] (HAProxyEdgeTrafficDrop) firing: 47% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[22:37:57] (HAProxyEdgeTrafficDrop) resolved: (2) 49% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop