[08:24:11] 10Traffic, 10Pybal, 10Wikidata, 10wdwb-tech: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups - https://phabricator.wikimedia.org/T284981 (10Marostegui)
[08:29:18] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups - https://phabricator.wikimedia.org/T284981 (10Addshore) The query seems to come from https://gerrit.wikimedia.org/g/mediawiki/core/+/873118723cbe3c78e631bea4...
[08:31:03] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Ladsgroup) > Though I guess we also want to look at why the query tries to scan the whole...
[08:31:11] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Addshore)
[08:31:50] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Addshore) > 9:27 AM addshore: Maybe the issue is the schema change isn't bein...
[08:34:14] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) My guess is that as the schema change isn't made on the same transaction, the...
[08:40:57] (VarnishTrafficDrop) firing: (4) 33% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[08:45:56] (VarnishTrafficDrop) resolved: (4) 68% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:25:46] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 3 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) All the hosts have been recovered.
[09:33:57] (VarnishTrafficDrop) firing: 61% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:36:28] 10Traffic, 10netops, 10SRE: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10jcrespo)
[09:37:00] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Majavah)
[09:38:42] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Peachey88)
[09:38:57] (VarnishTrafficDrop) firing: (2) 32% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:54:20] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jcrespo)
[09:55:15] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jbond) Once telia issues have been resolved we need to repool ESQIN. @ayounsi can you confirm when we are good to repool
[10:28:57] (VarnishTrafficDrop) firing: (2) 69% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:33:57] (VarnishTrafficDrop) resolved: (2) 69% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[11:00:37] ema: I guess allowing CP to origin server traffic over v6, it more complicated than I can imagine? :)
[11:00:40] is*
[11:01:42] most of them go through LVSes
[11:02:06] and we currently setup v4 only there AFAIK
[11:05:24] 10netops, 10SRE: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi)
[11:05:33] 10netops, 10SRE, 10Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10ayounsi) 05Open→03Resolved Closed! After 4 years and 1 week.
[11:05:45] XioNox: excuse ignorance, what do you mean by the term "CP" here?
[11:06:01] the caching servers
[11:06:12] ok
[11:06:45] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions can sometimes help too
[11:07:19] also most likely outdated ;)
[11:07:23] that's great thanks :)
[12:29:39] XioNoX: yeah so as volans said we're currently using v4 on the LVSs, but I'm not sure how complicated it would be to use v6 instead
[12:30:58] pybal simply does things like ipvsadm -A [...] so I suppose one could pass an IPv6 there instead of IPv4 and in theory it should work
[12:35:51] in particular pybal's Server class looks up both A and AAAA records, so possibly if for instance mw1271.eqiad.wmnet had an AAAA record then pybal would attempt using that
[12:53:35] 10Traffic, 10Pybal, 10SRE: PyBal healthchecks should specify User-Agent instead of using "Twisted PageGetter" - https://phabricator.wikimedia.org/T246431 (10ema)
[13:08:42] I see. We do serve text-lb, etc.. over IPv6 so LVS does support v6
[13:12:49] yeah, that's for sure
[13:13:15] the question is what would it mean in practice to use it internally too
[14:54:43] 10Traffic, 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10ssingh) p:05Triage→03High
[14:55:06] ^ ema: apologies if this should not be "High". please feel free to change it accordingly
[15:43:23] sukhe: thanks, it looks like an issue with Special:Contributions
[17:31:12] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10serviceops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) I can confirm since a while these have been happening. The pattern is always: - only mgmt - only codfw - ran...
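The 12:29–13:13 exchange above is about whether pybal could program LVS with IPv6 realservers: pybal drives ipvsadm, and its Server class already resolves both A and AAAA records for backends. The following is a minimal, hypothetical Python sketch of that idea only, not pybal's actual code; the virtual-service address is a documentation placeholder, and mw1271.eqiad.wmnet (taken from the discussion) only resolves on Wikimedia's internal network.

    #!/usr/bin/env python3
    # Sketch only, not pybal: resolve a realserver name (AAAA first, then A)
    # and build the ipvsadm command that would add it to an existing virtual
    # service ("ipvsadm -A ..." creates the service; "-a" adds a realserver).
    import shlex
    import socket

    def resolve(host, family):
        """Return the first address of the given family for host, or None."""
        try:
            infos = socket.getaddrinfo(host, None, family)
        except socket.gaierror:
            return None
        return infos[0][4][0] if infos else None

    def fmt(addr, port):
        """ipvsadm expects IPv6 addresses wrapped in brackets."""
        return f"[{addr}]:{port}" if ":" in addr else f"{addr}:{port}"

    def add_realserver_cmd(vip, port, backend):
        """Prefer the backend's AAAA record, fall back to its A record."""
        addr = resolve(backend, socket.AF_INET6) or resolve(backend, socket.AF_INET)
        if addr is None:
            raise RuntimeError(f"{backend} has neither AAAA nor A records here")
        return shlex.join(
            ["ipvsadm", "-a", "-t", fmt(vip, port), "-r", fmt(addr, port), "-g"]
        )

    if __name__ == "__main__":
        # Placeholder v6 service address; mw1271.eqiad.wmnet will not resolve
        # outside the internal network, so expect an error when run elsewhere.
        print(add_realserver_cmd("2001:db8::1", 443, "mw1271.eqiad.wmnet"))

In other words, the load-balancing layer itself has no fundamental objection to v6 realservers; the open question in the chat is what it would mean operationally to give internal backends AAAA records and route cache-to-origin traffic over them.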
[18:06:06] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) I made a typo in the commit msg so this didn't link: https://gerrit.wikimedia.org/r/c/operations/dns/+/699957
[18:42:36] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) Ok @volans was kind enough to explain how I could just revert the original change instead: https://gerrit.wikimedia.org/r/c/...
[19:00:57] (VarnishTrafficDrop) firing: 64% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[19:05:57] (VarnishTrafficDrop) firing: (2) 55% GET drop in text@ulsfo during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[19:12:48] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) CR merged and DNS updated. All looks good, dns servers are returning the eqsin IPs again and traffic is back to normal level...
[19:30:57] (VarnishTrafficDrop) resolved: (2) 66% GET drop in text@ulsfo during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
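As a footnote to the 19:12 update above ("dns servers are returning the eqsin IPs again"): one rough way to eyeball a DNS repool from the outside is to compare the per-site service name with a geo-routed hostname. A small sketch follows, assuming text-lb.eqsin.wikimedia.org is the eqsin text service name; whether the geo-routed name actually points at eqsin depends on where the lookup runs, so this is an illustration rather than the actual repool tooling.

    #!/usr/bin/env python3
    # Print the addresses for the eqsin text service name and for a geo-routed
    # hostname. From a resolver inside eqsin's catchment area, the two sets
    # should overlap again once the datacenter is repooled in DNS.
    import socket

    def addresses(name):
        """Return the set of A/AAAA addresses a hostname resolves to."""
        try:
            infos = socket.getaddrinfo(name, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror as exc:
            return {f"lookup failed: {exc}"}
        return {info[4][0] for info in infos}

    if __name__ == "__main__":
        for name in ("text-lb.eqsin.wikimedia.org", "en.wikipedia.org"):
            print(f"{name:35} {', '.join(sorted(addresses(name)))}")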