[08:24:11] 10Traffic, 10Pybal, 10Wikidata, 10wdwb-tech: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups - https://phabricator.wikimedia.org/T284981 (10Marostegui)
[08:29:18] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups - https://phabricator.wikimedia.org/T284981 (10Addshore) The query seems to come from https://gerrit.wikimedia.org/g/mediawiki/core/+/873118723cbe3c78e631bea4...
[08:31:03] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Ladsgroup) > Though I guess we also want to look at why the query tries to scan the whole...
[08:31:11] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Addshore)
[08:31:50] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Addshore) > 9:27 AM addshore: Maybe the issue is the schema change isn't bein...
[08:34:14] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) My guess is that as the schema change isn't made on the same transaction, the...
[08:40:57] (VarnishTrafficDrop) firing: (4) 33% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[08:45:56] (VarnishTrafficDrop) resolved: (4) 68% GET drop in text@codfw during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:25:46] 10Traffic, 10MediaWiki-General, 10Platform Engineering, 10Pybal, and 3 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (10Marostegui) All the hosts have been recovered.
[09:33:57] (VarnishTrafficDrop) firing: 61% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:36:28] 10Traffic, 10netops, 10SRE: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10jcrespo)
[09:37:00] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Majavah)
[09:38:42] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedias eqsin datacenter has network connectivity issues (?) - https://phabricator.wikimedia.org/T284986 (10Peachey88)
[09:38:57] (VarnishTrafficDrop) firing: (2) 32% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:54:20] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jcrespo)
[09:55:15] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10jbond) Once telia issues have been resolved we need to repool ESQIN. @ayounsi can you confirm when we are good to repool
[10:28:57] (VarnishTrafficDrop) firing: (2) 69% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:33:57] (VarnishTrafficDrop) resolved: (2) 69% GET drop in text@eqsin during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[11:00:37] ema: I guess allowing CP to origin server traffic over v6, it more complicated than I can imagine? :)
[11:00:40] is*
[11:01:42] most of them go through LVSes
[11:02:06] and we currently setup v4 only there AFAIK
[11:05:24] 10netops, 10SRE: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi)
[11:05:33] 10netops, 10SRE, 10Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10ayounsi) 05Open→03Resolved Closed! After 4 years and 1 week.
[11:05:45] XioNox: excuse ignorance, what do you mean by the term "CP" here?
[11:06:01] the caching servers
[11:06:12] ok
[11:06:45] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions can sometimes help too
[11:07:19] also most likely outdated ;)
[11:07:23] that's great thanks :)
[12:29:39] XioNoX: yeah so as volans said we're currently using v4 on the LVSs, but I'm not sure how complicated it would be to use v6 instead
[12:30:58] pybal simply does things like ipvsadm -A [...] so I suppose one could pass an IPv6 there instead of IPv4 and in theory it should work
[12:35:51] in particular pybal's Server class looks up both A and AAAA records, so possibly if for instance mw1271.eqiad.wmnet had an AAAA record then pybal would attempt using that
[12:53:35] 10Traffic, 10Pybal, 10SRE: PyBal healthchecks should specify User-Agent instead of using "Twisted PageGetter" - https://phabricator.wikimedia.org/T246431 (10ema)
[13:08:42] I see. We do serve text-lb, etc.. over IPv6 so LVS does support v6
[13:12:49] yeah, that's for sure
[13:13:15] the question is what would it mean in practice to use it internally too
[14:54:43] 10Traffic, 10SRE: 503 errors from varnish - https://phabricator.wikimedia.org/T284996 (10ssingh) p:05Triage→03High
[14:55:06] ^ ema: apologies if this should not be "High". please feel free to change it accordingly
[15:43:23] sukhe: thanks, it looks like an issue with Special:Contributions
[17:31:12] 10netops, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10serviceops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Dzahn) I can confirm since a while these have been happening. The pattern is always: - only mgmt - only codfw - ran...
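The 12:29–13:13 exchange above is about whether pybal could program LVS with IPv6 realservers: pybal drives ipvsadm, and its Server class already resolves both A and AAAA records for backends. The following is a minimal, hypothetical Python sketch of that idea only, not pybal's actual code; the virtual-service address is a documentation placeholder, and mw1271.eqiad.wmnet (taken from the discussion) only resolves on Wikimedia's internal network.

    #!/usr/bin/env python3
    # Sketch only, not pybal: resolve a realserver name (AAAA first, then A)
    # and build the ipvsadm command that would add it to an existing virtual
    # service ("ipvsadm -A ..." creates the service; "-a" adds a realserver).
    import shlex
    import socket

    def resolve(host, family):
        """Return the first address of the given family for host, or None."""
        try:
            infos = socket.getaddrinfo(host, None, family)
        except socket.gaierror:
            return None
        return infos[0][4][0] if infos else None

    def fmt(addr, port):
        """ipvsadm expects IPv6 addresses wrapped in brackets."""
        return f"[{addr}]:{port}" if ":" in addr else f"{addr}:{port}"

    def add_realserver_cmd(vip, port, backend):
        """Prefer the backend's AAAA record, fall back to its A record."""
        addr = resolve(backend, socket.AF_INET6) or resolve(backend, socket.AF_INET)
        if addr is None:
            raise RuntimeError(f"{backend} has neither AAAA nor A records here")
        return shlex.join(
            ["ipvsadm", "-a", "-t", fmt(vip, port), "-r", fmt(addr, port), "-g"]
        )

    if __name__ == "__main__":
        # Placeholder v6 service address; mw1271.eqiad.wmnet will not resolve
        # outside the internal network, so expect an error when run elsewhere.
        print(add_realserver_cmd("2001:db8::1", 443, "mw1271.eqiad.wmnet"))

In other words, the load-balancing layer itself has no fundamental objection to v6 realservers; the open question in the chat is what it would mean operationally to give internal backends AAAA records and route cache-to-origin traffic over them.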
[18:06:06] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) I made a typo in the commit msg so this didn't link: https://gerrit.wikimedia.org/r/c/operations/dns/+/699957
[18:42:36] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) Ok @volans was kind enough to explain how I could just revert the original change instead: https://gerrit.wikimedia.org/r/c/...
[19:00:57] (VarnishTrafficDrop) firing: 64% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[19:05:57] (VarnishTrafficDrop) firing: (2) 55% GET drop in text@ulsfo during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[19:12:48] 10Traffic, 10netops, 10SRE, 10Wikimedia-Incident: Wikimedia's eqsin datacenter (Asia Pacific) had network connectivity issues - https://phabricator.wikimedia.org/T284986 (10cmooney) CR merged and DNS updated. All looks good, dns servers are returning the eqsin IPs again and traffic is back to normal level...
[19:30:57] (VarnishTrafficDrop) resolved: (2) 66% GET drop in text@ulsfo during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
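As a footnote to the 19:12 update above ("dns servers are returning the eqsin IPs again"): one rough way to eyeball a DNS repool from the outside is to compare the per-site service name with a geo-routed hostname. A small sketch follows, assuming text-lb.eqsin.wikimedia.org is the eqsin text service name; whether the geo-routed name actually points at eqsin depends on where the lookup runs, so this is an illustration rather than the actual repool tooling.

    #!/usr/bin/env python3
    # Print the addresses for the eqsin text service name and for a geo-routed
    # hostname. From a resolver inside eqsin's catchment area, the two sets
    # should overlap again once the datacenter is repooled in DNS.
    import socket

    def addresses(name):
        """Return the set of A/AAAA addresses a hostname resolves to."""
        try:
            infos = socket.getaddrinfo(name, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror as exc:
            return {f"lookup failed: {exc}"}
        return {info[4][0] for info in infos}

    if __name__ == "__main__":
        for name in ("text-lb.eqsin.wikimedia.org", "en.wikipedia.org"):
            print(f"{name:35} {', '.join(sorted(addresses(name)))}")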