[01:07:19] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 7.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:08:29] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[09:08:01] I wonder if future Hadoop-run mw jobs could have issues when codfw is primary?
[09:46:14] hey folks. Hope this is the right channel to ask: the mw-page-content-change-enrich (flink) app in eqiad is failing to start up because it can't reach swift https://logstash.wikimedia.org/goto/ce1765e186329ed74f179d375f8df182. We need swift for HA, so the k8s operator throws in the towel at boot.
[09:47:04] the app is multi-dc, and started to consume traffic from codfw. There seems to be no connectivity issue from that DC.
[09:47:28] would you maybe have any pointers wrt troubleshooting this?
[09:48:22] apologies if I missed some phab task / comms. It's my first DC switchover involving k8s and swift.
[09:51:50] swift-rw is pooled in codfw only, swift and swift-ro were depooled from eqiad, but then repooled in both DCs because of thumbor issues
[09:52:22] Ah it's thanos-swift
[09:52:36] thanos-swift was depooled from eqiad
[09:53:26] The question is why the cross-DC call is failing
[09:55:44] claime ack
[09:56:00] egress only allows calling to eqiad
[09:56:14] mmm... config issue on my end then
[09:56:16] Allowing egress traffic:
[09:56:18] To Port: 443/TCP
[09:56:20] To:
[09:56:22] IPBlock:
[09:56:24] CIDR: 10.2.2.54/32
[09:56:26] Except:
[09:56:52] nice find!
[09:57:34] another (unrelated but interesting) question is: why don't those logs have regular k8s annotations like namespace and master URL?
[09:58:22] gmodena: The egress is set in values-staging.yaml for your deployment
[09:58:52] Well in values-*
[09:59:33] However, since you've set values-eqiad and values-codfw to only be able to talk to the dc-local thanos-swift, I'm wondering if there's a reason for that segmentation?
[10:05:33] jynus: the DAG will always run from eqiad, it'll write to cassandra and be exposed via something (AQS or an http-shim) and the maint script in the primary dc would just call that http entry point. so it shouldn't matter
[10:06:09] claime going through some phab tasks to restore context.
[10:06:52] we use swift to checkpoint kafka offsets, and we want the eqiad app to track eqiad offsets
[10:07:01] but that should be ensured at bucket level
[10:07:03] gmodena: If you need the dc-local segmentation, your app should call the dc-local datastore DNS name, i.e. thanos-swift.svc.codfw.wmnet or thanos-swift.svc.eqiad.wmnet, and then the egress being set to only the dc-local IP makes sense
[10:08:09] If you call thanos-swift.discovery.wmnet (as is the case at least in staging), your app should be dc-agnostic, and the egress set for both DC IPs
[10:08:09] claime got it.
[10:08:15] Does that make sense?
[10:09:10] claime it does, thanks!
[10:09:22] np :)
[10:09:29] apologies for the noise, and thanks for the explanation. TIL :)
[10:11:22] No problem, the switchover process is also important exactly to find that kind of issue
[11:30:35] arnaudb marostegui: The schema changes repo is now in GitLab: https://gitlab.wikimedia.org/repos/sre/schema-changes
[11:30:43] nice
[11:30:52] cool!
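A minimal sketch of how to verify the distinction claime describes above (dc-local service names vs the discovery name, and the egress rule applied to the app); the commands assume shell access with the usual kubectl and dig tooling and do not appear verbatim in the conversation, and the namespace is a placeholder:

    # Show the NetworkPolicy applied to the deployment; the egress section should list
    # the allowed CIDRs (e.g. only 10.2.2.54/32, as in the paste above).
    kubectl -n <your-namespace> describe networkpolicy
    # The discovery name resolves to whichever DC(s) are currently pooled, so an app
    # calling it needs egress allowed to both DC service IPs.
    dig +short thanos-swift.discovery.wmnet
    # The dc-local names always resolve to their own DC's service IP, which matches an
    # egress rule restricted to the dc-local IP.
    dig +short thanos-swift.svc.eqiad.wmnet
    dig +short thanos-swift.svc.codfw.wmnet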
[11:35:31] claime I don't understand why our app did not fail before, when codfw was the passive DC. It should still have tried to connect to swift. Could thanos-swift.discovery.wmnet have been routing to a pooled codfw swift cluster in that case (and we just got lucky)?
[11:35:51] gmodena: Yeah, it's active/active, so both were pooled
[11:36:01] When we did the switchover yesterday, we depooled eqiad
[11:36:15] claime thanks for confirming
[11:36:21] It will be repooled next week, but part of the process is to try and handle everything from one DC
[11:36:29] I'm writing up an incident report, wanted to get the details right :)
[11:36:40] ack :)
[13:11:09] marostegui andrewbogott for my own edification, for the wiki replicas depool would it be https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wikireplicas/update-views.py#127 to initiate the drain the nice way, then https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wikireplicas/update-views.py#155 to check that it drained nicely,
[13:11:10] then https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Depool_wikireplicas#Wikireplica_database_servers (like Andrew's patch) to force the haproxy reload - all from dbproxy1018, then finally `sudo systemctl restart mariadb` on clouddb1019?
[13:11:38] (not saying a cookbook is needed, just pointing at the sample commands)
[13:12:11] dr0ptp4kt: With draining I meant basically check that there are no running queries when you issue a mysql -e "show processlist"
[13:12:22] if it is empty, then you can issue the proxy reload
[13:13:03] root@cumin1001:~# db-mysql clouddb1019:3314 -e "show processlist" | egrep -v "system|root|orch|wmf-pt|State"
[13:13:03] root@cumin1001:~#
[13:13:05] That one is empty
[13:13:09] oh right - for andrew he can just root auth into clouddb1019's mariadb instances to check that, right? for me, i just have the querysampler id
[13:13:15] root@cumin1001:~# db-mysql clouddb1019:3316 -e "show processlist" | egrep -v "system|root|orch|wmf-pt|State"
[13:13:15] 5370405 s52321 10.64.37.27:36206 ruwiki_p Sleep 8 NULL 0.000
[13:13:16] That one is not
[13:13:32] To be honest, if there are just a few connections, I would just reload the proxy
[13:14:04] yeah - that's what happened the other day to get clouddb1017 reconnected iirc
[13:14:15] I depooled 1019 and thought I'd just wait for current queries to finish up on their own. I see now that that never happened
[13:14:30] that or I failed to actually depool
[13:14:55] taking kid to bus, then jogging, will check back later
[13:14:56] andrewbogott: Or that user simply never let his thread die (which is a bad practice)
[13:15:17] So far the only thing I've looked at is memory usage, which remains totally flat
[13:15:24] I assume that means it's still doing some epic join
[13:15:40] andrewbogott: I wouldn't trust the memory graph to judge if it is empty or not, that is unlikely to change
[13:15:47] ok
[13:15:49] You'd better check the connections or disk usage
[13:15:56] as in IOPs
[13:15:56] * andrewbogott has to figure out where the socket files are for this
[13:16:31] Under /run/mysqd...
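A consolidated sketch of the drain check marostegui describes above, assuming the db-mysql wrapper and the thread filter used in the pastes; an empty result for an instance means it has drained and the proxy can be reloaded:

    # Check both mariadb instances on clouddb1019 (s4 on port 3314, s6 on port 3316) for
    # client activity, filtering out system/maintenance threads as in the pastes above.
    for port in 3314 3316; do
        echo "== clouddb1019:${port} =="
        db-mysql "clouddb1019:${port}" -e "show processlist" | egrep -v "system|root|orch|wmf-pt|State"
    done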
[13:16:40] iops aren't overloaded but there's a busy mysqld process
[13:16:59] The process will always be busy (replication)
[13:17:22] root@cumin1001:~# db-mysql clouddb1019:3314 -e "show processlist" | egrep -v "system|root|orch|wmf-pt|State"
[13:17:22] root@cumin1001:~# db-mysql clouddb1019:3316 -e "show processlist" | egrep -v "system|root|orch|wmf-pt|State"
[13:17:22] 5370405 s52321 10.64.37.27:36206 ruwiki_p Sleep 8 NULL 0.000
[13:17:33] The host is basically empty, you can probably go ahead and reload the proxy and then restart mysql
[13:17:44] ok, thank you
[13:17:56] Once it is restarted, don't forget to issue mysql -S $SOCKET_PATH -e "start slave"
[13:17:59] So you start replication
[13:18:08] That is needed for both sockets, s4 and s6
[13:19:16] 'the proxy' in this case is haproxy right?
[13:19:31] yeah, sorry
[13:19:53] ok, restarted haproxy on dbproxy1018
[13:19:56] now restarting mysql...
[13:20:19] and to restart mariadb: systemctl mariadb@s4 restart
[13:20:23] and same with s6
[13:20:50] well actually restart mariadb@s4
[13:21:35] https://www.irccloud.com/pastebin/2YOy0srC/
[13:22:12] and now I guess I can repool that host and watch it go OOM again as soon as that bad query runs again :)
[13:22:32] yeah
[13:22:46] If it is a particular query, remember there is a pt-kill that can be customized
[13:22:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/959018
[13:23:13] I don't know if it's a particular query, but that will be the next step if the memory use spikes again
[13:23:15] marostegui: https://gerrit.wikimedia.org/r/c/operations/puppet/+/959018
[13:23:57] dr0ptp4kt: I'll look at those cookbook links as soon as I'm doing fewer things at once :)
[13:24:02] I don't know if it is enabled on clouddb, but performance_schema and sys profiling may help you debug once restarted
[13:24:11] andrewbogott: go for it
[13:24:54] do I need to manually restart haproxy after that's applied?
[13:28:25] * andrewbogott does it anyway
[13:28:44] so now everything should be back to 'normal' until memory usage spikes again
[13:30:15] yes you have to
[13:30:20] a reload is enough too
[16:56:30] Right, I've taken upstream's review comments on board, so now I can leave the build running for the next 2.5hrs and review in the morning.
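For reference, a hedged sketch of the reload-and-restart sequence andrewbogott and marostegui walk through above; host and unit names are as given in the conversation, and $SOCKET_PATH stands for the per-instance socket path, which is only shown truncated in the log:

    # On dbproxy1018: reload haproxy so the depool/repool change takes effect
    # (per marostegui, a reload is sufficient, a full restart is not required).
    systemctl reload haproxy
    # On clouddb1019: restart each mariadb instance.
    systemctl restart mariadb@s4
    systemctl restart mariadb@s6
    # As marostegui notes, replication must be started by hand after the restart,
    # once per socket (s4 and s6).
    mysql -S $SOCKET_PATH -e "start slave"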