[01:07:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 35.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:12:32] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[09:16:45] good morning my neighbours o/. back at work. please let me know if there's something that requires my immediate attention. a lot to catch up on...
[09:16:57] kwakuofori: o/
[09:17:09] kwakuofori: I sent you an email about some holidays - that's all i have :)
[09:17:40] hey marostegui!
[09:18:24] sure. will look at it
[09:18:41] thanks
[11:24:23] Emperor: would it be okay for me to do that previously postponed thumbor pooling test today?
[11:25:52] Yeah, if you break swift I'll have an excuse to stop working on clinic duty tasks ;-)
[11:26:21] hnowlan: I have a meeting 15:30-16:00 UTC, otherwise my calendar's pretty clear - LMK when a good time for you is?
[11:27:25] Emperor: I'll do my best to avoid breaking it too much. Would about 16:00 UTC work?
[11:39:34] sure
[14:53:19] o/
[14:55:33] afternoon :)
[16:13:19] Emperor: if it's still okay I'd like to pool thumbor-k8s in the next few minutes
[16:18:46] Sure
[16:21:19] cool, going now.
[16:25:31] Emperor: Yep, looks like the same thing again. Depooling unless there's anything you can think of being worth checking live
[16:26:45] hnowlan: do we have a clear idea of where the 500s are coming from (are they from thumbor-on-k8s and passing through swift)?
[16:28:00] Emperor: unfortunately no - I don't see elevated levels of 500s in the thumbor logs themselves, and it doesn't seem like a high level of 5xx errors are being returned to users
[16:28:04] For the time being I'll depool
[16:28:29] OK, sorry it's remaining an elusive problem
[16:34:06] so there is this https://grafana.wikimedia.org/goto/YTIdwWa4z?orgId=1
[16:34:15] and they seem to specifically be read errors https://grafana.wikimedia.org/goto/1qkpwW-Vk?orgId=1
[16:34:22] 5xxs back to normal again
[16:35:52] oh, those are effectively showing the same data
[16:35:53] hnowlan: Hm, with the benefit of hindsight, it might be useful to have captured some of the requests going through thumbor so we could try and find them in logs
[16:36:04] same> yes
[16:47:36] Emperor: you can spot them pretty easily in proxy-access.log for the period by grepping for something like ^"Mar 8 16:23" and looking for " 503 "
[16:47:52] Not seeing a whole lot of corresponding tx ids showing up in server.log though
[16:48:00] but I have no idea if there should be a mapping there
[16:59:10] that gets ~1000 hits
[17:00:03] (you might find the tx id turning up on the backends that the request went to)
[17:07:36] huh, looks like the corresponding requests for a few given transactions just 404 on the backends
[17:24:15] right, in which case swift will have sent the request on to thumbor, and passed the error code from thumbor back to the user
[17:36:29] so just to be clear that means the 500s in the proxy log *are* 500s from thumbor?
[17:39:39] or to put it another way, will proxy-access.log contain 503s coming from thumbor?
[17:42:04] hnowlan: I think in this situation that yes, those 500s are coming from thumbor (and proxy-access.log is then logging them), in the midst of the custom 404 handler
[17:59:34] Emperor: alright, that makes things a bit clearer. I suspect we're hiding errors somewhat, I'll go back to the drawing board on that. thanks for the help
[21:43:53] (SessionStoreErrorRateHigh) firing: Session storage error rates (5xx) in eqiad are elevated - TODO - https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh
[21:53:53] (SessionStoreErrorRateHigh) resolved: Session storage error rates (5xx) in eqiad are elevated - TODO - https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh
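
(Note on the 16:47:36 grep: a rough Python sketch of the same search, in case it helps with the follow-up of matching tx ids against the backend logs. It only assumes what was said in the channel, namely that lines in proxy-access.log start with a timestamp such as "Mar 8 16:23" and carry the status as a space-delimited " 503 ". The function names and the transaction-id pattern are illustrative assumptions, not something confirmed against the real log format.)

```python
import re
from typing import Iterable, List

# ASSUMPTION: transaction ids look like "tx" + 21 hex chars + "-" + 10 hex chars.
# Adjust the pattern if the ids in your logs are shaped differently.
TX_ID = re.compile(r"\btx[0-9a-f]{21}-[0-9a-f]{10}\b")


def find_status_lines(
    lines: Iterable[str],
    time_prefix: str = "Mar 8 16:23",
    status: str = "503",
) -> List[str]:
    """Keep lines that start with time_prefix and contain ' <status> '.

    Mirrors grepping for ^"Mar 8 16:23" and then for " 503 ".
    The prefix must match the log's exact spacing.
    """
    needle = f" {status} "
    return [ln for ln in lines if ln.startswith(time_prefix) and needle in ln]


def transaction_ids(lines: Iterable[str]) -> List[str]:
    """Return the first tx-id-shaped token from each line that has one."""
    ids = []
    for ln in lines:
        match = TX_ID.search(ln)
        if match:
            ids.append(match.group(0))
    return ids
```

(Usage would be something like opening proxy-access.log, passing the file object to find_status_lines, and then feeding the resulting ids into a search of the backend server.log files.)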