[07:30:22] brouberol: check -operations
[07:30:27] Your change has been merged
[07:30:37] thanks
[09:45:43] Emperor: belated thanks for the heads up on codfw thumbor (I was on PTO). error rate is back down now
[09:48:25] cool, thanks :)
[10:23:44] jelto: what are you looking at, so I can look at something else
[10:24:46] I still need a minute to get to my computer.
[10:25:18] ah, cool no prob
[10:25:36] I am checking k8s mw for now
[10:26:10] so there was a deployment window right at 10 UTC
[10:27:07] hnowlan: it started at 10:18 or so
[10:28:22] https://grafana.wikimedia.org/goto/PIhnAuLIR?orgId=1 when mw-web traffic started going down
[10:28:28] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&refresh=1m shows the traffic to appservers started at 10:10
[10:28:49] let me check superset just in case
[10:29:00] ok, I will check logstash
[10:31:21] effie: that's almost exactly when the jobs peaked
[10:33:00] claime mentioned commons having issues
[10:33:19] commons is having issues because it's served only by bare metal appservers, no k8s traffic
[10:33:29] jelto: anything standing out in superset?
[10:33:33] so if bare-metal fpm workers are saturated, commons is slow
[10:33:42] requests are up 6x on metal
[10:33:56] The fact that there's a request spike *only* on metal leads me to believe this is commons related
[10:34:12] otherwise we would be seeing a corresponding request rise on mw-on-k8s
[10:34:20] more in codfw than eqiad
[10:35:49] claime: I see a drop in the number of requests in mw-web too
[10:36:11] hnowlan: do you know which job peaked specifically?
[10:37:07] effie: wikibase-addUsagesforPage
[10:37:24] although I suspect that is either a side effect or a distraction, the rps spike is far larger than the jobs
[10:37:25] that is not very helpful, I have no idea what it does
[11:18:27] things are resolved, there was a traffic increase involved which has been sorted
[17:44:10] FYI we are moving some jobs off of the mw job queue today and tomorrow; in aggregate this will remove ~300 jobs/sec from the job runners
[17:44:35] the load itself still continues, but shifted to mw-api-int-async
[17:51:17] all looks pretty normal, it's already caught up with the lag