[00:03:36] FYI - lvs testing complete for the evening.
[00:03:54] lvs1016 has been functionally replaced by lvs1020 as the new "backup" lvs
[00:04:25] lvs1016 is still sitting with puppet disabled and pybal stopped; I'm leaving it that way for tonight as it will make any sudden reversion of the state of affairs a little easier
[00:04:54] \o/ nice
[00:05:23] for now, lvs service config changes can go ahead like normal. Just keep in mind that the active set in eqiad is now lvs1013, 14, 15, and 20 (not 16) when you're going around restarting things, etc.
[00:05:52] (will hunt for docs to update tomorrow)
[00:17:37] Could someone remove me as a subscriber from https://phabricator.wikimedia.org/T297762. The 'One notification about an object which no longer exists or which you can no longer see was discarded.' notifications can be quite annoying (and buggy).
[00:19:31] (I'm just guessing the answer to my question on that task is 'no')
[00:21:13] zabe: done, and yeah, thanks for pointing that out
[00:21:33] thanks
[09:53:22] _joe_, jayme and other envoy users... I'm currently testing envoy on cp4025 and I'm seeing that it's reporting basically 2x the requests per second, see https://grafana.wikimedia.org/d/uz11QGcnk/ats-haproxy-envoy-cluster-view?orgId=1&refresh=30s&viewPanel=13 vs https://grafana-rw.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?orgId=1&var-site=ulsfo%20prometheus%2Fops&var-instance=cp4025&from=now-30m&to=now&viewPanel=73
[09:54:03] <_joe_> vgutierrez: uhhh we should be able to see if that's the case for the other stuff
[09:54:07] <_joe_> like the appservers
[09:54:11] <_joe_> let me take a look
[09:54:27] I'm using the envoy_http_downstream_rq_xx counter, and according to https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats it seems like the way to go
[09:55:21] <_joe_> vgutierrez: well I am not sure, but https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=local_port_80 gives the correct count
[09:55:30] <_joe_> cross-checked with apache counters
[09:55:59] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=17&orgId=1
[09:56:13] right
[09:56:22] but that's not the request rate for the downstream
[09:56:27] but for the upstream
[09:56:39] and it should map 1:1 if you've only got 1 upstream AFAIK
[09:56:52] let me check the upstream metrics...
[09:57:34] <_joe_> vgutierrez: exactly
[09:57:50] <_joe_> and we've never used the downstream metrics as we're more interested in the upstream ones
[10:01:16] funny
[10:01:31] the upstream metrics seem to provide the expected values: I've updated the 2xx panel: https://grafana-rw.wikimedia.org/d/uz11QGcnk/ats-haproxy-envoy-cluster-view?orgId=1&refresh=30s&viewPanel=13
[11:55:02] <_joe_> bblack: small semi-breakage from the migration from lvs1016: you didn't remove it from the list in lvs_class_hosts once you turned off pybal, so now the restart scripts time out when trying to connect to it
[11:55:20] <_joe_> because they extract the list of lvs servers to connect to from there
[12:24:32] bblack: another minor thing (from cron-spam): lvs1016 is currently unable to reach debmonitor.discovery.wmnet, so debmonitor is failing there
[12:52:44] volans: lvs1016 services were migrated to lvs1020 yesterday evening, which probably relates to that. It's ready for decom but still online via its primary interface.
[12:53:28] sry, I see that's mentioned above.
anyway, it just needs a bit of cleanup.
[12:54:17] The move went very well; I was especially impressed that the puppetdb import script would move cables from one server to another if the same switchport was suddenly found connected to something else.
[13:57:27] hmm. tox-docker doesn't seem to support python3.9
[13:57:40] hashar: would it be possible to get that fixed? ^
[13:57:52] (happy to file a task if you can tell me what tags should be on it)
[13:58:18] re: lvs1016, there's a fair bit of cleanup, but I'm trying to hold off on that a little bit in case we have any sudden reason to roll this back.
[14:00:10] (but yeah, that will commence later today I think - we'll reimage it to an insetup role and remove it from various lvs-specific configs, etc.)
[14:03:35] ack, thx
[14:05:42] hashar: ah, it's already tracked by https://phabricator.wikimedia.org/T289222. reading
[16:00:34] kormat: pretty much, yes. We would need python 3.9 / 3.10 for Buster, or whatever other pythons for Bullseye
[16:01:14] or find an alternative to debian packages for providing python on ci
[16:02:29] hashar: http://ppa.launchpad.net/deadsnakes/ppa/ubuntu/ would probably work
[16:43:14] * Emperor would rather have a distro python if possible
[19:08:14] https://www.wikidata.org/wiki/Wikidata:Tools/Wikidata_for_Firefox
[21:02:42] anyone about to give https://gerrit.wikimedia.org/r/c/operations/puppet/+/747610 a quick stamp?
[21:03:54] done
[21:03:55] legoktm: thanks!
[23:47:40] hm.. have we added restrictions to Graphite recently in terms of timeouts?
[23:47:41] https://grafana.wikimedia.org/d/000000430/resourceloader-modules-overview?orgId=1
[23:47:58] I can't seem to load the latency graphs here; they show an error each time due to "time out after 6.0 seconds"
[23:48:15] not sure why that's taking 6s though
[23:49:31] tried removing the transforms and reducing the range from 3d to 12h, but no dice
[23:49:50] query: 'MediaWiki.resourceloader_build.*.p99'
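
One way to follow up on the last item in the log is to time the Graphite query directly against the render API, which helps separate a genuinely slow backend query from a dashboard-side (Grafana) timeout. The sketch below is a minimal illustration, not anything confirmed in the conversation above: the graphite.wikimedia.org host, the 12h range, and the comparison against the 6-second figure from the error message are assumptions.

```python
# Minimal sketch: time a Graphite render API call for the query discussed above.
# Assumptions: graphite.wikimedia.org exposes the standard /render endpoint, and the
# target/from parameters below match what the dashboard panel actually requests.
import time

import requests

GRAPHITE_RENDER_URL = "https://graphite.wikimedia.org/render"  # assumed host
params = {
    "target": "MediaWiki.resourceloader_build.*.p99",  # query quoted in the log
    "from": "-12h",                                    # the reduced range tried above
    "format": "json",
}

start = time.monotonic()
resp = requests.get(GRAPHITE_RENDER_URL, params=params, timeout=30)
elapsed = time.monotonic() - start

print(f"HTTP {resp.status_code}, {len(resp.content)} bytes in {elapsed:.1f}s")
if elapsed > 6.0:
    # The Grafana error mentioned a 6.0 second timeout, so a slower response here
    # would point at the Graphite query itself rather than anything Grafana-side.
    print("slower than the 6s Grafana timeout; the backend query looks like the bottleneck")
```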