[00:59:43] 10serviceops, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Legoktm) 05Open→03Resolved a:03Legoktm A recap blog post was published a few days ago: https://techblog.wikimedia.org/2021/07/23/june-2021-data-cent...
[07:15:57] hey folks, good news about the cpu/latency regression, it seems gone with the new iptables
[07:20:04] <_joe_> elukey: I was pretty sure, given what you found
[07:20:18] <_joe_> well, what topranks found, you were just a bystander ofc
[07:24:01] yep yep, as always
[07:24:37] I left a note in the task about the fact that apt::package_from_component does not automatically upgrade the pre-installed iptables to the component's version
[07:24:55] I had to do it manually, but I didn't check if there is a way to force the install
[07:25:03] (maybe specifying the version)
[07:25:41] I'll try to look for a solution asap, but it's not super urgent given that nothing is exploding anymore
[07:28:19] <_joe_> that is by design
[07:28:30] <_joe_> puppet doesn't handle package upgrades
[07:31:03] I know, but I am pretty sure that after a reimage we'll forget to upgrade iptables :D
[07:36:56] 10serviceops, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (10dcausse) Will set `kubernetes.disable.hostname.verification` to true for the k8s client for...
[07:38:43] <_joe_> elukey: because it's part of the base system, right?
[07:39:09] <_joe_> elukey: so in this case, you should probably talk to moritzm about creating a base image with the updated iptables
[07:39:19] <_joe_> a tftp image, I mean
[07:40:06] _joe_: could be an option, yes
[08:11:17] 10serviceops, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (10JMeybohm) That is because your application is reading default kubernetes environment variab...
[08:36:36] I'm looking for ex-services team folks who might know about the alerts here; who do you reckon is best? https://gerrit.wikimedia.org/r/c/operations/puppet/+/708476
[08:38:00] <_joe_> godog: Pchelolo / urandom are your best bet
[08:38:49] makes sense, thanks
[10:01:09] Hello. I'm hoping to create a new DNS discovery service for analytics-presto, which we'll use for managing active/passive failovers of Presto coordinators in eqiad.
[10:01:57] So far I have a patch which creates a discovery service in the test cluster: https://gerrit.wikimedia.org/r/c/operations/puppet/+/706661 - even though we only have a single coordinator to use in the test cluster.
[10:03:42] Would it be possible to get someone from your team to check the conftool and hieradata parts related to the DNS discovery service, and to let me know if you think there are any issues with this approach?
[10:03:48] <_joe_> btullis: uhm, there are a few problems with your plan
[10:03:59] <_joe_> mainly how our discovery dns stuff works
[10:04:24] <_joe_> let me think about it for a sec, but yes, I'll take a look
[10:04:55] <_joe_> basically we add entries to a geodns table, so for purely A/P services it would work within a single DC
[10:05:00] <_joe_> for A/A, it wouldn't
[10:05:23] Thanks ever so much.
We've used a CNAME for this kind of active/standby failover in the past, but we thought a conftool change would be so much cleaner if we can make it work.
[10:05:50] <_joe_> so, in the case of this test, you only have one dnsdisc entry, "eqiad"
[10:06:02] <_joe_> so if you set it to false, your service will just fail
[10:06:13] Not planning A/A at the moment. Nor multi-DC.
[10:06:28] <_joe_> (dns is programmed to redirect to failoid, which is our blackhole service ip :P)
[10:07:08] <_joe_> btullis: one thing to verify is whether discovery works without LVS, which I am not sure about
[10:07:28] <_joe_> but lemme finish something else and I'll get back to you
[10:07:48] Sure, no hurry at all. Thanks again _joe_.
[10:22:16] <_joe_> btullis: so, let me go check how the dyndns stuff works for a sec, I want to understand if it would work for your use-case, because it's not what it was originally designed to do
[10:23:39] <_joe_> so it is configured via profile::dns::auth::discovery
[10:24:30] Right. Looking now.
[10:24:38] <_joe_> that includes a file called discovery-map
[10:24:56] <_joe_> which is located at profile/files/dns/auth/discovery-map
[10:25:07] <_joe_> datacenters => [eqiad codfw]
[10:25:12] <_joe_> map => { default => [eqiad codfw] }
[10:25:38] <_joe_> so it's all really designed to work when you have a codfw and an eqiad cluster
[10:26:29] <_joe_> if we want to support your use-case, we need to work on this stuff, and I think the best person to help with that is b.black, but I'm not sure if he's back from PTO yet
[10:27:30] <_joe_> so my suggestion would be to write down what you want to achieve specifically in a task, and to get someone from the traffic team to look into how hard it would be to support using dns discovery for failover within a single datacenter
[10:29:32] <_joe_> it's possible it would work as-is, but I am just not sure
[10:31:14] <_joe_> it's very possible there is a much more straightforward way to configure that in gdnsd
[10:33:37] OK, will do. Many thanks. I had reasoned that it would work because of the references to 'xyz is not an LVS service' and 'the IPs are those of the VMs', e.g. here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#2817
[10:35:19] Anyway, I really appreciate your time and I'll make that ticket setting out our potential use case(s). If we have to stick with DNS CNAME entries for manual failover, then so be it.
[10:36:50] <_joe_> btullis: my suggestion would be to stick to cnames for now and consider using confctl-based dns discovery later, so that you don't get blocked right now
[10:37:32] <_joe_> to be clear: your current patch would probably work, but for the real use-case you'd want names different from "eqiad" and "codfw" in your map
[10:37:43] <_joe_> and I think that would cause an error
[10:42:47] Understood. Thanks. Will update my patch and probably revert to CNAMEs for now, as you suggest.
[10:59:40] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team, 10SRE: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe)
[11:00:55] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team, 10SRE: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) As far as I know, we already generate an image for every +2 in mediawiki-config, so I'll assume that part is al...
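(Returning to the confctl-based dns discovery idea discussed above: operationally, an A/P failover would come down to flipping the pooled state of the dnsdisc object. The following is only a hedged sketch, not an endorsed procedure; 'analytics-presto' is taken from the patch under discussion and the exact selector syntax should be checked against the confctl documentation. As _joe_ points out, with a single "eqiad" entry depooling it would just send the record to failoid.)

    # read the current state of the hypothetical dnsdisc object
    confctl --object-type discovery select 'dnsdisc=analytics-presto' get
    # depool the eqiad entry; with only one datacenter entry this blackholes
    # the service rather than failing over to another coordinator
    confctl --object-type discovery select 'dnsdisc=analytics-presto,name=eqiad' set/pooled=false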
[11:08:47] 10serviceops, 10SRE, 10decommission-hardware, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn)
[11:12:31] 10serviceops, 10SRE, 10decommission-hardware, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) @wiki_willy You guys can remove all old mw appservers from eqiad rack A5 and rack A8 already, they are decom...
[11:14:39] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) p:05Triage→03High
[11:14:42] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) @wiki_willy Here it would be great for us if next someone could finish the setup of mw1447 through mw1450 and take a look at special case mw1444 which shoul...
[12:16:47] 10serviceops, 10Data-Persistence, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi)
[12:18:14] 10serviceops, 10Data-Persistence, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi)
[12:22:03] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Marostegui) dbproxy2* can be done anytime dbproxy1018 and dbproxy1019 are owned by the cloud services team. The other dbproxies hosts ar...
[12:34:32] 10serviceops, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (10dcausse) seems to be fixed now by providing explicit K8S client env.
[13:43:26] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) mw1434 has an issue with IPMI ` Remote IPMI failed for mgmt 'mw1434.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'mw1434.mgmt.eqiad.wmn...
[13:45:18] _joe_: re dns discovery, it does work without lvs, although it's not used much. also good to know that gdnsd only expects one ip address per DC, i assumed it was more flexible than that and did some simple round-robin load balancing or something like that
[13:45:19] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet'] ` The log can...
[13:47:41] <_joe_> jbond: I'm sure you can configure it that way
[13:47:59] <_joe_> but our current system kinda expects the map to include one entry for codfw and one for eqiad
[13:49:26] hello folks, first time that I change a chart, does it look good?
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708529
[13:49:27] <_joe_> and I'm not sure it would work for different mappings
[13:49:32] _joe_: ack thanks, will follow along with btu.llis discussion with bbl.ack
[13:50:19] <_joe_> elukey: I would assume such a change could be managed via values.yaml, but I think in your case you just want to import the yaml blindly, correct?
[13:50:47] _joe_: yes yes, exactly
[13:50:48] <_joe_> I mean from the values.yaml in helmfile.d
[13:51:02] <_joe_> you don't need to change the chart, you can just override that
[13:51:43] yeah, but I thought it would have been a little confusing to have different values like that; if not, I can leave the chart alone and just override values
[13:52:03] (different values meaning the chart saying xyz and the rest something different)
[13:52:06] <_joe_> yeah, I think that's how we've been doing it
[13:52:33] <_joe_> the values.yaml in the chart should be considered "defaults" that we can easily override, and we should in production
[13:52:43] perfect, fixing
[13:57:57] <_joe_> yeah, I think this new patch is much better
[13:58:11] <_joe_> it also allows you to have a direct grasp of what you're installing
[14:09:51] could I get a review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/698984 ?
[14:18:06] moritzm: looking. from https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/800/console it looks like jenkins was happy but PCC didn't compile
[14:19:27] yeah, I see the same on a fresh pcc run https://puppet-compiler.wmflabs.org/compiler1002/30388/
[14:19:34] rzl: sorry, wrong patch! I thought I had already abandoned that one. the correct patch is actually https://gerrit.wikimedia.org/r/c/operations/puppet/+/702117
[14:19:43] oh! okay
[14:19:58] whew, I was hoping you hadn't been waiting *that* long
[14:20:29] (oops, but this one has actually had my name on it for a month, that's worse)
[14:26:52] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet'] ` and were **ALL** successful.
[14:28:29] moritzm: +1
[14:29:07] rzl: thanks! I'll merge this tomorrow
[14:29:20] sounds good
[14:30:33] mostly offtopic, but I don't actually know why we use nginx there in the first place -- it seems like we could replace it with envoy, which we're using for TLS proxying elsewhere, and have One Less Thing
[14:30:44] maybe just historical reasons, and it isn't worth the effort/upheaval of replacing it
[14:31:50] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi) Thank you @Marostegui for the info! Yes this can wait next week or the week after no problem.
[14:33:47] _joe_: jbond: regarding our earlier discussion about DNS discovery, I have created a feature request and assigned it to traffic for triage, as suggested: https://phabricator.wikimedia.org/T287584
[14:39:36] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw1434.eqiad.wmnet'] ` The log can be found in `/var/log/...
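(To illustrate _joe_'s point above about treating the chart's values.yaml as defaults that production overrides from helmfile.d: a hedged YAML sketch of the pattern, with purely illustrative file paths and key names rather than the actual chart under review.)

    # charts/example-chart/values.yaml -- defaults shipped with the chart
    config:
      replicas: 1
      log_level: info

    # helmfile.d/services/example-service/values-production.yaml -- production
    # values merged on top of the chart defaults at deploy time, overriding them
    config:
      replicas: 4
      log_level: warning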
[14:41:51] <_joe_> btullis: great, sorry, in a meeting rn, I'll take a look soon-ish
[14:43:15] btullis: thx
[14:47:07] Hey! Can somebody help with reviewing this patch? It would be helpful to unblock our work on staging tegola-vector-tiles. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708494
[14:47:11] rzl: yeah, exactly, there are a few other cases where nginx is purely used for TLS termination which haven't switched to envoy either. but I did a fleet-wide sweep across all nginx installs to validate the flavours in use
[14:48:52] nod
[14:48:55] <_joe_> nemo-yiannis: I guess you need to add hnowlan to that patch, I miss all the context about maps
[14:49:51] <_joe_> rzl: sorry, I was in a meeting as well
[14:50:02] <_joe_> no, we definitely can't use envoy for proxying etcd
[14:50:51] <_joe_> the only reason we use nginx there is because it makes it easier to do fine-grained RBAC for etcd without the perf penalty that etcdv2 imposes on responses whenever we run RBAC on the etcd cluster
[14:51:17] <_joe_> so once we move conftool to use the etcdv3 datastore, we might drop nginx in front there
[14:51:32] ohh interesting
[14:51:54] <_joe_> doing the same stuff we do with nginx with envoy would require us to write a gRPC service for auth :P
[14:52:46] in that case, will you double-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/702117 as well to make sure I didn't miss anything?
[14:52:52] <_joe_> yeah so
[14:53:01] I trust Moritz's read, but that's a lot more context I didn't realize I didn't have :P
[14:53:05] <_joe_> I have seen that patch, and never had the time to read into it
[14:53:18] <_joe_> moritzm: what is nginx-light dropping specifically?
[14:53:28] <_joe_> we don't use any lua, for instance, in the conf servers
[14:54:20] _joe_: both effie and hnowlan (who work on maps) are on vacation :/ i added some context on the patch, but if it's not enough i guess it can wait until folks are back.
[14:54:23] <_joe_> sorry, I got two other meetings starting in 3 minutes
[14:54:33] <_joe_> oh, they're both on vacation?
[14:54:42] <_joe_> :D
[14:54:58] <_joe_> I really would have a lot of reading to do before I can give you a +1
[14:55:30] <_joe_> rzl: I would say we can apply it with care to a single server first
[14:55:52] <_joe_> I would also check that pybal doesn't get confused by nginx restarting on the conf* servers
[14:55:54] yeah, that's about what I was thinking -- I'm less sure about exactly how to test it once it's there
[14:56:42] <_joe_> rzl: do it in eqiad, then try to read and write conftool values
[14:56:57] <_joe_> which is as simple as running "depool" on one mw appserver
[14:57:06] nod
[14:57:10] _joe_: ok, no worries, we can hold off on merging this
[14:57:11] that exercises everything we need?
[14:57:15] <_joe_> nemo-yiannis: you got a +1 from mateus, do you need me to merge the change or can you do it yourself?
[14:57:20] <_joe_> I would say just go on :)
[14:57:29] got it, thanks
[14:57:36] <_joe_> if it doesn't work, you can roll back :)
[14:57:42] sounds good
[15:02:25] _joe_: the module differences in -full compared to what's in -light are:
[15:02:33] STANDARD HTTP MODULES: Browser, Geo, Limit Connections, Limit Requests, Memcached, Referer, Split Clients
[15:02:40] OPTIONAL HTTP MODULES: Gzip, Gzip Precompression, Index, Log, Real IP, Slice, SSI, SSL, Stub Status, Thread Pool, WebDAV, Upstream.
[15:03:01] wrong paste.
[15:03:04] should be
[15:03:05] OPTIONAL HTTP MODULES: Addition, GeoIP, Gunzip, Image Filter, Stream, SSL Preread, Substitution, User ID, XSLT.
[15:03:15] THIRD PARTY MODULES: Auth PAM, DAV Ext, GeoIP2, HTTP Substitutions, Upstream Fair Queue.
[15:03:24] and some irrelevant stuff like mail/streaming modules
[15:19:19] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1434.eqiad.wmnet'] ` and were **ALL** successful.
[15:23:52] <_joe_> moritzm: yeah, we don't need any of that stuff, I think
[15:24:01] <_joe_> sorry, again in a meeting :)
[15:27:56] 10serviceops, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 2 others: Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (10JMeybohm) 05Open→03Resolved Thanks, closing then.
[15:54:34] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn)
[17:45:21] 10serviceops, 10SRE, 10decommission-hardware, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10wiki_willy) Thanks @Dzahn! >>! In T280203#7242737, @Dzahn wrote: > @wiki_willy You guys can remove all old mw app...
[18:44:05] 10serviceops, 10Platform Engineering, 10Shellbox, 10Wikimedia-production-error: InvalidArgumentException: Expected IJobSpecification objects - https://phabricator.wikimedia.org/T287623 (10thcipriani)
[18:53:43] 10serviceops, 10Platform Engineering, 10Shellbox, 10Wikimedia-production-error: InvalidArgumentException: Expected IJobSpecification objects - https://phabricator.wikimedia.org/T287623 (10Legoktm) > Seeing a couple of these in production over the last four hours, unsure of user impact Probably some jobs a...
[19:57:22] 10serviceops, 10Platform Engineering, 10Shellbox, 10Wikimedia-production-error: InvalidArgumentException: Expected IJobSpecification objects - https://phabricator.wikimedia.org/T287623 (10thcipriani) >>! In T287623#7244272, @Legoktm wrote: >> Tracing the entire request via reqId in logstash determined that...
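(For reference, the conftool read/write check _joe_ described earlier for validating the nginx-light change on a single eqiad conf* host could look roughly like the sketch below. This is a hedged outline, not the documented procedure: the hostname is illustrative, and the exact sudo rules and confctl invocation should be double-checked before use.)

    # On one eqiad mw appserver: write a pooled=no value through the conf* proxy
    sudo depool
    # From a cluster management host: read the value back to confirm reads still work
    sudo confctl select 'name=mw1435.eqiad.wmnet' get
    # Restore the server once the round-trip looks fine
    sudo pool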