[04:39:01] serviceops, Platform Engineering, Wikibase change dispatching scripts to jobs: Add support for xhgui for jobs - https://phabricator.wikimedia.org/T292382 (Ladsgroup)
[05:29:00] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) @Joe did so, thanks.
[05:32:55] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) I run an initial test running some 1000s of production URLs. It appears that we are about to hit max_accelerated_files (curren...
[07:00:50] Hi, I'm back o/
[07:04:02] o/
[07:05:22] <_joe_> oh noes
[07:05:34] * _joe_ dumps 57 tasks onto jayme
[07:38:07] serviceops, SRE, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Joe) p:Triage→Medium
[08:03:58] serviceops, SRE, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (JMeybohm) You think we can piggyback the necessary helmfile.yaml changes for the helm3 migration (T251305) with this @Jelto ?
[09:27:50] serviceops, GitLab (Infrastructure), Patch-For-Review, Release-Engineering-Team (Next), User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (Jelto) I prepared `gitlab-ce` `14.3.2-ce.0` and `gitlab-runner` `14.3.2` on apt host
[09:37:50] good morning, checking something completely unrelated I noticed that we have some SVC certificate expiring in ~6 months. Is something we get checked by Icinga or should I open a task to not forget about them?
[09:38:41] those are certs signed by the puppetmaster CA to be clear, things like api.svc.codfw.wmnet
[09:47:44] <_joe_> volans: IIRC we have a process checking all certs created with cergen cc jbond
[09:48:31] <_joe_> volans: but I'll check specifically, thanks
[09:48:56] <_joe_> it's possible api.svc and appservers.svc predate cergen (they do in fact) and don't get checked
[09:49:03] created in 2017
[09:49:39] <_joe_> where are you checking?
[09:49:49] _joe_: I've exported the expiring data to /tmp/expiring-certs on puppetmaster1001 if you want to have a look
[09:50:00] signed certs on the puppetmaster
[09:50:07] didn't yet check if they are in use or not
[09:50:12] <_joe_> ok so not the ones running in prod
[09:50:14] <_joe_> ack
[09:50:22] <_joe_> it's possible it's not used anymore, I'll check
[09:51:07] <_joe_> rendering I'm pretty sure it's not used
[09:52:08] if I connect with s_client to api.svc.codfw.wmnet I get the same expiration
[09:52:17] _joe_: i wrote a check script for icinga but it stalled in gerrit https://gerrit.wikimedia.org/r/c/operations/puppet/+/552260/.
however i did update all the icinga check_https checks so that they also check for certificate expiry, although the warn/crit times are 7/10 days
[09:52:40] <_joe_> jbond: oh ok thanks
[09:53:02] jbond: check_http_lvs doesn't seem to use the -C option
[09:53:04] I did had check that
[09:53:07] <_joe_> volans: yes api and appservers are in fact the two certs that do have to be renewed
[09:53:33] s/did had/did/
[09:53:49] s/all/most/ i can't remember why i left the lvs ones originally but will create a CR today to update them as well
[10:23:52] fwiw https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=api.svc.eqiad.wmnet&service=LVS+api-https+eqiad+port+443%2Ftcp+-+MediaWiki+API+cluster-+api.svc.eqiad.wmnet+IPv4+%23page does report the expiration so I guess it will alert, I'm just wondering if for those more critical endpoints 7/10 days might be not enough :)
[10:27:36] serviceops, SRE, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Jelto) Should be possible and sounds like a good idea to piggyback this with T251305 if we are going to re-deploy all services anyway.
[10:50:02] <_joe_> volans: it's enough but yeah I'd probably prefer to have more warning time :)
[10:53:19] :)
[10:53:30] volans: _joe_: CR is https://gerrit.wikimedia.org/r/c/operations/puppet/+/725766 comment on there what times you would prefer. the 7,10 times were based on the default pki expiry of 4 weeks and just to get something in place but happy to change to whatever
[11:59:36] serviceops, Wikifeeds, Sustainability (Incident Followup): Clarify in Wikifeeds documention the request flows - https://phabricator.wikimedia.org/T291912 (hnowlan) >>! In T291912#7383607, @akosiaris wrote: > @elukey, @hnowlan Let me know if this is more clear now. A lot clearer, thanks!
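Note on the certificate-expiry thread above (09:37 to 10:53): the open question there is how many days of warning an expiry check should give for endpoints such as api.svc.codfw.wmnet. The following is a minimal Python sketch of that threshold logic only. It is not the Icinga check, nor the script in the stalled Gerrit change. The function names, the 30/14-day thresholds in the usage note, and the optional `cafile` parameter are assumptions made for illustration; the standard-library calls (`ssl.cert_time_to_seconds`, `ssl.create_default_context`, `getpeercert`) are real.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, cafile=None):
    """Return the peer certificate's notAfter string, e.g. 'Apr  4 09:50:00 2022 GMT'.

    cafile is a hypothetical path to the signing CA bundle; certs signed by an
    internal CA will fail verification without it.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Days between `now` (timezone-aware) and the certificate's notAfter time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(days_left, warn_days, crit_days):
    """Map remaining days to a monitoring-style state (thresholds are caller-chosen)."""
    if days_left <= crit_days:
        return "CRITICAL"
    if days_left <= warn_days:
        return "WARNING"
    return "OK"
```

Usage, under the same assumptions: `classify(days_until_expiry(fetch_not_after("api.svc.codfw.wmnet", cafile="/path/to/ca.pem"), datetime.now(timezone.utc)), warn_days=30, crit_days=14)`. Wider thresholds buy more lead time for certs that need a manual renewal, which is the concern raised at 10:23 and 10:50.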
[13:20:14] serviceops: Release a 1.16 tag of docker-registry.wikimedia.org/golang - https://phabricator.wikimedia.org/T283425 (Addshore)
[13:20:18] serviceops: Release a 1.16 tag of docker-registry.wikimedia.org/golang - https://phabricator.wikimedia.org/T283425 (Addshore)
[14:16:44] serviceops, Wikifeeds, Sustainability (Incident Followup): Clarify in Wikifeeds documention the request flows - https://phabricator.wikimedia.org/T291912 (akosiaris) Open→Resolved a:akosiaris Cool, I 'll resolve then, feel free to reopen though!
[14:32:25] serviceops, GitLab (Infrastructure), Release-Engineering-Team (Next), User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (Jelto) upgrade of `gitlab2001` to `14.3.2-ce.0` was successful.
[14:41:36] serviceops, GitLab (Infrastructure), Release-Engineering-Team (Next), User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (brennen) Open→Resolved a:brennen
[15:59:03] I had to do a yucky feeling thing, () but I now have the Toolhub web crawler running from inside the eqiad k8s cluster.
[16:58:52] <_joe_> bd808: I'm not sure there is a better way to do it, but admittedly I never studied Cronjobs properly
[17:00:35] _joe_: there are some different hacks that I've seen, but they are all pretty ugly. I think the proper "fix" here will be to actually build the celery cluster I originally planned, but the cronjob should get us by for a while.
[17:00:44] <_joe_> the alternative solution would be not having the crawler with all its sidecars as a cronjob, but rather having it listening to a queue, and having a cronjob that submits jobs to that queue
[17:00:50] <_joe_> eheh yes
[17:01:47] The planned work for Toolhub in Q3 will very likely result in other jobs we want to run and make the investment in celery make more sense
[17:17:24] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.2 - https://phabricator.wikimedia.org/T291095 (jijiki) >>! In T291095#7395126, @dancy wrote: >>>! In T291095#7390898, @jijiki wrote: >> @dancy it would be lovely if we can speed this up, right now we have `deploy1002` and `ma...
[17:24:39] <_joe_> bd808: tell me celery can now use kafka as a backend
[17:26:21] looks like the bug for that from 2014 is still open -- https://github.com/celery/kombu/issues/301
[17:30:11] <_joe_> sigh
[17:30:27] <_joe_> ok then maybe not celery :P
[17:31:07] yeah... we'll figure something out.
[17:31:33] <_joe_> it's a pity, I think they're overcomplicating things tbh
[17:31:49] <_joe_> celery has a very nice user interface though
[17:32:07] * bd808 randomly found https://github.com/joowani/kq
[18:44:23] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Pchelolo) Gosh this it's hard to parse what's going on here and the folk...
[18:47:26] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Reedy) a:holger.knust→None
[20:11:11] I guess since I can't ssh into kubernetes exec nodes or the logstash servers there really isn't much I can personally do to debug https://phabricator.wikimedia.org/T292099.
Unless somebody can clue me into other methods of investigation?
[20:27:11] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Pchelolo) > I'll first dry-run the uppercaseTitlesForUnicodeTransition.p...
[21:19:21] serviceops, MediaWiki-General, SRE, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Pchelolo) So, first I've dry-run the script ` foreachwiki uppercaseTitl...
[22:37:16] serviceops, Anti-Harassment, IP Info, SRE, Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (Dzahn) Tested whether we can download all the existing databases PLUS the new databases using the same license.....
[23:01:42] serviceops, Release-Engineering-Team: docker-reporter-releng-images => docker registry: status=3/NOTIMPLEMENTED - https://phabricator.wikimedia.org/T292485 (Dzahn)
[23:06:40] serviceops, Release-Engineering-Team: docker-reporter-releng-images => docker registry: status=3/NOTIMPLEMENTED - https://phabricator.wikimedia.org/T292485 (Dzahn) well.. just manually starting it fixed it for now: [deneb:~] $ sudo systemctl start docker-reporter-releng-images 23:02 <+icinga-wm> RECOV...
[23:07:05] serviceops, Release-Engineering-Team: docker-reporter-releng-images => docker registry: status=3/NOTIMPLEMENTED - https://phabricator.wikimedia.org/T292485 (Dzahn) p:Triage→Low
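Note on the 17:00 exchange about the Toolhub crawler: the alternative raised there is a long-running worker that listens to a queue, with a small scheduled job that only submits work to it. The pattern might look roughly like the Python sketch below. This is an illustration only, assuming an in-process `queue.Queue` as a stand-in for whatever broker would actually be used; `submit_jobs`, `consume`, `STOP`, and the job shape are hypothetical names, not Toolhub or celery code.

```python
import queue
import threading

# Sentinel telling the worker to shut down (hypothetical convention).
STOP = object()


def submit_jobs(q, targets):
    """What the scheduled job would do: enqueue one job per target and exit."""
    targets = list(targets)
    for target in targets:
        q.put({"target": target})
    return len(targets)


def consume(q, handle):
    """What the long-running worker would do: block on the queue and process jobs."""
    processed = 0
    while True:
        job = q.get()
        try:
            if job is STOP:
                return processed
            handle(job)
            processed += 1
        finally:
            # Mark the item done even if handle() raised, so q.join() can return.
            q.task_done()


if __name__ == "__main__":
    q = queue.Queue()
    worker = threading.Thread(
        target=consume, args=(q, lambda job: print("crawling", job["target"]))
    )
    worker.start()
    submit_jobs(q, ["tool-a", "tool-b"])  # made-up targets
    q.put(STOP)
    worker.join()
```

The split keeps the scheduled part trivial (it only enqueues), while the worker and its sidecars stay up between runs, which is the trade-off discussed in the channel.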