[08:58:11] hello folks
[08:58:35] there are 3 api appservers in eqiad not pooled - https://config-master.wikimedia.org/pybal/eqiad/api-https
[08:58:48] I can't find maintenance for them though
[09:03:23] <_joe_> elukey: we noticed yesterday too, no one knows why they're like that
[11:25:16] serviceops, Phabricator, Release-Engineering-Team (Next): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (awight) >>! In T296022#7742371, @Dzahn wrote: >>>! In T296022#7742329, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-op...
[12:53:10] serviceops, Data-Engineering-Radar, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (akosiaris) >>! In T288375#7804357, @BTullis wrote: > Could we deploy the GeoIP databases to the kube-workers and then mount it...
[13:04:30] serviceops, Data-Engineering-Radar, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (BTullis) >>! In T288375#7810592, @akosiaris wrote: >>>! In T288375#7804357, @BTullis wrote: >> Could we deploy the GeoIP datab...
[14:37:53] serviceops, Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (akosiaris) >>! In T291707#7401997, @Mvolz wrote: > >>> >>> In terms of a get endpoint, would swagger docs suffice? I started to do something like that but never got around to finishing it :/...
[14:58:02] serviceops, GitLab (Infrastructure): GitLab minor version upgrade: 14.9.x - https://phabricator.wikimedia.org/T304622 (Jelto) This will happen tomorrow/Tuesday due to scheduling conflicts.
[15:01:26] running 5m late this morning, sorry
[15:15:12] serviceops, decommission-hardware: decommission kubernetes200[1-4] - https://phabricator.wikimedia.org/T303045 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `kubernetes[2001-2004].codfw.wmnet` - kubernetes2001.codfw.wmnet (**PASS**) - Downtimed host on...
[15:15:46] serviceops, decommission-hardware: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `kubernetes[1001-1004].eqiad.wmnet` - kubernetes1001.eqiad.wmnet (**PASS**) - Downtimed host on...
[15:30:47] serviceops, Release-Engineering-Team, Scap: Deploy Scap version 4.5.0 - https://phabricator.wikimedia.org/T304134 (jnuche) @Dzahn this task is not getting any love and I see you rolled out the two previous versions. Could we assign this to you? Failing that, can you please help get it unblocked?
[15:37:55] serviceops, Infrastructure-Foundations, SRE, Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (herron) Removing from the sre access request queue while the details of the request are being clarified. Please r...
[15:38:52] _joe_: $ sudo cumin 'A:stretch and P{C:envoyproxy}' => doc1001.eqiad.wmnet,ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet,restbase-dev[1004-1006].eqiad.wmnet,webperf2001.codfw.wmnet,webperf1001.eqiad.wmnet
[15:39:15] <_joe_> so yeah the only important thing is ores
[15:39:22] <_joe_> and we can blame elukey for that
[15:39:26] ye
[15:41:22] definitely
[15:42:00] Moritz is helping us in porting python 3.5 on buster-wikimedia (I know it is sad), after that we should be able to migrate to buster on all ORES nodes
[15:42:18] <_joe_> ahha
[15:42:47] <_joe_> can we forward port COBOL-CICS as well if it helps
[15:43:02] <_joe_> and smitty
[15:44:51] <_joe_> moritzm: ^^ that's doable right to help our ML friends
[15:51:07] yeah I know it is sad
[15:51:34] this is a test to avoid changing all deps to python 3.7 since it may cause a lot of troubles for us, if it fails we'll try to upgrade ORES to 3.7
[15:57:57] <_joe_> elukey: yeah I get it
[15:58:10] <_joe_> I'm just having cheap fun at your expenses
[16:00:45] _joe_ well deserved, I feel really ashamed
[16:01:24] <_joe_> hey ores you inherited
[16:01:30] <_joe_> you should be ashamed of istio
[16:04:57] yes that too, but I can share the burden with jayme
[16:18:41] serviceops, ChangeProp, SRE, envoy, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (herron) p:Triage→Medium
[16:21:28] serviceops, SRE, Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (herron) p:Triage→Medium
[17:00:50] serviceops, Release-Engineering-Team: docker-report-releng failing on multiple image tags because of certificate validation error - https://phabricator.wikimedia.org/T304875 (JMeybohm)
[17:02:10] serviceops, MediaWiki-extensions-PropertySuggester, Wikidata, wdwb-tech, Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (Michaelcochez) @Joe We now made the changes to use the bullseye distribution and the provided image with go instal...
[17:10:54] _joe_, jayme: helmfile diffs as promised https://www.irccloud.com/pastebin/Pq33Fd4n/diffs.txt
[17:11:14] that's just eqiad, but codfw and staging are similar
[17:12:10] <_joe_> shellbox is not a huge issue and I can take care of it
[17:12:15] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm) >>! In T300324#7801057, @RLazarus wrote: > Hmm, the 1.21.1 build didn't work out of the box. Running `build-envoy-deb buster future` got me this: > > `...
[17:12:23] <_joe_> the rest... apple search I suppose should be ok?
[17:12:28] <_joe_> but I have no idea tbh
[17:13:14] changeprop is fun :/ it also has some undeployed resource limit changes in codfw, that's the exception
[17:13:51] I don't think any of these individual changes is likely to be a huge deal, it's just that we should figure out a backstop process in general
[17:13:57] citoid should be fine as well. I think I've bumped statsd-exporter at some point
[17:16:00] rzl: the checksum/secrets is a thing...IIRC it is not guaranteed that the values always end up in the same order or something stupid like that. So sometimes there is a diff in checksum while there is no diff in content
[17:16:39] haha cool
[17:16:48] yeah, so statsd was me - sorry :/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/762463
[18:26:46] serviceops, Phabricator, Release-Engineering-Team (Next): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Dzahn) Thank you @awight :) Gotcha! So for now, even without migrating, it will still be possible to push to the repo, just not via ssh. It's s...
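Editor's note on the 17:16:00 checksum/secrets remark: the effect described there (a checksum that changes although the content does not) is what you would expect whenever a hash is taken over a serialization whose key order is not fixed. The sketch below is only an illustration of that general mechanism, not the way the deployment charts or helmfile actually compute their checksum; the function names and the sample values are hypothetical.

```python
import hashlib
import json


def naive_checksum(values: dict) -> str:
    """Hash the values as serialized in insertion order (order-sensitive)."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def canonical_checksum(values: dict) -> str:
    """Hash a canonical serialization: keys sorted, fixed separators."""
    payload = json.dumps(values, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical secret values: same content, different key order.
a = {"user": "svc", "password": "example"}
b = {"password": "example", "user": "svc"}

print(a == b)                                          # content is equal
print(naive_checksum(a) == naive_checksum(b))          # order-sensitive hashes differ
print(canonical_checksum(a) == canonical_checksum(b))  # canonical hashes match
```

If the ordering explanation is right, a spurious checksum-only diff would disappear once the input is serialized canonically before hashing.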
[18:36:29] rzl: in case you get very bored: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774528 :D
[18:45:41] jayme: ooh, will look
[18:46:36] unfortunately it nicely passes all checks and is still broken :)
[18:49:05] but PS4 should work
[18:50:12] I'm off for today, ttyl o/
[18:51:59] 👋
[18:56:51] phab2002 just went offline .. then came back online again. caused a pybal alert because of that git-ssh service we want to shut down
[18:57:11] but server is fine and uptime not interrupted. more like cable / network outage
[18:57:21] currently checking if that is the case