[07:04:28] serviceops, SRE: deneb.codfw.wmnet root partition is full - https://phabricator.wikimedia.org/T287222 (MoritzMuehlenhoff)
[07:14:04] hello folks
[07:14:09] quick one for knative
[07:14:10] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/707408/1/helmfile.d/admin_ng/helmfile.yaml
[07:14:22] I have no idea if the merge order is ok for knative
[07:14:41] I don't see anything problematic but my experience with helmfile is close to zero
[07:14:44] :D
[07:15:04] (I tried to sync but helmfile doesn't find anything, it should be due to the above)
[07:23:58] <_joe_> elukey: uhm wouldn't that apply that file to all clusters though?
[07:26:04] _joe_ ah ok this is a good point, I thought that with helmfile -e env -l etc.. we'd be fine, but one could also run helmfile sync without limits and expect it to work
[07:26:27] <_joe_> more to my point, even if you run it with -e
[07:26:37] <_joe_> that file will be included anyways
[07:26:42] ah snap ok
[07:26:50] <_joe_> unless I'm missing something
[07:27:07] <_joe_> but that part of helmfile.d is not something I'm overly familiar with.
[07:30:10] lovely
[07:30:21] I'll try to dig a bit more into it
[08:10:25] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: The mediawiki-webserver image should only log in json format - https://phabricator.wikimedia.org/T285384 (Joe) a:jijiki→Joe
[10:09:40] serviceops, Lift-Wing, Kubernetes, Machine-Learning-Team (Active Tasks): Discussion: dedicated directory in the deployment-chart repository for ML services - https://phabricator.wikimedia.org/T286791 (elukey) It turns out that even for the `admin_ng` dir it is a problem, see for example early att...
[11:08:06] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 4 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (LSobanski) The problem happened again - see T287362. Could this task be reviewed in terms of priority?
[11:10:34] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 5 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Peachey88)
[11:43:15] serviceops, CX-cxserver, Wikidata, wdwb-tech, Language-Team (Language-2021-July-September): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (KartikMistry) Moving to done. Error...
[12:31:15] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 5 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (mark) p:Medium→High Given that the underlying problem that this change might help with has already caused multiple full outages (all wikis...
[13:44:02] serviceops, observability, User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (elukey) Resolved→Open Hi! While investigating a problem on kubernetes nodes I found out that in all clusters where a kubelet runs we have the following repeated over a...
[13:44:27] I reopened --^, found the same issue again while checking logs
[13:58:39] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (sbassett)
[14:51:28] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (JMeybohm) I went through various config options for the different components involved, but the only one yielding significant impact is rate...
[14:58:00] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Urbanecm) >>! In T263220#7236031, @mark wrote: > Given that the underlying problem that this change might help with has already caused multiple fu...
[15:08:48] serviceops, MW-on-K8s, Release Pipeline, Patch-For-Review: Evaluate Dragonfly for distribution of docker images - https://phabricator.wikimedia.org/T286054 (dancy) Nice work!
[15:22:52] serviceops, SRE-swift-storage, Wikidata, Wikidata-Query-Service, wdwb-tech: Find a way to make swift Tempauth usable behind envoy - https://phabricator.wikimedia.org/T286935 (MPhamWMF) Open→Declined
[15:25:19] serviceops, Infrastructure-Foundations, SRE, SRE-tools: Documentation updates in decom workflow - https://phabricator.wikimedia.org/T287388 (RLazarus) p:Triage→Low
[15:35:11] serviceops, SRE, Services, Toolhub, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (wkandek)
[15:39:19] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Legoktm) @Urbanecm I'm on clinic duty this week and just so happen to have PoolCounter experience so let's find a time to pair on this.
[17:58:10] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (jcrespo) @Bawolff I think you wrote the comment at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/645994/2/wmf-config/PoolCounterS...
[18:40:59] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Bawolff) >>! In T263220#7237177, @jcrespo wrote: > @Bawolff I think you wrote the comment at https://gerrit.wikimedia.org/r/c/operations/mediawiki...
[19:30:36] serviceops, DynamicPageList (Wikimedia), PoolCounter, SRE, and 8 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Legoktm) @Urbanecm and I (plus lurker @majavah :)) spent an hour today trying to roll this out with mediocre success, but not enough confidence fo...
[20:15:35] serviceops, observability, User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (colewhite) >>! In T210137#7236247, @elukey wrote: > > Is it possible that we have another occurrence of the same problem? On all the nodes that I checked I found: > Yes, th...
[20:16:31] serviceops, observability, User-jijiki: Handle unknown stats in rsyslog_exporter - https://phabricator.wikimedia.org/T210137 (colewhite) a:colewhite→None
[23:17:33] serviceops, SRE, Traffic, Patch-For-Review, User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (Legoktm) p:Triage→Medium
[23:18:55] serviceops, SRE, Traffic, Patch-For-Review, User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (Legoktm) @jijiki is {T286482} a duplicate of this one? To me it looks like both tasks have basically the same checklist
[23:56:21] serviceops, SRE: php7.2-fpm_check_restart should be resilient to php7adm error pages - https://phabricator.wikimedia.org/T285593 (Legoktm) p:Triage→Medium