[01:26:54] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Papaul) Open→Resolved a:Papaul @Jhancock.wm what i did for the provision cookbook to PASSws to reset the IDRAC password and re-run the cookbook again @Dzahn the host is backup .
[07:42:28] serviceops, MW-on-K8s, Patch-For-Review: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 (RLazarus) @joe @JMeybohm That's a lot of code review at once, across two tasks -- I posted it all for context, but no expectation you'll have time to look at all of it immediat...
[08:21:42] serviceops, SRE, Data-Platform-SRE (2023/24 Q3 Milestone 1), Discovery-Search (Current work): SUP: Partition update_pipeline kafka topic - https://phabricator.wikimedia.org/T354064 (pfischer)
[10:56:16] https://wikitech.wikimedia.org/wiki/Logstash#Kubernetes says "All a kubernetes service needs to do is log in a JSON structured format (e.g. bunyan for nodejs services)". But Bunyan emits a JSON object with `msg` and not `message` as the key. Logstash expects `message`. Maybe I am missing something?
[11:12:46] kostajh: ecs or the legacy logstash ?
[11:13:13] ecs is pretty picky about names of fields
[11:13:14] I am not sure. What is ECS?
[11:13:32] I want to improve logging for ipoid. All messages currently have an empty "message" field.
[11:13:51] elastic common schema
[11:13:59] (T351430 for context)
[11:14:01] there's a number of tasks in phabricator about moving to it
[11:14:27] alright.
So in the short term, I should go with this hack to rename `msg` to `message` https://github.com/trentm/node-bunyan/issues/462#issuecomment-339715288
[11:14:29] it's easy to see in https://logstash.wikimedia.org/app/discover where there is logstash-* and ecs-*
[11:14:58] or use https://github.com/pinojs/pino which makes it easier to use `message` field without a hack
[11:15:24] kostajh: whatever suits you better
[11:16:05] that line in that wikitech btw is before trying to bring some order to field names, I 'll amend
[11:16:14] ok, ty
[11:16:59] kostajh: for pertinent to our infra information on ECS, see https://wikitech.wikimedia.org/wiki/Logstash/Common_Logging_Schema
[11:20:37] serviceops, CX-cxserver, RESTBase Sunsetting, Language-Team (Language-2024-January-March), Patch-For-Review: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (Nikerabbit) a:santhosh→KartikMistry
[11:31:39] akosiaris: so for ipoid, we are not using ecs (yet?) ?
[11:32:11] https://www.elastic.co/guide/en/ecs-logging/nodejs/current/pino.html seems like it would do what I want
[11:32:41] serviceops, SRE, ops-codfw: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Clement_Goubert) mw2394 squared up and repooled, set back in active in Netbox
[11:38:05] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c70b0979-84e8-4fe7-8682-45d50615a587) set by cmooney@cumin1002 f...
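For reference, the `msg` to `message` rename discussed above could be sketched roughly as below. This is only an illustration, not the linked bunyan workaround and not pino's API: `renameMsgToMessage` is a hypothetical helper, and the sample record only imitates the general shape of bunyan output with made-up values. It assumes each log line is a single JSON object.

```typescript
// Hypothetical helper (not part of bunyan or pino): rewrite one JSON log line
// so that a bunyan-style `msg` key is emitted as `message` instead.
type LogRecord = Record<string, unknown>;

function renameMsgToMessage(line: string): string {
  const record = JSON.parse(line) as LogRecord;
  // Only rename when `msg` exists and `message` is not already set,
  // so an existing `message` field is never overwritten.
  if ("msg" in record && !("message" in record)) {
    const { msg, ...rest } = record;
    return JSON.stringify({ ...rest, message: msg });
  }
  return line;
}

// Sample record shaped like bunyan output; the field values are made up.
const sample = JSON.stringify({ name: "ipoid", level: 30, msg: "feed imported", v: 0 });
const renamed = renameMsgToMessage(sample);
console.log(renamed);
```

Doing the rename on the serialized record keeps the sketch independent of any particular logger; a logger that can write `message` natively avoids the extra parse and stringify on every line.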
[11:40:52] kostajh: I think if you use that, ipoid is gonna go to ecs
[11:40:57] I am looking right now at the filter
[11:42:00] ok
[11:42:43] so, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/logstash/filters/17-filter_ecs.conf seems to imply that
[11:44:58] MR with example output in comment https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/202
[11:47:52] if only gitlab wasn't logging me out so often
[12:43:11] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=85a51908-5fa9-4a59-9d3d-bb7c8e8369e2) set by cgoubert@cumin2002 for 7 days, 0:00:00 on 3 host(s) and their service...
[13:15:38] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney) Ok I have made the Netbox changes and pushed the resulting config to lsw1-b8-codfw now, and the port it up (note the por...
[13:23:59] Can somebody take a look at this patch? I discovered the issue while double checking the diff output of helmfile in my last deployment. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/989141
[13:24:05] cc hnowlan ^
[13:32:41] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney)
[14:11:22] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (akosiaris) >>! In T352883#9445709, @cmooney wrote: > Ok I have made the Netbox changes and pushed the resulting config to lsw1-b8...
[14:14:38] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (ayounsi) > This is the thing we need to get fixed, I see Yep, that's {T352893} and its 2 CRs.
[14:34:47] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) ` # kubectl describe nodes kubestage2002.codfw.wmnet | grep -A3 Addresses Addresses: InternalIP: 10.192.22.13...
[14:35:00] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (akosiaris) >>! In T352883#9445907, @ayounsi wrote: >> This is the thing we need to get fixed, I see > Yep, that's {T352893} and i...
[14:44:26] nemo-yiannis: ack
[14:51:24] hnowlan: thanks!
[15:44:34] serviceops, Content-Transform-Team-WIP, Data-Persistence: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657 (Jgiannelos)
[15:45:19] hnowlan: Any ideas about this one? ^
[15:45:42] Mostly the part that we switched away of RESTBase and we see more RESTBase errors ?
[15:49:38] nemo-yiannis: same url for all? IS this a health check?
[15:50:32] I don't think so, the URL is /ia.wikipedia.org/v1/page/html/Appendice%3ALista_de_parlatores_de_esperanto/666393 so i doubt it was chosen as a healthcheck
[15:50:42] nemo-yiannis: could you link the graphs where you're seeing the increases/decreases?
[15:50:46] sure
[15:52:25] serviceops, Content-Transform-Team-WIP, Data-Persistence: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657 (Jgiannelos)
[15:52:49] This seems quite like what we saw with wikifeeds when we started routing directly to it, hidden errors getting exposed when switching over
[15:53:01] obviously different sources
[15:55:35] Is ia.wikipedia.org routed via rest-gateway ?
[15:55:58] no
[15:56:01] We had issues before with domains that are not valid anymore but apps used an old version of the domain
[15:56:29] Sorry I meant, ia.wikipedia.org wikifeeds
[15:56:33] well it depends on what you mean routed via
[15:57:14] ah in that case yes, all /api/rest_v1/feed/.* URLs go via the gateway
[15:57:21] ok
[16:04:35] i tried it locally and i am getting the same cassandra related issue because wikifeeds asks for the summary of this page
[16:04:41] this makes more sense
[16:05:05] I guess restbase was suppressing those errors with a generic 404 because it failed internally
[16:05:16] but now we get the actual error (?)
[16:09:33] weirdly this error happens when trying to *update* a table in cassandra ?!
[16:09:45] this page is huge with a lot of photos
[16:10:09] maybe this was never saved before
[16:10:55] so the GET on restbase level is actually an update if it doesnt exist
[16:13:11] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) I 've unloaded the wdat_wdt module and issued one more reboot on mw1378 (the tests in mw1349 have led nowhere) And previously I would see ` [ OK ] Reached tar...
[17:00:19] serviceops, Content-Transform-Team-WIP, Data-Persistence: Wikifeeds increase on 500 errors after switchover to core page HTML - https://phabricator.wikimedia.org/T354657 (Jgiannelos) After reproducing this locally it looks like the restbase req is triggered because wikifeeds queries /page/summary whi...
[17:05:18] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) Some more rough data: That container-shim message led me to track and find which containerd-shim we were talking about. Some trial and error [1] afterwards it a...
[17:05:41] serviceops, MW-on-K8s, Patch-For-Review: Reboot issues for mw13[77-83].eqiad.wmnet - https://phabricator.wikimedia.org/T354413 (akosiaris) With @kamila, we 'll dive more into this tomorrow so we can come up with a recommendation.
[17:45:55] serviceops, MediaWiki-DjVu, Shellbox, Structured-Data-Backlog, and 4 others: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 (hnowlan) After moving these tasks to shellbox and pointing the jobqueue back to k8s jobrunners, these errors have no...
[20:02:43] serviceops, Parsoid (Tracking): Upgrade nodejs on testreduce1001 - https://phabricator.wikimedia.org/T345220 (MoritzMuehlenhoff) >>! In T345220#9443318, @Arlolra wrote: > https://parsoid-rt-tests.wikimedia.org/ now looks correct > > I guess the last step here is to decommission 1001 Ack, I'll remove te...
[22:24:50] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 2 others: Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (EBernhardson) After reviewing the option here, along with reviewing the current state of the mw...