[05:47:05] hi folks!
[05:47:39] I have depooled and powercycled restbase1027, the ping down alert was flapping and it hadn't published metrics in days
[05:48:06] after a powercycle the host seems ok, IIUC it was something related to cpu usage (racadm getsel etc.. all looked clean)
[05:48:22] I wasn't able to log in via tty on the mgmt console, so no idea what was wrong
[05:48:38] I'll wait for your green light before setting pooled=yes
[05:49:30] the other thing to check is https://phabricator.wikimedia.org/T344998
[05:50:21] I am ignorant about wikifunctions but there were some reverts related to network policies, afaics there is still something missing
[05:50:52] (I don't see the envoy sidecar configured to query the mw api for example, but there are calls to api-rw)
[05:56:42] serviceops, Abstract Wikipedia team, SRE, Wikifunctions, and 2 others: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (elukey) Answering to myself - I see in admin_ng that https://gerrit.wikimedia....
[05:57:43] jayme: --^
[07:10:21] elukey: welcome back
[07:25:47] <3
[07:38:30] serviceops, CX-cxserver, Language-Team, RESTBase Sunsetting: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (Nikerabbit)
[08:21:00] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (JMeybohm)
[08:22:11] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (JMeybohm) p:Unbreak!→High Thanks @eluke...
[08:26:02] serviceops, docker-pkg, Release Pipeline (Blubber): Fix how we keep docker-pkg based images up to date - https://phabricator.wikimedia.org/T344478 (hashar)
[08:33:54] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (elukey) Tested the use case outlined in the tas...
[08:42:52] elukey: I'm not sure exactly how restbase actually works, but shouldn't the cassandra service be ok before we repool it?
[08:48:35] Restarted cassandra
[08:48:46] waiting to see if it comes back up correctly
[08:50:18] Aug 28 08:48:55 restbase1027 cassandra[20090]: ERROR [main] 2023-08-28 08:48:55,192 JVMStabilityInspector.java:196 - Exiting due to error while processing commit log during initialization.
[08:50:25] Welp it's not coming back up without help
[08:51:31] claime: o/ didn't check cassandra yet, what instance did you restart? In theory they were all restarted with the reboot
[08:51:39] -a
[08:53:20] ah lovely
[08:53:31] serviceops, docker-pkg, Release Pipeline (Blubber): Fix how we keep docker-pkg based images up to date - https://phabricator.wikimedia.org/T344478 (hashar) I slightly amended the task description. The thing I like with `docker-pkg` is that it effectively freeze the images parenting which gives some k...
[08:55:22] so my reboot probably left the commit log in an unclear state (or it was already in a weird condition due to the heavy load etc..)
[08:56:06] b and c are good, for -a I'd open a task and wait for eevans
[08:57:10] claime: in theory we should just remove the file, restart and then run a repair of the node
[08:57:30] hnowlan: o/ --^ wdyt?
[08:57:53] hugh isn't available this morning
[09:02:46] ack ack
[09:05:45] claime: I'd try the above, move the file in my home, restart and launch a full repair
[09:05:51] can't think of another procedure
[09:06:20] yeah
[09:06:34] it'll launch the repair itself if it doesn't have the file anyways right?
[09:07:42] I am not 100% sure about it
[09:09:18] opening a task
[09:10:43] ack, thanks
[09:12:09] serviceops, Data-Persistence: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey)
[09:12:24] claime: --^
[09:12:27] serviceops, Data-Persistence: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey)
[09:14:11] thanks <3
[09:14:40] claime: I can proceed if people are ok, otherwise we can wait for Eric
[09:15:07] jayme, effie, akosiaris, thoughts ? ^
[09:16:09] claime: reading
[09:17:20] do we have any services being unstable atm due to this?
[09:18:27] just an instance down in theory
[09:19:14] I don't know what kind of queries we do from clients, but even quorum reads should work since replication is always 3 (IIRC)
[09:25:57] I would suggest we leave things as they are, 1027 depooled as it is and wait for Eric
[09:26:23] no service depending on restbase is complaining
[09:26:29] (unless I am missing something)
[09:26:44] sure
[09:28:15] I've acked the alarms for now
[09:28:45] I have another question, and forgive me if this is going to be confusing
[09:28:57] https://www.irccloud.com/pastebin/jKMUhaLE/
[09:29:12] I suppose there is something we are missing here right
[09:30:03] ok ok I think I found the pebcak
[09:30:26] nono cassandra.service should not run, only the -{a,b,c} instances
[09:30:29] ok it is cassandra-a
[09:30:37] yeah cool, I was reading wikitech
[09:30:49] and this one
[09:30:50] nodetool status
[09:30:56] does not work either, sigh
[09:31:41] yep we have nodetool-{a,b,c}
[09:32:09] ok TIL
[09:33:07] not sure what is the future of the instances, but there are also a ton of tools written by Eric to cycle through them etc..
[09:33:15] (all on wikitech iirc)
[09:34:29] I am trying to understand the state of things for the time being
[09:37:23] FYI, kubetcd2005 will briefly go down for a Ganeti node reboot
[09:41:57] ack
[09:48:13] it's back
[10:05:55] I am going to deploy changeprop for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/948136
[10:07:38] Godspeed
[10:11:52] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (jijiki) @Jdforrester-WMF it will be extremely u...
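Note on the 09:19:14 remark that quorum reads should still work with one instance down: this can be checked with a little arithmetic. The following is only a minimal sketch, assuming a replication factor of 3 (stated above only as "IIRC") and the usual Cassandra QUORUM size of floor(RF / 2) + 1. The function names are illustrative and are not part of any tooling mentioned in the channel.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM request: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def quorum_available(replication_factor: int, replicas_down: int) -> bool:
    """True if the replicas still up are enough to satisfy QUORUM."""
    replicas_up = replication_factor - replicas_down
    return replicas_up >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumption: replication factor of 3, as recalled in the log
    for down in range(rf + 1):
        ok = quorum_available(rf, down)
        print(f"RF={rf}, replicas down={down}: quorum satisfiable = {ok}")
```

Under these assumptions a quorum is 2 of 3 replicas, so a single failed instance leaves quorum reads satisfiable, while a second failed replica for the same data would not. This says nothing about which consistency level the clients actually use, which the log leaves open.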
[10:56:29] I'll silence restbase1027 until Eric gets here
[10:56:38] It's depooled but flapping on service probes
[11:03:50] serviceops, CX-cxserver, Language-Team, RESTBase Sunsetting: Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (Nikerabbit)
[11:04:48] serviceops, RESTBase Sunsetting, Epic, Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (Nikerabbit)
[11:06:35] serviceops, CX-cxserver, RESTBase Sunsetting, Language-Team (Language-2023-July-September): Make cxserver call parsoid endpoints on MediaWiki, instead of going through RESTbase - https://phabricator.wikimedia.org/T344982 (Nikerabbit) p:Triage→Medium
[11:51:11] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF) FYI @Marostegui, who merged https://gerrit.wikimedia.org/r/c/operations/dns/...
[11:51:49] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF) Resolved→Open This issue's happening again, for the same reasons ({d...
[11:53:30] serviceops, docker-pkg: Attach git info metadata to docker images - https://phabricator.wikimedia.org/T345070 (fgiunchedi)
[12:18:27] serviceops, Add-Link, GrowthExperiments-NewcomerTasks, SRE, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (Urbanecm_WMF) Open→Resolved And service's up again.
[12:25:57] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (Urbanecm_WMF) Hi all, is there any update on this please? `...
[12:36:51] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (Clement_Goubert) Resolved→Open We are still experiencing issues, some logs are getting escaped into single byte ISO-8859-1 values instead of the double-byte utf-...
[13:02:20] serviceops, Data-Persistence: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) @Eevans do you think it is a safe plan? If so I'll try to execute it :)
[13:12:25] FYI, kubetcd2004 will briefly go down for a Ganeti node reboot
[13:21:29] it's back
[13:25:45] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey) Open→Resolved a:elukey Going to close this task since the bulk of the work is done, and I'll open new ones to fine-tune kafka main's status.
[13:26:46] serviceops, MW-on-K8s, Observability-Logging, SRE: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 (kamila) Open→Resolved The message size limit is increased to 16k. Longer messages are very rare (< 1 per hour), so I think this is acceptable....
[13:30:05] serviceops: Improve kafka main 's partitions usage and leaders using topicmappr's rebalance - https://phabricator.wikimedia.org/T345077 (elukey)
[13:32:55] serviceops: Improve kafka main 's partitions usage and leaders using topicmappr's rebalance - https://phabricator.wikimedia.org/T345077 (elukey) I have built both binaries needed (`metrics-fetcher` and `topicmappr` version 4.2.1) on my laptop on a pristine Debian bullseye container, and uploaded them to kafk...
[13:39:20] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (akosiaris) >>! In T340843#9123222, @Urbanecm_WMF wrote: > H...
[13:48:19] serviceops: Improve kafka main 's partitions usage and leaders using topicmappr's rebalance - https://phabricator.wikimedia.org/T345077 (elukey) Then: ` elukey@kafka-main2001:~/T345077$ ./topicmappr rebalance --zk-addr "conf2005.codfw.wmnet:2181" --brokers -2 --topics '.*' --optimize-leadership --partition-...
[13:48:32] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (Jdforrester-WMF) >>! In T344998#9122506, @JMeyb...
[13:57:09] serviceops, Data-Persistence, Patch-For-Review: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (Urbanecm_WMF) That sounds promising! Thank you very much fo...
[14:12:15] folks I created https://phabricator.wikimedia.org/T345077 to further refine the status of kafka main
[14:12:28] if you are ok I'd start with another round of rebalance in main-codfw
[14:12:43] (at the end I hope to write some good wikitech documentation explaining all the use cases)
[14:13:25] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (Eevans) >>! In T345058#9123316, @elukey wrote: > @Eevans do you think it is a safe plan? If so I'll try to execute it :) Given what's currently hosted on this...
[14:17:14] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (elukey) @Eevans ack! When you have a moment could you add some info about when it is good or not to start a repair (full or partial)? If there is something alr...
[14:25:10] serviceops, Cassandra: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 (Eevans) >>! In T345058#9123770, @elukey wrote: > @Eevans ack! When you have a moment could you add some info about when it is good or not to start a repair (fu...
[14:26:02] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (JMeybohm) >>! In T344998#9123566, @Jdforrester-...
[15:23:01] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (Jdforrester-WMF)
[16:20:48] serviceops, Maps, Regression, Russian-Sites: Vandal attack on OpenStreetMap affected Wikimedia Maps - https://phabricator.wikimedia.org/T344753 (Seddon) p:Unbreak!→Medium
[20:02:42] serviceops, Abstract Wikipedia team, Wikifunctions, Patch-For-Review, Wikimedia-production-error: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (Jdforrester-WMF) a:Jdf...
[22:44:15] serviceops, SRE, ops-eqiad: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Papaul) a:VRiley-WMF