[04:23:07] serviceops, Shellbox: Migrate Shellbox to PHP 7.4 - https://phabricator.wikimedia.org/T295489#9556647 (tstarling) Open→Resolved a:taavi I assume this is done.
[11:14:41] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558439 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2312.codfw.wmnet with OS bullseye
[11:20:01] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2313.codfw.wmnet with OS bullseye
[11:20:08] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558459 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2367.codfw.wmnet with OS bullseye
[11:20:15] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558460 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2369.codfw.wmnet with OS bullseye
[11:20:21] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558461 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye
[11:20:28] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye
[11:29:42] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye executed with erro...
[11:29:50] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558485 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye executed with erro...
[11:30:12] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558489 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye
[11:30:22] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye
[11:51:59] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558562 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2312.codfw.wmnet with OS bullseye completed: - mw2312 (**PASS**) - Downt...
[11:55:38] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558580 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2313.codfw.wmnet with OS bullseye completed: - mw2313 (**PASS**) - Downt...
[11:58:04] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558594 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2367.codfw.wmnet with OS bullseye completed: - mw2367 (**PASS**) - Downt...
[12:01:08] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558611 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2369.codfw.wmnet with OS bullseye completed: - mw2369 (**PASS**) - Downt...
[12:02:25] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558617 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye executed with errors: - mw2384 (**FAIL**...
[12:02:30] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558618 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye
[12:02:33] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558619 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye executed with errors: - mw2385 (**FAIL**...
[12:03:07] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558620 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye
[12:18:52] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2385.codfw.wmnet with OS bullseye executed with errors: - mw2385 (**FAIL**...
[12:18:58] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9558696 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw2384.codfw.wmnet with OS bullseye executed with errors: - mw2384 (**FAIL**...
[12:34:44] claime, hnowlan: bit of an odd one for mw2379
[12:35:01] what did you see?
[12:35:02] it seems it's an old fashioned duplicate IP on the network problem
[12:35:08] O_O
[12:35:21] I think some mistake must have happened when I manually assigned the IPs for the mw box, the IP is in use on lvs2014
[12:35:34] oh damn
[12:35:41] I don't think the LVS is doing anything on that vlan, we can probably change its IP easy
[12:37:02] Do you mean the lvs' ip or the k8s node's?
[12:37:36] The LVS IP we can change
[12:37:37] (the node is still draining)
[12:37:41] I've shut the interface for now
[12:37:46] it's drained
[12:37:47] BGP is back up on mw2379 now
[12:38:00] sry - shut the vlan2023 interface on lvs2014
[12:38:16] I know there are no realserver backends on the private1-a3-codfw vlan yet so that's safe
[12:38:37] I'll adjust the IP allocation - we must have left it out of netbox when it was allocated for the lvs
[12:38:48] ack
[12:39:16] I'll leave everything alone now for mw2379 - it can stay on its current IP
[12:39:21] sorry for the mix-up!
[12:39:35] Should I leave it cordoned until you're done to be on the safe side?
[12:39:51] No worries, it was just... weird x)
[12:39:53] yeah maybe, puppet might try to bring the interface back up
[12:40:03] ack, just ping me when you're done then
[12:40:04] yeah odd behaviour
[12:40:06] ok
[13:03:07] claime: I'm waiting on patch to be reviewed but puppet is disabled on the lvs side and manually corrected so I think you're safe to undrain mw2379
[13:03:31] BGP to it seems happy, stable for 25 mins
[13:11:57] eek, good catch
[13:13:01] Was there something that caused this to be an issue now or were there background issues that were getting hidden since it came back up?
[13:13:45] I think we probably had issues since it went live
[13:14:22] the lvs isn't doing much on that vlan, but if it sends packets with the same IP it had the potential to get us into the state we found it in earlier
[13:14:51] i.e. the top-of-rack switch cached an ARP entry tying the IP to the lvs server, at which moment comms to the mw box would have stopped
[13:15:22] there are a few race conditions there and chance of exactly what broadcasts are sent, so probably nothing went wrong till today
[13:15:32] if the lvs was busy on that vlan it'd have been apparent quickly
[13:16:41] ahhh okay
[13:17:08] last week the hold time expired in bird for the host but it only did it the once so I figured it was a once-off flap
[13:17:18] also typically IP allocations are done by automation, lvs is an exception cos they have to be on every vlan
[13:17:53] hnowlan: yeah I'd say that was caused by the duplicate IP too
[13:18:56] hnowlan: o/, re cirrussearch & jobrunners issue this morning, looking at logs it seems that MW had issues talking to eventgate, do you know what could have caused this from 11:00 to 11:45 UTC? (https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=h@4c41ef6&_a=h@7ad1327
[13:19:26] sorry wrong link, this one's better: https://logstash.wikimedia.org/goto/9c0d5bd18be5dc4ca2928fc378173267
[13:23:25] yeah, does look like some errors in eventgate around the same time https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=All&var-kubernetes_namespace=All&var-destination=eventgate-main&var-destination=eventgate-analytics&from=1708426847916&to=1708429545550
[13:24:14] I think we saw this pairing before (cirrussearch and eventgate errors rising at the same time)
[13:24:24] seems to have affected other mw clusters (mw-api-ext and int)
[13:25:16] cirrussearch is a big user of the jobqueue so when eventgate has issues, cirrussearch (via the EventBus extension) is likely to spam the logs
[13:25:32] but other jobs might have been affected
[13:27:51] weirdly it only appears to have affected search jobs at the time
[13:28:33] claime: unrelated to search but you were right earlier about there being some DB issues https://logstash.wikimedia.org/goto/4c52ce0f9ad7df7a9d7b8e6c53c4d069
[13:29:28] hnowlan: if you filter with "NOT CirrusSearch" you should see other jobs like https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2024.02.20?id=eydcxo0B1Aouzw__94kU
[13:29:52] ahhh okay
[13:34:19] not much I can see in eventgate itself, there's a steady churn of 5xx per minute but nothing that lines up with the spikes
[13:34:46] I'm gonna hop off for lunch and look at this some more when I'm back
[13:38:37] thanks for taking a look! (T249745 might have some info)
[13:48:02] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting, Patch-For-Review: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507#9559009 (Jgiannelos) Now that RESTBase/parsoid storage deprecation is almost done,...
[14:07:44] serviceops, Observability-Logging, iPoid-Service: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9559128 (JMeybohm)
[14:09:29] serviceops, Observability-Logging, iPoid-Service: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9545608 (JMeybohm) >>! In T357616#9550758, @kostajh wrote: > @colewhite @JMeybohm @Clement_Goubert I think we could mark this resolved, unless you want...
[14:15:17] topranks: ack for decordoning mw2379
[14:47:09] serviceops, Observability-Logging, iPoid-Service: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9559340 (JMeybohm) Another case of no missing container logs from mw2434, @Clement_Goubert did restart rsyslogd which was probably in a bad state: ` Feb...
[14:51:27] serviceops, WMF-JobQueue, Patch-For-Review, Unstewarded-production-error, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745#9559370 (hnowlan) We saw a recurrence of this issue this morning, with a [[ https://...
[15:04:13] serviceops, Observability-Logging, iPoid-Service: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9559455 (JMeybohm) I've created https://github.com/prometheus-community/rsyslog_exporter/pull/12 so we can collect kafka stats from rsyslogd as everythi...
[15:09:42] serviceops, Observability-Logging, iPoid-Service: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#9559465 (JMeybohm) The `rdk:broker2005` references one of the threads (I do see two) handling the connection to kafka-logging2005 (according to https://...
[15:29:44] serviceops, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Maintenance-Worktype, Patch-For-Review, Wikimedia-production-error: TypeError: Argument 4 passed to Wikimedia\Parsoid\Utils\Title::__construct() must be of the type string, null given, calle... - https://phabricator.wikimedia.org/T356024#9559633
[15:54:48] serviceops, ops-codfw: Issues reimaging servers in codfw - https://phabricator.wikimedia.org/T358001#9559853 (hnowlan)
[16:41:28] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 11 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9560239 (Sbailey) a:Sbailey→None
[16:59:32] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9560354 (akosiaris)
[21:59:58] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309#9561624 (TheDJ) > This might become a very noisy neighbour in terms of i/o and cpu usage. It might be sensible to think of ways to reser...