[10:32:43] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (TheDJ) eqiad is in sync now.
[10:35:04] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (Jgiannelos) This is from the last log entries: ` Jan 06 10:03:04 maps1009 imposm[14255]: [2023-01-06T10:03:04Z] 47:40:18 [info] Importing #90435...
[10:35:42] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (Jgiannelos) Open→Resolved a:Jgiannelos
[10:40:07] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (Fl.schmitt) Will codfw follow, too? It [[https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12&from=now-30d&to=now|see...
[10:41:38] serviceops, Content-Transform-Team-WIP, Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (Jgiannelos) There is a full planet import that is pending: https://phabricator.wikimedia.org/T314472
[11:00:51] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 05): k8s deployment-charts mesh module should allow use of mesh without public_port Service - https://phabricator.wikimedia.org/T326252 (EChetty)
[11:01:29] serviceops, Data-Engineering-Planning, SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (EChetty)
[11:02:17] serviceops, Data-Engineering-Planning, SRE-OnFire, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (EChetty)
[11:02:23] serviceops, Data-Engineering-Planning, Discovery-Search (Current work), Event-Platform Value Stream (Sprint 05), Patch-For-Review: Flink on Kubernetes Helm charts - https://phabricator.wikimedia.org/T324576 (EChetty)
[12:49:32] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (EChetty)
[12:50:15] serviceops, Data-Engineering-Planning, Event-Platform Value Stream, SRE-OnFire, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (EChetty)
[13:04:35] serviceops, Release Pipeline, SRE, Epic, Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (LSobanski)
[13:17:41] serviceops, Commons, MediaWiki-File-management, SRE, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (LSobanski)
[14:05:10] o/ jayme how's it lookin :)
[14:05:30] ottomata: good morning o/ currently looking at the images
[14:05:39] eheh
[14:05:43] not that bad :)
[14:08:05] ottomata: if you feel like it you could merge the two operator CRs already so we can get CI to green-light the flink-app chart CR
[14:08:31] well...maybe split the admin_ng part out of the second CR to avoid deployment hiccups
[14:08:54] okay
[14:08:57] i'll do that
[14:16:10] okay, merging the first two, admin_ng stuff now in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/876200
[14:18:37] sounds good!
[14:25:27] ottomata: the image CR adds the operator in version 1.3.0 - the operator chart is for 1.2.0. Is that on purpose?
[14:26:45] oh! no i just updated the image yesterday but i guess i forgot to update the operator!
[14:26:51] will do that now :p
[14:27:09] cool
[14:38:49] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Jclark-ctr) Sorry did not give update. Case# 159648923 was submitted 1/4/2023 Idrac was not reachable remotely. Reset Idrac with crash cart 1/6/2023 TSR...
[14:47:21] ty, looking at image now
[15:58:06] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) p:Triage→Low
[16:00:54] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert) racadm getsel log: ` ------------------------------------------------------------------------------- Record: 5 Date/Time: 01/...
[16:01:00] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Clement_Goubert)
[16:04:20] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (Clement_Goubert) Thanks for the update. I will extend the downtime to two weeks from now, will revisit if necessary.
[16:05:50] serviceops, DC-Ops, ops-eqiad: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c4b686a-9560-44c1-acb3-c16978d72b37) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1...
[16:10:49] jayme: still testing some things so haven't uploaded new patch yet, but i replied to all your comments
[16:10:55] esp ones about entrypoint
[16:10:58] lemme know what you think
[16:11:18] hm actually i'll go ahead and push patch, still testing images though
[16:45:05] ottomata: have there really been zero changes between the operator chart 1.2.0 and 1.3.0?
[16:49:19] jayme testing now, should have tested that more extensively
[16:50:01] hm, there have been... will fix.
[16:58:24] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) Thank you for deploying will investigate today while on site
[18:13:25] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Jclark-ctr) Created ticket Confirmed: Service Request 159722060 was successfully submitted. Submitted TSR report to Dell
[18:19:46] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Dzahn) 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:38] serviceops, DC-Ops, SRE, ops-eqiad: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Dzahn) Open→In progress
[18:57:23] ergh, am having trouble testing a helm chart locally because I can't pull the image from my local docker cache
[18:57:41] i've done this before, but i can't seem to remember how...
[18:57:50] i have
[18:57:53] image:
[18:57:53] repository: docker-registry.discovery.wmnet/flink-kubernetes-operator
[18:57:53] tag: 1.3.0-wmf1
[18:57:53] and
[18:58:01] docker-registry.wikimedia.org/flink-kubernetes-operator 1.3.0-wmf1 64677350965a 3 hours ago 385MB
[18:58:06] OH
[18:58:15] wait. wikimedia.org vs discovery.wmnet!
[18:59:19] serviceops, GitLab, serviceops-collab, Kubernetes: Trusted gitlab runner containers need access to staging k8s cluster - https://phabricator.wikimedia.org/T325385 (dduvall) (Orthogonal but worth a discussion in our next meeting.) Perhaps this is a good time to start looking at something like [[...
[19:02:57] thank you IRC for being a rubber ducky.
[20:14:02] okay, jayme:
[20:14:02] - flink-kubernetes-operator chart updated for 1.3.0: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/876249
[20:14:02] - images patch ready for review again: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/858356
[20:14:02] - flink-app looking good too: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510