[00:20:06] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Dzahn) Thanks for adding docs! That's the perfect reaction. I just wanted to create awareness originally. Your edit https://wikitech.wikimedia.org/w/index.php...
[06:35:23] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10Joe) >>! In T325131#8469178, @jhathaway wrote: > I think @Joe’s suggestion of using msmtp or another sendmail compatible client makes sense for short term solves. Are there any avenues for user injected da...
[08:54:09] hello folks, I can help with the kafka-main reboots if you want
[08:54:47] (i can also support whoever wants to do it, it should be as easy as running the reboot node cookbook)
[09:07:49] <_joe_> effie / claime / jayme ^^
[09:08:13] Right.
[09:08:53] I can get on some reboots, I'm off for the rest of the week, so probably the best use of my morning
[09:11:47] Sorry, have to reboot, laptop is getting quirky on me
[09:11:49] I'm on call next week, was planning to do my share then. Feel free to go with kafka, elukey, I'd say. It def. helps
[09:13:16] Don't know if it was already talked about: Do we plan on rebooting mw, mc, parse this year or do we want to wait for after the holidays?
[09:16:54] jayme: IIUC from Moritz's msgs risky reboots could be postponed to January, post holiday/banner season etc..
[09:17:29] yeah, that's what I understood as well. But I'm not sure if we actually consider those reboots "risky"
[09:17:31] maybe mw nodes could be done earlier on, easier to depool etc..
[09:18:20] I wouldn't touch the memcached nodes for example
[09:18:53] in theory we are 100% resilient etc.. but we'd lose the in-memory cache, failover to the gutter pool, etc..
[09:19:05] that seems unnecessary to me at this point of the year
[09:19:54] the rest is debatable, if everything is depooled/rebooted/repooled at a slow pace it should be fine
[09:20:17] same thing for kafka probably, we can push it to January now that I think about it
[09:20:45] (I'll be out the first two weeks though)
[09:24:23] ah I totally forgot that we have a kafka reboot cookbook
[09:24:31] uuh :)
[09:26:46] I'd agree that mw and parse should be fine to do (or start with) when going slow. It would also be nice to not have to do all the reboots first thing in the year :)
[09:27:56] 10serviceops, 10Observability-Metrics, 10Prod-Kubernetes, 10Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (10JMeybohm)
[09:28:41] 10serviceops, 10Observability-Metrics, 10Prod-Kubernetes, 10Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (10JMeybohm)
[09:28:59] testing the kafka reboot cookbook with kafka-test
[09:29:08] _joe_ probably has an opinion on the above ;)
[09:37:13] 10serviceops, 10Observability-Metrics, 10Prod-Kubernetes, 10Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (10taavi) > Promehteus 2.24.1 (bullseye) does not support client cert auth for kubernetes_sd, we would need to have 2.33.5 f...
[09:38:05] 10serviceops, 10Observability-Metrics, 10Prod-Kubernetes, 10Kubernetes: Migrate use of infrastructure_users tokens to client certificates - https://phabricator.wikimedia.org/T325268 (10JMeybohm) >>! In T325268#8470122, @taavi wrote: >> Promehteus 2.24.1 (bullseye) does not support client cert auth for kub...
[09:39:42] ok so the reboot cookbook for kafka seems to be doing the right thing
[09:40:24] maybe to be extra careful for main I'd tune the two parameters to wait a bit more between reboots, to allow more time for kafka servers/clients to adapt
[09:40:55] --batch-sleep-seconds 600 (default 300) --sleep-before-pref-replica-election 1200 (default 900)
[09:41:03] it will take longer but who cares :)
[09:43:18] anyway, lemme know if you want me to do it or to assist :)
[09:43:52] elukey: u still around next week=
[09:43:55] *?
[09:46:25] yep
[09:47:01] cool. we can do that together early next week then if you're fine with that
[09:47:18] for the mw* hosts the risk is minimal since we already have plenty of them running the latest kernel (thanks to recently bringing in a new batch of servers), so even the theoretical risk of some minor kernel change breaking it is gone
[09:48:07] <_joe_> frankly, the mw hosts are the easy target
[09:48:10] <_joe_> those and k8s hosts
[09:48:22] k8s hosts are not affected
[09:48:26] just the etcd clusters
[09:49:17] <_joe_> what is the thread scenario for etcd clusters?
[09:50:19] <_joe_> *threat
[09:52:03] <_joe_> no unsecured user input is ever processed there, only SRE input
[09:52:24] I would assume it's at least small as we don't allow direct access but only k8s..yeah, that
[09:53:02] plus we're going to reimage them anyways with the k8s 1.23 update
[09:53:10] <_joe_> ah you're talking about the k8s etcd clusters, then I can see some argument there
[09:53:54] but still way less risky than mw or k8s I'd say
[09:54:48] so: if we're fine with it I can take on mw hosts and kafka next week
[09:55:06] plus the small stuff like dragonfly, docker-registry etc
[10:40:16] Can
[10:40:46] Sorry
[10:44:10] dear all
[10:44:44] nemo-yiannis and I will move maps from codfw to eqiad
[10:45:43] we expect that tegola on eqiad will suffer slightly more than it already does, until its cache gets to a decent state
[10:46:13] so we would like to check if it is ok to maybe give it a wee bit more resources
[10:46:24] let me get you our current numbers
[10:49:02] requests:
[10:49:02]   cpu: 200m
[10:49:02]   memory: 512Mi
[10:49:02] limits:
[10:49:02]   cpu: 400m
[10:49:02]   memory: 2048Mi
[10:49:06] and 9 replicas
[10:51:09] there is some throttling going on already
[10:53:43] <_joe_> add more replicas or add cpu
[10:53:50] <_joe_> whatever works better
[10:55:10] cool thanks !
[11:22:58] 10serviceops, 10MW-on-K8s: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10Joe) a:03Joe My current idea for attacking this is the following: * install msmtp and msmtp-mta in the php multiversion base image * add a configmap to the k8s chart to write down the configuration in `/...
[11:30:51] 10serviceops: Migrate mediawiki_http_requests alerts to AlertManager - https://phabricator.wikimedia.org/T325277 (10Clement_Goubert)
[11:39:39] 10serviceops: Migrate mediawiki_http_requests alerts to AlertManager - https://phabricator.wikimedia.org/T325277 (10Clement_Goubert) Looking at this type of query where `$latency_threshold` is our request duration cutoff point for alerting, and `$rate_threshold` is the minimum number of requests in the last 2 min...
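Going back to the kafka-main reboots discussed earlier this morning, a minimal sketch of what such a cookbook run might look like from a cumin host is below. Only the two sleep flags and their defaults are quoted in the log above; the cookbook name, target alias, reason flag and argument layout are placeholders/assumptions, not a record of what was actually run.

```bash
# Sketch only: <kafka-reboot-cookbook> and the alias/reason arguments are
# assumptions; the two sleep flags (defaults 300/900) come from the discussion above.
sudo cookbook <kafka-reboot-cookbook> \
  --reason "pre-holiday kernel reboots" \
  --batch-sleep-seconds 600 \
  --sleep-before-pref-replica-election 1200 \
  kafka-main-eqiad
```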
[11:40:04] 10serviceops: Migrate mediawiki_http_requests alerts to AlertManager - https://phabricator.wikimedia.org/T325277 (10Clement_Goubert) p:05Triage→03Low
[12:19:27] 10serviceops: Decommission mc2019-mc2036 - https://phabricator.wikimedia.org/T313733 (10akosiaris)
[12:19:32] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10akosiaris)
[12:29:50] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[12:31:21] 10serviceops, 10Data-Engineering-Radar, 10MW-on-K8s, 10Patch-For-Review: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (10Clement_Goubert) You're not, much thanks :D
[12:37:52] effie: your puppet patch a20a38fdc17bb5526e50a73783c782b27639495f broke puppet on the cumin hosts
[12:38:01] that have the redis sessions as config file for spicerack
[12:38:16] oh dear, I actually never saw this coming
[12:38:21] I will fix it
[12:38:25] if you've removed all redis, then we can remove the config file from spicerack too :D
[12:38:40] thanks! :)
[12:38:49] not all redis, I removed redis_sessions
[12:38:57] terribly sorry I didn't account for this
[12:39:08] no prob
[12:39:40] I also have active alerts which I will deal with in a bit
[12:39:41] sigh
[12:39:54] effie: FYI a git grep of redis::shards still shows a few matches, not only spicerack
[12:43:18] the rest are alright
[12:49:23] effie: no need. didn't see this discussion until now, just made a patch when I noticed the failing puppet run on cumin2002
[12:49:37] now which one
[12:49:42] the cumin patch ?
[12:49:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/868393/
[12:50:01] will merge in a bit
[12:50:33] oh nice
[12:50:38] I will abandon mine
[13:03:22] hnowlan: here?
[13:03:34] we have an OSM issue and I think it is a pg issue
[13:04:40] effie: yep, whassup?
[13:05:12] would you please go to maps1009 and do a journalctl -u imposm
[13:08:01] oof
[13:08:46] but it appeared it ran at 13:00 ?
[13:08:54] it appears*
[13:09:23] same happening on maps2009
[13:09:47] do you have any indication as to what this is?
[13:09:56] not immediately, will keep looking
[13:09:58] overlapping imports or something?
[13:11:43] nemo-yiannis: ^
[13:12:49] Maybe ? The only thing I found out is that since Monday it has not finished a complete run
[13:12:58] maybe just worth restarting imposm ?
[13:13:14] thing is there's so much log spam it's hard to see what the original command that failed was
[13:13:44] there should be a single command that failed that caused this entire transaction to fail
[13:14:01] I'd be open to a restart but give me a minute to do some digging
[13:14:20] bit suspicious that it happened on both masters at once
[13:14:40] 2022-12-12 20:10:18 GMT [4190]: [3-1] user=osmupdater,db=gis,app=[unknown],client=[local] ERROR: value "-555555555500" is out of range for type integer
[13:14:55] that's the start of the failed transaction
[13:15:31] same on 2009 and 1009
[13:16:00] since the 12th
[13:16:26] is this coming from the osm data?
[13:18:11] would that mean that we received bad data to begin with?
[13:18:46] yes, i think that's what happened, we received bad data from OSM
[13:19:21] oh dear, now we will have to create another thing that will validate their data
[13:21:05] do the repeated failures mean that we are *still* getting bad data though? Or is imposm trying to pick up where it left off?
[13:21:07] so, what are our options
[13:23:14] I think that imposm import has been failing since monday
[13:23:26] because some progress imports took like 50 hours
[13:24:04] yeah the 12th is the first occurrence of that
[13:24:27] Let's restart and see if the error is fixed on the data diff from OSM
[13:24:59] the issue that i see is that imposm is going to apply the diffs since then
[13:25:07] so the problem might re-occur ?
[13:25:20] yeah :/
[13:25:35] codfw is currently depooled, let's see what happens there
[13:25:55] ok sounds good
[13:26:05] it appears there are still *some* updates in progress judging by the processlist
[13:26:27] `INSERT INTO "public"."wikidata_relation_members" ("osm_id", "geometry", "wikidata") VALUES ($1, $2::Geometry, $3)` for example - although that doesn't look like imposm data
[13:26:39] restarting codfw anyway
[13:26:42] ok
[13:28:25] effie: just fyi i don't think it has something to do with the work done this week
[13:29:08] me neither, but it appears like we need an alert for this
[13:29:12] yeah
[13:29:18] once we figure out what is up
[13:29:39] looks like imposm is trying to talk to tilerator/cassandra somehow on startup? Doesn't seem critical
[13:29:55] nvm, old log
[13:31:33] import is starting on 2009 after the restart
[13:32:01] oh hmm
[13:32:43] from the imposm logs pre-import, some funky timestamps. These happen in order in the imposm log: 17h9m0s 49h4m0s 40h17m0s
[13:33:02] that looks like imposm got bogged down and was trying multiple imports in parallel
[13:33:11] yeah
[13:33:34] it got stuck because of the failures
[13:34:02] `Importing #89845 including changes till 2022-12-12 20:00:00 +0000 UTC (65h28m51s behind)` let's hope it doesn't pick up the bad data
[13:34:42] there are some ~70 hours old which would also line up somewhat with the start of the failures. bit concerning that it just maintains the parallel imports
[13:34:49] failed again
[13:35:49] could be a bit of a problem that it's dumping almost an entire import into logs too :)
[13:36:09] more detail on that error
[13:36:09] Dec 15 13:34:45 maps2009 imposm[25425]: [2022-12-15T13:34:45Z] 0:06:26 [warn]: SQL Error: pq: value "-555555555500" is out of range for type integer in query INSERT INTO "public"."planet_osm_line" ("osm_id", "tags", "way", "aeroway", "access", "bridge", "highway", "name", "railway", "ref", "tunnel", "waterway", "z_order") VALUES ($1, $2, $3::Geometry, $4, $5, $6, $7, $8, $9, $10, $11, $12,
[13:36:16] $13) ([367609239 "building"=>"roof", "layer"=>"-55555555550" 0102000020110F0000070000006453F730A4774341617B1931716F5141E68BB54CAC7743412D75FBB0726F5141DD93BF13AE7743412C6F51D66F6F51411CB13CA6AF774341C0597C536D6F51411164805BA8774341693D08CC6B6F514130B5DA92A7774341F7EBCBD56C6F51416453F730A4774341617B1931716F5141 -555555555500])
[13:37:16] would overriding the cache fix this?
[13:37:19] failed again
[13:37:21] :/
[13:37:51] i think it will rebuild the cache every time
[13:38:22] 10serviceops, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10Andrew) >>! In T325244#8469338, @jijiki wrote: > That sounds alright, but if wikitech is still usi...
[13:39:20] depends on whether we're doing appendcache or overwritecache afaict, any idea?
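In the failing INSERT above, the out-of-range value is the last parameter, i.e. the z_order column of public.planet_osm_line (it is 10x the bogus "layer"=>"-55555555550" tag on way 367609239). One option floated a few lines further down is to widen that column to bigint; purely as an illustration, and assuming local superuser access to the gis database, that would look roughly like this:

```bash
# Illustration only, not what was actually run on maps1009/maps2009.
# Check the current column type, then widen it so the oversized z_order value fits.
sudo -u postgres psql -d gis -c '\d public.planet_osm_line'
sudo -u postgres psql -d gis \
  -c 'ALTER TABLE public.planet_osm_line ALTER COLUMN z_order TYPE bigint;'
```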
[13:40:34] not sure
[13:40:38] also cache is binary
[13:40:42] so that will also be tricky
[13:41:45] we could change the schema to bigint
[13:44:23] I don't have a huge amount of insight here but it looks quite like that data is corrupt rather than it being our schema
[13:44:31] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move away from system:node RBAC role - https://phabricator.wikimedia.org/T299236 (10JMeybohm) 05Open→03Resolved
[13:44:33] gotta step away for a few minutes
[13:44:37] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[13:44:43] we should probably open a ticket for this at least
[13:46:00] I am late to the party and didn't understand if we are doing a fresh re-import or just trying to restart the imposm cron
[13:46:27] Imposm cron fails since monday because of an out of range value
[13:46:51] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Update cert-manager to 1.10.x - https://phabricator.wikimedia.org/T325292 (10JMeybohm)
[13:47:25] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[13:48:21] on the latest fresh install, is that it?
[13:48:25] filing a ticket
[13:48:32] mbsantos: both
[13:48:36] clusters
[13:49:09] I see, thanks
[13:50:42] 10serviceops, 10Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (10Jgiannelos)
[13:50:57] 10serviceops, 10Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (10Jgiannelos) p:05Triage→03High
[13:56:31] 10serviceops, 10Maps: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (10Jgiannelos) From a quick look at the data this is a key=>value on hstore column "tags"
[14:04:46] we could exclude this specific failing tag in the imposm mapping yaml
[14:19:04] mbsantos: Theoretically the failing OSM id is: 367609239 can you help me figure out if it's actually fixed in the current versions?
[14:20:43] 10serviceops, 10Content-Transform-Team-WIP, 10Maps, 10Patch-For-Review: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293 (10Jgiannelos)
[15:24:54] jayme: thanks for review on flink-kubernetes-operator, responded and pushed new patch.
[15:25:14] ottomata: cool. I just added another thing I forgot :D
[15:25:21] flink-app chart also ready for review: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/866510 as well as images: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865100 :)
[15:25:23] oh!
[15:25:50] oh my looking.
[15:27:51] jayme: okay so, we need to make it possible for KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT to be overridden in the container envs?
[15:28:10] unfortunately yes
[15:29:16] there is support for setting e.g. operatorPod.env, would that suffice?
[15:33:04] hm i see kubernetesApi info is set in other values files...which I suppose I can't reference in the admin_ng value.yaml file, can I? ...
[15:37:20] trying to do it without modifying the chart more. I can def inject those env vars without modifying, but I can't get the values from e.g. kubernetesApi.host. I'd have to set them in operatorPod.env directly in each of the cluster/env specific values files.
[15:37:53] not sure which is better? modifying the upstream chart, or adding a duplicate value in the cluster/env k8s values files for operatorPod.env
[15:38:00] hmm probably modifying the chart..
[15:47:00] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10jhathaway) >>! In T325131#8470355, @Joe wrote: > My current idea for attacking this is the following: > * install msmtp and msmtp-mta in the php multiversion base image sounds good,...
[15:47:28] oh, wait, kubernetesApi is set per admin_ng cluster release (helmfile), not per cluster global. hm. jayme i see that
[15:47:30] ./helmfile.d/admin_ng//values/common.yaml- host: kubernetes.default.svc.cluster.local
[15:47:36] will kubernetes.default.svc.cluster.local usually work?
[15:54:19] okay, i think i got it, i need to set it in the proper k8s cluster values file anyway. i can do this without modifying the chart.
[15:56:39] 10serviceops, 10Campaign-Tools, 10MW-on-K8s, 10Patch-For-Review: Setup sendmail on k8s container - https://phabricator.wikimedia.org/T325131 (10ldelench_wmf)
[15:57:11] ah oops, pushed to the wrong gerrit patchset
[16:04:02] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) a:05hnowlan→03Vlad.shapik
[16:05:27] okay fixed.
[16:05:37] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: High error rate on maps.wikimedia.org - https://phabricator.wikimedia.org/T325309 (10Jgiannelos)
[16:18:05] ottomata: sorry, was away already. I can check back tomorrow. In this case I think it might be easier/better to add it to the chart rather than duplicating another value (at least this is what we do for other charts already)
[16:37:49] 10serviceops, 10SRE: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10akosiaris)
[16:50:14] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: High error rate on maps.wikimedia.org - https://phabricator.wikimedia.org/T325309 (10akosiaris) p:05Triage→03High
[17:10:40] oh jayme okay, but it looks like the kubernetesApi values are duplicated per release anyway? (will add comment to patch ttyl)
[17:48:54] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: High error rate on maps.wikimedia.org - https://phabricator.wikimedia.org/T325309 (10Jgiannelos) This has been resolved after we rolled back to serve traffic back from codfw. Waiting for higher hit rate in eqiad maps storage.
[17:49:00] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: High error rate on maps.wikimedia.org - https://phabricator.wikimedia.org/T325309 (10Jgiannelos) 05Open→03Resolved
[17:49:04] 10serviceops, 10Content-Transform-Team-WIP, 10Maps: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 (10Jgiannelos)
[18:38:19] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) It seems that I found where the trick is. As it turned out the failed SVG file has a small body, as a result, the source in the prepare_sou...
[18:48:25] what is the best way to have a service(s) deployed to codfw staging? I don't need codfw staging for anything per se, but the code currently deployed there has an open pool of connections with a database I want to decommission. Are there instructions somewhere that I could use to do it myself? Should I open a phab?
[18:50:52] urandom: is it about a new service that does not exist yet or an existing service and just the actual deployment of a new version of it?
[18:51:27] the former should be a Phab ticket, there is a template for it
[18:51:42] the latter is .. needs deployment server shell access
[18:51:49] mutante: the latter
[18:52:26] it's services that are already running there, but they need a deployment if they are to continue to do so, since they're connecting to a db cluster that needs to come offline
[18:53:29] mutante: are there docs for this, https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments doesn't seem to cover it
[18:53:31] urandom: so technically the deployment itself is steps 5 to 7 on https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments
[18:53:44] but you will need a code review and +2 first presumably
[18:53:50] and that needs SRE
[18:53:59] mutante: that doesn't cover codfw staging tho
[18:54:39] oooh.. I see. because there are "staging, eqiad and codfw" only to pick from
[18:55:04] yes, and "staging" only deploys to staging in eqiad
[18:56:52] ok, I see. I suggest doing the Phab ticket or sending an email to the serviceops team
[18:57:18] people in Euro timezone will have a response I'm sure
[18:57:28] yup, ok
[20:27:04] 10serviceops, 10SRE, 10Thumbor, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Andrew) Huh, is anyone tasked with this? This is one of the few cases that's keeping Stretch alive in cloud-vps and prod.
[20:39:02] urandom: there is a staging-codfw environment named on the deploy servers. I've never tried to target it specifically, but if you know the service namespace you should be able to do something like `kube_env $NAMESPACE staging-codfw; kubectl get pods` to verify that things are where you expect and then use helmfile to re-deploy.
[20:39:39] Running `kube_environments` on a deployment server is a handy way to find out what clusters are available
[22:03:44] 10serviceops, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudweb hosts are using the profile::mediawiki::nutcracker profile to configure nutcracker - https://phabricator.wikimedia.org/T325244 (10Andrew) 05Open→03Resolved a:03Andrew
[22:27:54] bd808: so like...`helmfile -e staging-codfw ...` ?
[22:28:23] yeah, that's what I would expect to work
[22:30:20] Narrator: But it didn't
[22:30:24] :)
[22:30:34] :sad trombone:
[22:30:48] `err: no releases found that matches specified selector() and environment(staging-codfw), in any helmfile`
[22:31:39] Just to double check, you were in the directory of the service that you wanted to update when you ran that?
[22:31:45] I was, yeah
[22:32:19] Back to waiting for the folks who actually built this out to help you I guess. Sorry
[22:32:53] which makes sense now that I see the error, because there is no selector for staging-codfw in those files
[22:34:05] yeah, this is kind of a one-time deal... any almost any other situation, I could pretend codfw staging didn't exist and be OK
[22:34:20] s/... any/.../
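Pulling the staging-codfw inspection steps suggested above into one sketch, to be run on a deployment server; <namespace> is a placeholder for the service's k8s namespace, and the commands are the ones quoted in the conversation rather than a documented procedure:

```bash
# Sketch of the inspection path discussed above; <namespace> is a placeholder.
kube_environments                     # list the clusters the deploy server knows about
kube_env <namespace> staging-codfw    # point kubectl at the codfw staging cluster
kubectl get pods                      # confirm the pods you expect are there

# Note: running `helmfile -e staging-codfw ...` from the service's directory failed
# here with "no releases found ... environment(staging-codfw)", i.e. the service
# helmfiles define no selector for that environment.
```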