[06:57:19] ebernhardson: thanks!
[09:57:20] lunch
[12:47:33] o/
[13:00:40] dcausse pfischer if y'all need help today updating dashboards for new metrics LMK, I haven't done any dashboarding in a while ;)
[13:01:51] inflatador: thanks for the offer! perhaps an easy one to get started could be: https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1 ?
[13:04:15] one of the differences is that the metrics are now spread between two prometheus clusters; either we blend the two if possible or a new dropdown has to be added
[13:04:17] dcausse cool, thinking the replacement metrics would be found here? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1054317/2/team-search-platform/cirrussearch_k8s.yaml
[13:05:32] inflatador: use codesearch instead I think, search here for "CirrusSearch.poolCounter" -> https://codesearch.wmcloud.org/search/?q=CirrusSearch.poolCounter&files=&excludeFiles=&repos=
[13:06:28] then go up a few lines to see something like ->getTiming( "pool_counter_seconds" ), this should be the corresponding prometheus metric
[13:06:47] always prepend "mediawiki_CirrusSearch_" and you'll have it
[13:07:01] dcausse ACK...thanks for the advice
[13:09:00] I suppose pool counters are per DC, it should be fine having a dropdown to select between codfw & eqiad for the prometheus k8s instances
[13:20:11] \o
[13:25:27] .o/
[13:26:58] o/
[13:42:21] o/
[14:10:56] do we just edit the mw-config private settings directly on the deployment-deploy04.deployment-prep host? Mostly asking because everything is owned by jenkins-deploy:wikidev and is only writable by jenkins-deploy
[14:11:37] vs in prod where the files are owned by a somewhat arbitrary list of users
[14:11:42] "owned"
[14:15:19] no clue :/
[14:15:56] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview#MediaWiki_config is not particularly verbose
[14:24:50] yea, i didn't see anything that really said; the perms could be some artifact of the repo being cloned from elsewhere, but the last commit isn't all that long ago, a couple months
[14:31:45] my best guess is the deploy host changed from deploy03 to deploy04 after the last edit, and the file perms are still set as if it was a copy rather than the primary source
[14:39:44] hm, I should perhaps have stopped the dags before removing them
[14:40:09] or I perhaps need to delete them in the airflow UI as well
[14:40:52] dcausse: hmm, i think in the past i've deleted them from the ui
[14:41:00] ok, doing this
[14:43:52] hm... I don't see transfer_to_es_hourly that I deleted from the code
[14:44:18] will check the logs
[14:50:11] last run was for 2024-07-11T09:00:00...
[14:53:02] err, :S
[14:53:18] perhaps it's best that it's now streaming and not a job that gets stuck :P
[14:58:54] indeed :)
[15:17:47] I wish we could move image_suggestions_weekly to a streaming approach too, I find myself having to click a bit too much in the airflow UI for this one :/
[15:18:01] will certainly make a mistake at some point
[15:19:13] hmm, could we figure out a patch to provide them that would produce the events to eventgate from their job directly?
[15:19:21] using the new schema
[15:20:06] yes I hope so
[15:20:42] still unsure about the volume and potential undesirable effects on backpressure for other streams
[15:21:16] quasi-relatedly, feeling awkward about how we handle updating hive table definitions. Wrote the bits to add the two new columns to metrics but it just feels so awkward
[15:21:37] with an "alter" dag?
[15:21:47] yea. It's not the end of the world, but it's verbose
[15:21:54] yes...
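(For context, the "alter" DAG discussed above presumably boils down to issuing a Hive DDL statement against the table. A minimal sketch of what that looks like; the table and column names here are hypothetical, only the shape of the ADD COLUMNS statement is the point:)

    -- hypothetical sketch: made-up table and column names, shown only to
    -- illustrate the kind of statement such an "alter" DAG ends up running
    ALTER TABLE discovery.search_metrics
      ADD COLUMNS (
        new_metric_a BIGINT COMMENT 'hypothetical new column',
        new_metric_b DOUBLE COMMENT 'hypothetical new column'
      );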
[15:22:12] i guess i'm thinking we should look into how other teams are managing this, i think we are the only ones doing it directly from airflow
[15:22:33] I think they do it by hand still?
[15:22:58] there was a project with a "config store" but not sure it's moving forward
[15:23:00] oh :S that's not really any better. I suppose i was imagining something closer to how webapps have db migration support
[15:23:08] but i don't want to write it :P
[15:23:13] :)
[15:24:11] for the volume and backpressure, i guess i'm still hoping rate limiting the event production is "good enough"
[15:26:12] +1
[15:28:46] just looked at the mjolnir bulk updates dashboard because after more than one month of being stuck they're flowing again, and just saw that glent consumer offsets are stuck at 253 on both eqiad & codfw...
[15:29:12] by "being stuck" I mean image suggestions
[15:29:36] hmm
[15:30:48] i guess i should look a little closer into glent, i think i've seen that in the past and dismissed it as some oddity of the way lag is measured since there are recent indices (most recent in eqiad is 20240720)
[15:31:25] but i notice now that glent indices aren't being deleted as they should be
[15:32:20] hmm, yea glent_production is 20240713, and rollback is 20240615, suggesting that while indices in the middle exist they didn't make it to prod
[16:18:16] dcausse: i haven't checked every last bit, but it looks like it should be ok to make a discolytics release and update airflow now that the dags have been removed?
[16:33:45] lunch, back in ~1h
[16:42:45] ebernhardson: yes I think so
[16:43:04] errand
[16:44:54] thanks! will ship it out
[17:16:27] back
[18:11:27] * ebernhardson takes almost 5 minutes to figure out sessions has 4 s's and not 3
[18:12:51] still a long way to go on the CirrusSearch metrics replacement, but here's the first draft
[18:13:14] https://grafana-rw.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1&forceLogin&viewPanel=5
[18:15:32] * ebernhardson sighs ... worked fine in the notebook, but the rerun data seems to have nulls for the new column. And i thought i was almost done :P
[18:17:34] inflatador: personally, i kinda like having all of them on the same graph. Usually if i'm looking at pool counters it's to see if any of them have been firing. Maybe it would be useful to split out Automated since those are expected to regularly fire, as opposed to the rest which should only fire if there are problems
[18:24:52] ebernhardson ACK, so each type on the same graph except maybe "Automated"?
[18:25:49] inflatador: yea, that's what i'm thinking when looking at it. A common thing i did on the old graph was to unselect Automated so only the others displayed
[18:28:20] Cool, I'll start that up. Feel free to add any other suggestions, I'm just feeling my way through this
[18:39:59] I'm also not sure why I'm getting 2 points on the graph for the same metric...probably did something dumb ;)
[18:41:09] inflatador: my usual trick is to remove the legend so that it prints all the labels, which usually makes it clear what the variation is
[18:41:59] inflatador: also i suspect you can do more like {status="failure",type!="Automated"}
[18:42:08] and that will select everything except automated
[18:44:26] ebernhardson good call, just added that. Is there value in separating out individual types as well, or is "not automated" good enough?
[18:44:58] perhaps, sum by (type) (rate(mediawiki_CirrusSearch_pool_counter_seconds_count{status="failure",type!="Automated"}[5m]))
[18:45:15] that would give you a metric per type, and exclude Automated. The legend can then be `{{ type }}`
[18:46:43] i poked at the graph and removed the legend to see the metrics, sadly it still doesn't seem to say why there are multiple Search things mentioned :S
[18:47:41] inflatador: oh! it's that under options there is Type: Range, Instant, Both. You had Both selected. I have no clue what that does :P
[18:47:52] but Both draws 2 lines, Range or Instant each draw 1
[18:48:22] OK, that's progress ;)
[18:52:19] ebernhardson: dcausse: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/tool/EvolveHiveTable.scala
[18:52:46] (note: this is currently being updated in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1016808/28/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/tool/EvolveHiveTable.scala )
[18:53:02] that will create and alter hive tables based on jsonschemas
[18:53:17] sort of like a migration ;) ('cept it won't go backwards)
[18:55:18] ottomata: cool!
[18:55:43] ottomata: and do you schedule it, or just run it manually?
[19:15:30] * ebernhardson ponders if we can then also set up tests that assert output columns match schemas
[20:02:54] inflatador ryankemper happen to remember about how long the data transfer of the split graph files (not the full graph) takes?
[20:05:21] dr0ptp4kt: I don't, but I think 1-2 hours probably
[20:08:54] thx ryankemper . oh say, for https://phabricator.wikimedia.org/T370754 do we have a guess on a target date for that kicking off? it would be cool if we could sit in on a meet for that together (for whoever is around) and record it and quietly celebrate
[20:10:47] dr0ptp4kt the full graph took about 90m last time I did it
[20:11:13] thx inflatador
[20:12:13] dcausse: is https://phabricator.wikimedia.org/T370754 waiting on me? looks like the pre-req steps are sre-ish stuff (puppet repo etc)
[20:13:03] dr0ptp4kt: hopefully we can kick off the reload tomorrow, so we could do a data xfer on monday
[20:13:21] dr0ptp4kt: btw you meant data transfer and not an actual reload right? the reload I'd expect to take like 3-4 days IIRC
[20:13:24] car inspection, back in ~30
[20:15:25] yeah, just the data transfer ryankemper was what i was curious about. now, that said, did you mean to say the xfer tomorrow, then the actual reload monday? 🪞
[20:19:11] hmm, for some reason i thought elasticsearch had the ability to report stat groups (from the `stats` key in search queries) across all indices...but now i'm only seeing it per-index
[20:33:24] dr0ptp4kt: we need to do the reload before we can xfer
[20:34:10] (i mean in the prometheus stats exporter)
[20:51:31] back
[21:07:33] I think we did OK with the pool counter rejections rate panel based on its resemblance to the old panel. There are slight differences which (hopefully) stem from the visualization type. The viz type the old panel uses is deprecated/grayed out
[21:08:34] Probably will need help on getting requests per second for the other panel
[21:45:45] OK, I'm off...see you Tuesday!
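(Pulling together the dashboard query discussion above: a sketch of the two panel queries, using the metric name inferred earlier in the log -- "pool_counter_seconds" with the "mediawiki_CirrusSearch_" prefix and the "_count" suffix. The exact label values, and the idea that dropping the status filter gives total request volume, are assumptions rather than confirmed settings:)

    # rejection rate per pool counter type, excluding the Automated pool
    # (legend template: {{ type }})
    sum by (type) (
      rate(mediawiki_CirrusSearch_pool_counter_seconds_count{status="failure", type!="Automated"}[5m])
    )

    # requests per second per type, for the other panel -- assuming every
    # request increments the counter once with some status label
    sum by (type) (
      rate(mediawiki_CirrusSearch_pool_counter_seconds_count[5m])
    )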