[07:06:09] i.nflatador and I were running into errors like the following on `deployment-prep` today:
[07:06:10] [2022-08-30T06:38:10,807][ERROR][o.e.b.Bootstrap ] [deployment-elastic09-beta-search] Exception java.lang.IllegalStateException: The index [[.tasks/vxr3cHCJT_ioV0xUKMDvSg]] was created with version [5.5.2] but the minimum compatible version is [6.0.0-beta1]. It should be re-indexed in Elasticsearch 6.x before upgrading to 7.10.2.
[07:07:10] how do we normally reload indices on deployment-prep? we probably need to regenerate the indices entirely I'd wager
[07:20:02] ryankemper: sigh... I'll rebuild the indices there
[07:45:31] we need to rollback to 6.8 first I think
[07:45:57] also seeing java.security.AccessControlException: access denied ("java.io.FilePermission" "/nonexistent/.aws/config" "read"), (S3 plugin) not sure if this is known
[07:48:19] gehel: if you have a minute: https://gerrit.wikimedia.org/r/c/operations/puppet/+/827567
[07:49:11] Looking
[07:51:53] dcausse: so we need to rollback to 6.8, rebuild all 5.x indices, upgrade again to 7.10?
[07:52:07] sadly yes :/
[07:53:17] or we can wipe out everything on disk and rebuild from scratch, but that's not something we want to explore :)
[07:54:01] thanks!
[07:55:26] merged
[07:55:36] do you need me to do the elasticsearch rollback as well?
[08:01:44] gehel: I thought that puppet would do that?
[08:01:51] elastic is down anyways
[08:02:16] no, it won't rollback the packages, since we want to have more control as to exactly when those packages are upgraded
[08:02:25] ah right
[08:02:47] if it's just sudo apt I think I can do it
[08:03:28] it should be run-puppet-agent; apt-get install ...; check that everything works and fix all mistakes
[08:03:36] I'm trying it on elastic09
[08:03:42] ok
[08:04:01] oh, and apt-get update as well, just to be sure
[08:06:43] sudo apt-get install elasticsearch-oss=6.8.23 wmf-elasticsearch-search-plugins=6.8.23-5~bullseye
[08:08:46] the node started apparently
[08:09:50] cluster is green, I'll run the reindex
[08:09:57] I had to restart it manually
[08:10:15] ok
[08:10:55] the other nodes had not been upgraded yet
[08:11:02] I've re-enabled puppet on all nodes
[08:11:31] I had to restart elasticsearch manually, but that's kind of expected. And would have been done automatically at the next puppet run
[08:11:57] dcausse: let me know if you see anything weird, but we should be good for now
[08:12:11] gehel: thanks!
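Pulled together, the per-node rollback described above looks roughly like the sketch below; the package versions and puppet step come from the log itself, while the systemd unit name is a placeholder (the real instance name on deployment-prep may differ):

```bash
# Sketch of the per-node rollback to 6.8, assuming puppet has already been
# re-enabled with the reverted config. The unit name is a placeholder.
sudo run-puppet-agent
sudo apt-get update
sudo apt-get install elasticsearch-oss=6.8.23 \
    wmf-elasticsearch-search-plugins=6.8.23-5~bullseye
sudo systemctl restart elasticsearch_6@beta-search.service  # placeholder unit; a manual restart was needed here
curl -s localhost:9200/_cat/health?v                        # confirm the cluster comes back green
```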
[08:14:14] bah, Rejecting mapping update to [enwiki_content_1661847195] as the final mapping would have more than 1 type: [_doc, page] :(
[08:14:43] we need to revert cirrus patches as well...
[08:18:52] hm no it's not that... es7 patches have not been merged yet
[08:20:31] ah this one needs to be reverted https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/819073
[08:22:43] but I don't want this revert to be part of the train :/
[08:27:11] well... it should not hurt if it's part of the train actually
[09:09:41] Erik made cirrus-streaming-updater to be tested with java 11 rather than 8
[09:10:08] hashar: yes, we should use only java11 now
[09:10:17] I am wondering whether `wikimedia/discovery/discovery-parent-pom` should be tested with both (right now it is only java 8)
[09:10:35] hashar: it should indeed become the new default I suppose
[09:11:05] and Debian Bullseye has Java 17 :D
[09:12:14] we're not there yet! :)
[09:12:45] but jdk8 won't be back-ported to debian forever
[09:13:29] our SRE did the backport since the underlying Debian OS we had to upgrade to no longer shipped it
[09:13:51] so we probably have to upgrade everything now
[09:14:05] true, but they don't want to backport it forever, yes
[09:14:33] on our side only wdqs is a major pain, because of blazegraph
[09:14:49] the rest should already be compatible with java11
[09:21:24] great
[09:21:33] should I add java 11 to the parent pom?
[09:21:42] or is solely using java 8 good enough?
[09:22:09] compiling the parent pom with a jdk11 should have no impact
[09:22:57] setting default compile settings to java11 in the pom will require downstream projects to explicitly specify java8 if they want to stick to java8
[09:24:40] also I suppose that jdk11 is able to compile java8 classes (but I vaguely remember some issues doing so)
[09:26:01] reindexing all beta wikis with es 6.8 now
[09:28:02] I will leave the parent pom on java 8 for now ;)
[09:29:36] ok, we should probably create a task to coordinate on this
[09:32:58] on the parent pom tho, since it has no classes to compile it's completely fine to use a jdk11 image for the CI
[09:37:32] errand + lunch
[10:52:48] lunch
[12:29:10] almost done with the index cleanup on beta
[12:37:07] hm... wikidata does not allow the archive type... I suppose it's on purpose, deleting this index
[12:39:31] gehel: we should be good to revert back to es7 on deployment-prep now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/827574
[12:40:17] dcausse: can we wait for ryankemper to be around and take that over?
[12:40:37] sure
[12:40:47] I don't think I can find enough uninterrupted time this afternoon to take that on
[12:41:06] ok no problem
[15:04:36] \o
[15:07:08] doh, got all the prod clusters in order but forgot to reindex beta :(
[15:10:09] o/
[15:10:42] I missed this step too, we use beta too rarely to care :(
[15:12:40] the bright side is that this cluster no longer contains old wiki indices like zerowiki :)
[15:12:53] :)
[15:14:32] hm... translate might require the es6 compat transport on the mw-config patch
[15:14:42] looks like it does not reuse cirrus transport config :(
[15:19:54] good find!
[15:20:18] should we start force-merging the es7 branch stuff now too?
[15:25:14] ebernhardson: I was hoping to have elastic7 running on deployment-prep before, but it should not hurt much to merge (we might fail some updates tho without the BC transport)
[15:25:34] hmm, yea i suppose we can wait, it's no big rush
[15:34:29] meh, super tedious to rebase or cherry-pick a merge commit. probably simpler to recreate the merge and re-use the Change-Id
[15:34:48] not actually clear if it's necessary or not....
[15:36:26] ebernhardson: yes...
[15:38:46] storm is coming, I might drop randomly if power goes down
[15:44:30] slightly jealous, i don't think we've had rain for months and months :P
[15:46:38] well, not really rain, apparently it's hailing :)
[15:47:22] bah, was about to test one last thing on translate and my setup broke... No database connection (unknown) :(
[16:07:04] Will roll the es7 upgrade on deployment-prep in ~15 mins
[16:31:39] hmm, i suppose one curiosity is that on beta $wmgVersionNumber === 'master'. 'master' >= '1.39.0-wmf.28', so mostly by luck the transport wrapper will still end up configured
[16:31:52] but it should be a functional no-op
[16:32:05] it will add the line in, but elastic won't care
[16:34:17] it'll re-add the _doc type to bulk requests indeed, and we know it works because that's what we send with elastica6 I suppose
[16:35:48] unrelated to the above, i realized i paused a dag in airflow and forgot to unpause it afterwards. mjolnir-bulk is now processing through some ores updates, but we need a better way to assert the set of dags that should be enabled
[16:36:45] indeed :/
[16:37:46] I also stopped relforge query imports and you re-enabled it (but just realizing this now)
[16:38:31] can airflow have some prometheus metrics exported?
[16:40:00] not sure, there is a project from robinhood that exposes some metrics but i've never looked at it: https://github.com/robinhood/airflow-prometheus-exporter
[16:40:33] robinhood is perhaps not everyone's favorite company, and no clue how their tech prowess is, but it probably won't disappear overnight
[16:40:47] oh, actually this is last updated 3 years ago though, so maybe it will :P
[16:41:17] :)
[16:43:53] looks like this runs in airflow context and talks directly to the db to pull information to report, seems plausible
[16:47:33] would maybe prefer if it queried an external api somehow, but i guess in 1.10.6 the api is at /api/experimental/ so they might not have considered it stable enough to report against
[16:51:48] I was hoping that airflow would natively report something :/
[16:52:58] they document some statsd based metrics, but nothing for prometheus
[16:53:44] there's a statsd_exporter
[16:54:09] but when I see "_start" that might not work well with prometheus
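If the statsd route were pursued, the rough shape would be to point airflow's built-in statsd metrics at a prometheus statsd_exporter. The config keys and flags below are recalled from airflow 1.10 and statsd_exporter documentation rather than anything deployed here, so treat this as a sketch:

```bash
# Sketch only: wire airflow's statsd metrics into prometheus via statsd_exporter.
#
# airflow.cfg, [scheduler] section (assumed defaults, adjust to taste):
#   statsd_on = True
#   statsd_host = localhost
#   statsd_port = 9125
#   statsd_prefix = airflow
#
# Run the exporter so it translates statsd packets into a /metrics endpoint
# that prometheus can scrape.
./statsd_exporter \
    --statsd.listen-udp=:9125 \
    --web.listen-address=:9102 \
    --statsd.mapping-config=airflow-statsd-mapping.yml  # hypothetical mapping file to tidy up metric names
```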
[17:05:18] meh... writing BC code with nonexistent classes is a bit of a pain with phan :(
[17:12:40] * ryankemper can't remember how to do the equivalent of `puppet-merge` for deployment-prep
[17:13:18] There's a `ryankemper@deployment-puppetmaster04` but `puppet-merge` doesn't seem to work on it (`Error reading variables from /etc/puppet-merge.conf`)
[17:16:38] ryankemper: hmm, it seems to not like that MASTERS and WORKERS are blank, i suspect though that beta cluster doesn't use explicit merges and somehow auto-magically gets puppet updates
[17:17:01] * ebernhardson looks for whatever automation does that
[17:19:00] I can probably just manually pull or reset to the latest `origin/production` commit in `ryankemper@deployment-puppetmaster04:/var/lib/git/operations/puppet`
[17:19:11] But yeah I also figure there's automation so it will prob auto update in like the next 20 mins
[17:25:15] ryankemper: best guess would be this host uses role::puppetmaster::standalone (per horizon), which uses puppetmaster::gitsync and defaults to updates every 10 minutes
[17:25:58] probably manually running git-sync-upstream will do it, but can probably wait
[17:34:02] looks to have landed now into /var/lib/git/operations/puppet
[17:39:22] ebernhardson: what are you seeing exactly?
[17:39:40] I still see `631de0a8e55bf1a7ea2206784658caf37a2a57ee` as the HEAD (`elastic: add new certs`)
[17:40:18] ryankemper: all the labs-only patches are rebased on top of origin/production, scroll down a few pages
[17:40:29] ryankemper: past all the [LOCAL HACK] ones
[17:40:57] ah I see
[17:40:58] thanks
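In other words, on the deployment-prep puppetmaster the sync is periodic rather than merge-driven. Something like the following sketches how to nudge it along and confirm a merged change landed; the script name and repo path come from the discussion above, the rest is assumption:

```bash
# On deployment-puppetmaster04: the gitsync timer does this every ~10 minutes anyway.
sudo git-sync-upstream                   # fetch origin/production and rebase the local hacks on top
cd /var/lib/git/operations/puppet
git log --oneline -30                    # [LOCAL HACK] / cherry-picked patches sit at the top...
git log --oneline origin/production -5   # ...so look here for the freshly merged production commits
```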
[17:49:09] ryankemper: another patch that should go out today: https://gerrit.wikimedia.org/r/c/operations/puppet/+/815783
[17:49:21] should
[17:49:42] shouldn't require anything special, from c.white's comment it should happen auto-magically once deployed
[17:50:11] ebernhardson: and it seems like so based off the commit msg, but just to sanity check, I can merge that now right?
[17:50:31] ryankemper: yup
[17:51:30] (merged)
[17:52:09] ebernhardson: and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/815784 needs to be merged after everything's on 7? so a few weeks from now
[17:52:59] ryankemper: yup, that's the plan, drop the template from logstash for now (it will remain in elasticsearch), then when everything is 7.x bring back the template with an updated version
[17:53:32] dropping it from logstash prevents logstash from regularly pushing the template to the update endpoint
[17:58:23] meh, `beta-search-omega` is broken on `deployment-elastic09`, gonna poke around
[18:02:13] might be that beta-search-omega was already borked and so index regeneration didn't properly happen on it, cause I'm seeing the following:
[18:02:16] https://www.irccloud.com/pastebin/ch3X1Ccc/
[18:02:57] damn, my bad, I totally overlooked that we had a 3 cluster setup there...
[18:03:16] I did run foreachwiki tho
[18:03:44] but only double checked index version on the 9200
[18:03:46] port
[18:03:54] sigh...
[18:04:14] dcausse: lots of easy things to overlook in this environment :P how do we rebuild indices in deployment-prep, is it like the normal process but with the deployment prep mwmaint instead?
[18:04:15] hmm, rollback will be harder because chi probably won't want to open the indices that 7.x upgraded when rolled back
[18:04:54] yea i wonder if we can drop the indices from disk, especially if it's only the metastore
[18:05:01] ryankemper: yes these are the same scripts
[18:05:04] not sure if elastic allows that though
[18:05:52] not sure either
[18:06:05] in mediawiki-config we only configure $wgCirrusSearchClusters with search-chi
[18:06:19] for labs?
[18:06:25] yes, in CirrusSearch-labs.php
[18:06:39] hm.. so we could kill these clusters perhaps?
[18:07:54] maybe, checking in shell.php
[18:11:08] hmm, it does end up with MultiClusterAssignment, and testwiki thinks it belongs to omega (same as prod)
[18:12:45] but it ends up in the condition 'If a replica has a single elasticsearch cluster then by definition everything goes there'
[18:13:10] so it looks like even though they have the multi-cluster, since only one cluster is configured it all goes there. That was intended for cloudelastic but seems to have accidentally worked in deployment-prep
[18:13:50] so my estimate would be that omega is unused and the data on disk could be purged to allow it to come up
[18:14:12] does leave the question of how metastore got there ...
[18:15:55] well I'm definitely in favor of blowing away every index in omega and proceeding with upgrading the rest of deployment-prep if we're confident that it won't actually impact anything
[18:16:06] worst case those will be easy to recreate
[18:16:36] what creates the metastore? is that something that mediawiki does?
[18:17:47] most cirrussearch maintenance actions will create it, it holds the `namespace <-> name` lookup (that i think might be unused now that we do utr30 in php) along with some metadata about the git hashes from when a particular index mapping was last updated
[18:18:27] and some data about the per-wiki saneitizer state
[18:20:32] Okay I'm gonna proceed with clearing out the indices in omega
[18:25:40] Okay, indices gone on `beta-search-omega`
[18:27:46] All the indices visible with `_cat/indices` are gone, but still seeing this when I try to restart omega on `deployment-09`:
[18:27:48] https://www.irccloud.com/pastebin/XE2YsLML/
[18:28:08] Message is a bit confusing... does elasticsearch actually store tasks as indices under the hood? that seems... odd...
[18:28:27] Also wondering if `_cat/indices` defaults to not showing indices that begin with `.`
[18:29:40] ryankemper: yes there is an index called .tasks, but it should be a real index on disk
[18:29:55] that's something maintained by elastic, i think the suggested action was to delete it and let it be recreated
[18:30:07] i'm surprised they don't auto-magically handle that
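When the cluster is healthy enough to serve the API, checking for and dropping .tasks is a couple of curl calls; the port below is a placeholder for whichever one the omega instance listens on in deployment-prep. If the node can't even start because of the index (as here), removing it from the data path, which is what ends up happening further down, is the fallback:

```bash
# Sketch, assuming the omega instance answers on this port (placeholder).
OMEGA=localhost:9400
curl -s "$OMEGA/_cat/indices?v"    # .tasks is an ordinary index, so it should be listed here once the cluster can open it
curl -s -XDELETE "$OMEGA/.tasks"   # elastic recreates .tasks on demand the next time the task framework needs it
curl -s "$OMEGA/_cat/health?v"
```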
[18:30:27] ryankemper: I'll be 3 minutes late
[18:30:48] gehel: ack
[18:31:30] ebernhardson: yeah I suspect it's a hidden index because it's not showing up under `_cat/indices`. It looks like I need to set `expand_wildcards=all` perhaps: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/cat-indices.html#cat-indices-api-query-params
[18:31:55] https://www.irccloud.com/pastebin/42Vks6On/
[18:32:17] Oh yeah, I was looking at 7.10 docs but this is 6.8.23
[18:33:05] Well the 6.8.23 docs on `_cat/indices` aren't helpful at all: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/cat-indices.html#cat-indices-api-query-params
[18:33:15] ryankemper: in theory, i think we can rm -rf /srv/elasticsearch/beta-search-omega/nodes/0/indices/zC6MQljlQNSjXL8PjKZ7aA
[18:33:33] not entirely sure what elastic would do, i guess i can quickly test in some docker containers to verify
[18:51:05] elastic seems happy (not blowing up) with simply deleting the indices from the data directory if they complain
[19:06:43] Okay, we blew away everything in `/srv/elasticsearch/beta-search-omega/nodes/0/indices/` and that made omega happy on deployment-09
[19:07:03] Proceeded with the upgrade for `deployment-[10,11]`, so deployment-prep should be fully upgraded now
[19:14:05] Erik and I are looking at some failures in the logs: https://beta-logs.wmcloud.org/goto/bdc24e7781646f77faef7abfdb4082bc
[19:24:13] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/828077
[19:30:13] problem from logs was that deployment-prep was running the heap too full, so elastic was rejecting updates
[19:50:42] we increased heap memory from 2G to 3G for the elasticsearch instances, and created a task to reduce shard counts to 1 for all wikis in the beta cluster, which should also save some memory
[20:37:13] Can confirm we haven't seen any more `CirrusSearchChangeFailed` errors, so the heap fix definitely looks like it worked
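A quick way to sanity check heap pressure like that is the cat nodes API; the column names below exist in current elasticsearch releases, and the port is again whichever instance you're poking at:

```bash
# Sketch: check how full each node's heap is running (per instance/port).
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
# The heap size itself comes from the JVM options (-Xms/-Xmx), which puppet manages here;
# on deployment-prep it was bumped from 2g to 3g per the discussion above.
```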
[20:41:41] About to start upgrades on `codfw`. Gonna do one host first and then switch to 3 hosts at a time if the first run looks good
[21:00:26] all es710 branches merged, CI still looks reasonable and passes in mediawiki/core. One minor issue cropped up in the Elastica extension but easy to fix (a phan suppression that is no longer needed. Not sure why it didn't fail before)
[21:03:49] there are a couple ci failures that happened recently related to cirrus though. monitoring to see if re-checks pass (they potentially only got updates to a few repos but not all)
[21:32:29] * ebernhardson tries to avoid being cheeky in the risky patch template under 'how will it be verified'
[21:32:37] something like 'amount of screaming/sec' sounds reasonable :P
[21:40:17] :P