[09:15:53] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (10BTullis) @JMeybohm - something to note, when setting up the new dse-k8s-ctrl100[1-2] servers recently under T310172, I observed a rac...
[09:41:02] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (10JMeybohm) >>! In T300744#8177252, @BTullis wrote: > @JMeybohm - something to note, when setting up the new dse-k8s-ctrl100[1-2] serve...
[09:52:42] 10serviceops, 10Prod-Kubernetes: Clean up puppet from profile::docker::storage - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[09:53:00] 10serviceops, 10Prod-Kubernetes: Clean up puppet from profile::docker::storage - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[09:53:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[09:53:21] 10serviceops, 10Prod-Kubernetes: Clean up profile::docker::storage from puppet - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[11:27:31] begging review on this if anyone has some spare time https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/813613 :)
[11:27:52] might have to change things around in future as we restructure the thumbor changes made recently but the image will stay the same
[11:34:30] rzl: hi there! this patch still has an open question and doesn't seem to be moving. Do you think you could take a look? https://gerrit.wikimedia.org/r/c/operations/puppet/+/820749
[11:35:34] hnowlan: I'll take that as an excuse to read up on blubber... 😇
[11:35:47] ...after lunch
[11:36:57] jayme: thanks! I can talk you through it if you fancy but it's pretty simple stuff
[11:37:18] I'll try to RTFM first, thanks :)
[11:58:54] 10serviceops, 10Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) I think that eqiad is left in a stale state and we never deleted the backup schema after an issue we had with the import. On maps1009: ` schema_name | pg_size_pretty ----------------...
[12:04:54] 10serviceops, 10Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) Regarding SRE support, planet import is probably the most common task we do for maps maintenance so I think it might be a good opportunity for folks that are or will be involved with maps oper...
[13:13:54] 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (10Jdforrester-WMF)
[13:14:25] 10serviceops: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (10Jdforrester-WMF)
[13:29:39] 10serviceops, 10SRE, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10Dzahn) @RLazarus @joe Just saw this again in the history after a while. re: https://config-master.wikimedia.org/pybal/eqiad/api-https My suggestion was to set **mw1307 through m...
[13:39:16] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Dzahn) Just ran across this while looking through Phabricator history. It seems all subtasks are resolved. What is left to do here? Is decom'ing scandium part of this (at so...
[13:51:29] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10ssastry) No! :) scandium is now Parsoid/PHP. AFAICT, I think this task can be resolved. But, you all should evaluate in case there are any lingering Parsoid/JS things lying a...
[13:51:33] 10serviceops, 10Sustainability (Incident Followup): mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (10Krinkle)
[13:52:41] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Roll out remote-DC gutter pool for /*/mw-wan/ - https://phabricator.wikimedia.org/T258779 (10Krinkle)
[13:54:22] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Krinkle)
[14:01:31] 10serviceops, 10Parsoid, 10RESTBase, 10Patch-For-Review: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Dzahn) @ssastry Alright, thanks. I found a small remnant.
[14:14:21] 10serviceops, 10MediaWiki-libs-Rdbms, 10Performance-Team, 10Platform Engineering, and 3 others: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle)
[14:15:35] 10serviceops, 10Citoid, 10SRE, 10Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (10akosiaris) 05Open→03Resolved a:03akosiaris This has been done in https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/774848 and overall seems to work fine (as...
[14:16:55] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10akosiaris) Hi @Ottomata, @JArguello-WMF /me is back. Any updates on this one (even if just a rough timeline)? Anything we can help...
[14:17:31] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10akosiaris) @JMeybohm anything left to do here?
[14:22:14] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10JMeybohm) >>! In T277849#8178377, @akosiaris wrote: > @JMeybohm anything left to do here? Yeah. We did not do that during the helm3 migration. Maybe i...
[15:50:05] jnuche: sorry, just getting online now here in UTC-7 :) I want to make sure _joe_'s concerns are resolved but let me see if I can get things moving
[15:52:10] jnuche: ah, he's out of office one more day for a personal situation -- if he gets back to you tomorrow, will that work?
[15:53:10] rlz: of course, it's not urgent, just been waiting for a while now. Thanks!
[15:56:55] thanks for checking in!
[16:29:56] 10serviceops, 10Performance-Team, 10SRE: Clean up testwiki experiments (Aug 2022) - https://phabricator.wikimedia.org/T314750 (10Krinkle)
[16:31:14] 10serviceops, 10SRE, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10RLazarus) That sounds right to me; it would give us the same distribution as codfw, which is probably as much work as we need to do on this. I don't think it's worth investing time...
[16:43:49] While I'm already abusing yizzer collective charity, I have another review beg of significantly greater magnitude https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/823143
[19:38:21] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Liz) I have been told, probably repeatedly, that Special Pages are generated automatically every 3 days but the ones I regularly use https://en.wikipedia.org/wiki...
[19:40:12] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) 05Open→03Resolved a:03Dzahn @Yaron_Koren https://phabricator.wiki...
[19:40:53] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn)
[19:46:08] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) The automatic update is a systemd timer (formerly 'cron job') that is running on the mw maintenance server. `mediawiki_job_update_special_pages.service` i...
[19:47:08] is it normal that update_special_pages on all wikis takes more than 3 days?
[19:47:40] or does it mean the mw maint job crashed somehow if it's running since the 19th
[19:56:57] mutante: afaik yes
[19:57:03] It's normal
[19:57:47] yeah, I think it sometimes has slightly overrun, right now doing wikidata for the aug 19 run
[19:58:06] there are open patches to make it per-shard so it's more reliable for finishing on time
[19:58:29] thanks RhinosF1 and taavi
[20:00:15] I was commented on https://phabricator.wikimedia.org/T307314#8179490 that it's simply still running
[20:00:18] since the 19th
[20:00:24] commenting
[20:02:17] just..it seems to be a problem if it runs every 3 days and takes more than 3 days
[20:03:06] analyzing the systemd timer tells me it has 1d 9h left to complete
[20:04:54] "on the 1st of the month and every 3 days after that"
[20:06:30] that probably means that the next trigger is in 1d9h, not that it will finish then?
[20:06:49] but >4d is definitely unusual, I haven't seen that before
[20:06:52] yea, exactly
[20:07:07] I meant it has 1d and 9h left to be done before it will try to start it again
[20:07:13] the few times I looked into this before it was just slightly over 3d
[20:07:19] unless we reduce it to just run every 4 or 6 days
[20:07:46] # TODO: Instead of flock, make this unit run every N time units after it's finished.
[20:07:49] command => 'flock -n /var/lock/update-special-pages /usr/local/bin/foreachwiki updateSpecialPages.php',
[20:08:20] that TODO is a side-note
[20:10:51] that'd be problematic since people rely on the pages for being (at least somewhat) up-to-date
[20:10:53] from the status I can tell it's at wikidatawiki and since it's foreachwiki it still has everything after wikidatawiki in https://noc.wikimedia.org/conf/highlight.php?file=dblists/all.dblist
[20:11:36] gut feeling is that wikidata takes the longest to update
[20:12:02] we can give it 24 more hours though
[20:12:31] I wonder if Amir1's db config reloading patches have made it slower
[20:13:07] I feel like once it finally gets beyond wikidatawiki the rest are small and hopefully that isn't much then
[20:13:13] taavi: It's practically noop given that the config is not applied there
[20:13:31] and also it shouldn't have any calls to reconfigure() in those special pages
[20:15:27] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Ladsgroup) The underlying problem is that these scripts when getting updated do extremely expensive queries, like full table scans on tables with billions of rows...
[20:17:22] last thing in status log is: wikidatawiki: DisambiguationPageLinks [QueryPage] got 0 rows in 0.01s. and that was like half an hour ago
[20:18:42] mutante: if you can review and get this deployed, it'll help a bit https://gerrit.wikimedia.org/r/804788
[20:19:43] ok. the log is: root@mwmaint1002:/var/log/mediawiki# tail -f mediawiki_job_update_special_pages/syslog.log
[20:19:46] looking
[20:20:44] Amir1: on it
[20:24:53] thanks
[20:27:51] sharded periodic jobs..aah
[20:30:49] going to deploy the change
[20:31:20] also existing job is now at wikimania wikis and it's as expected.. once wikidata was over it's now moving fast.. as if it's going to be done in minutes
[20:34:21] xmfwiki.. yiwiki...yiwikisource.. yowiki...zawiki.. zh_min_nanwiki.
[20:48:05] waits for it to be finished to _then_ make the change.. zhwiki...
[21:03:53] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10QChris) Great debugging @Dzahn! As there was talk about mitigating the MINA u...
[21:09:28] well, I merged the change that uses shared_periodic_jobs before shared_periodic_jobs existed.. but I will amend / bold / merge.. in a moment
[21:26:31] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) @Liz the existing job is super close to being finished.. it's currently at zhwikisource going alphabetically. Also we are about to switch this to multiple...
[22:14:58] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) deployed first https://gerrit.wikimedia.org/r/804788 and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/804800 on mwmaint2002/1002 by @Legoktm from puppet point of...
[22:42:02] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) @Liz and all: We now have one job per shard, the command lines are: ` [mwmaint1002:~] $ for shard in 1 2 3 4 5 6 7 8 11; do grep Exec /lib/systemd/system/mediawiki_job_update...
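For context on the exchange above: "on the 1st of the month and every 3 days after that" corresponds to a systemd calendar expression, and because the command is wrapped in non-blocking flock -n, an overlong run simply makes the next scheduled invocation exit immediately instead of a second copy starting. A minimal sketch, assuming a calendar spec of this shape and the lock path quoted above; the production unit may differ:

# Check when a calendar spec of this shape would next fire
# (1st of the month and every 3 days after, at 05:00)
systemd-analyze calendar '*-*-01/3 05:00:00'

# flock -n exits right away if the lock is still held, so a run that
# overshoots its 3-day window causes that cycle to be skipped rather
# than two copies of the job running at once
flock -n /var/lock/update-special-pages \
    /usr/local/bin/foreachwiki updateSpecialPages.php \
    || echo "previous update_special_pages run still in progress, skipping"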
[22:46:44] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) And just to spell it out, enwiki is the only wiki in s1, so it should never miss a run because some other wiki has delayed it.
[22:56:31] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) old (flock) processes killed. now we are just waiting for the new jobs to run. the scheduled time for s1 (enwiki) is: `Trigger: Thu 2022-08-25 05:00:00 UTC; 1 day 6h left`
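A hedged sketch of what the per-shard split amounts to, based on the shard list quoted above (core DB sections 1-8 plus 11): each section gets its own timer, lock file and foreachwikiindblist invocation, so a slow section such as s8 (wikidatawiki) can no longer hold up s1 (enwiki). The unit layout, dblist argument form and lock path below are assumptions, not the exact production values:

# Hypothetical shape of a single shard's job; each of s1-s8 and s11
# gets an analogous systemd service and timer so the sections run and
# overrun independently of each other
flock -n /var/lock/update-special-pages-s1 \
    /usr/local/bin/foreachwikiindblist s1 updateSpecialPages.php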