[09:15:53] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (10BTullis) @JMeybohm - something to note, when setting up the new dse-k8s-ctrl100[1-2] servers recently under T310172, I observed a rac...
[09:41:02] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (10JMeybohm) >>! In T300744#8177252, @BTullis wrote: > @JMeybohm - something to note, when setting up the new dse-k8s-ctrl100[1-2] serve...
[09:52:42] 10serviceops, 10Prod-Kubernetes: Clean up puppet from profile::docker::storage - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[09:53:00] 10serviceops, 10Prod-Kubernetes: Clean up puppet from profile::docker::storage - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[09:53:03] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[09:53:21] 10serviceops, 10Prod-Kubernetes: Clean up profile::docker::storage from puppet - https://phabricator.wikimedia.org/T315977 (10JMeybohm)
[11:27:31] begging review on this if anyone has some spare time https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/813613 :)
[11:27:52] might have to change things around in future as we restructure the thumbor changes made recently but the image will stay the same
[11:34:30] rzl: hi there! this patch still has an open question and doesn't seem to be moving. Do you think you could take a look? https://gerrit.wikimedia.org/r/c/operations/puppet/+/820749
[11:35:34] hnowlan: I'll take that as an excuse to read up on blubber... 😇
[11:35:47] ...after lunch
[11:36:57] jayme: thanks! I can talk you through it if you fancy but it's pretty simple stuff
[11:37:18] I'll try to RTFM first, thanks :)
[11:58:54] 10serviceops, 10Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) I think that eqiad is left in a stale state and we never deleted the backup schema after an issue we had with the import. On maps1009: ` schema_name | pg_size_pretty ----------------...
[12:04:54] 10serviceops, 10Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (10Jgiannelos) Regarding SRE support, planet import is probably the most common task we do for maps maintenance so I think it might be a good opportunity for folks that are or will be involved with maps oper...
[13:13:54] 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (10Jdforrester-WMF)
[13:14:25] 10serviceops: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (10Jdforrester-WMF)
[13:29:39] 10serviceops, 10SRE, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10Dzahn) @RLazarus @joe Just saw this again in the history after a while. re: https://config-master.wikimedia.org/pybal/eqiad/api-https My suggestion was to set **mw1307 through m...
[13:39:16] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Dzahn) Just ran across this while looking through Phabricator history. It seems all subtasks are resolved. What is left to do here? Is decom'ing scandium part of this (at so...
[13:51:29] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10ssastry) No! :) scandium is now Parsoid/PHP. AFAICT, I think this task can be resolved. But, you all should evaluate in case there are any lingering Parsoid/JS things lying a...
[13:51:33] 10serviceops, 10Sustainability (Incident Followup): mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (10Krinkle)
[13:52:41] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Roll out remote-DC gutter pool for /*/mw-wan/ - https://phabricator.wikimedia.org/T258779 (10Krinkle)
[13:54:22] 10serviceops, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Krinkle)
[14:01:31] 10serviceops, 10Parsoid, 10RESTBase, 10Patch-For-Review: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Dzahn) @ssastry Alright, thanks. I found a small remnant.
[14:14:21] 10serviceops, 10MediaWiki-libs-Rdbms, 10Performance-Team, 10Platform Engineering, and 3 others: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle)
[14:15:35] 10serviceops, 10Citoid, 10SRE, 10Patch-For-Review: Create a readiness probe for zotero - https://phabricator.wikimedia.org/T213689 (10akosiaris) 05Open→03Resolved a:03akosiaris This has been done in https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/774848 and overall seems to work fine (as...
[14:16:55] 10serviceops, 10Data-Engineering, 10Event-Platform Value Stream, 10SRE: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10akosiaris) Hi @Ottomata, @JArguello-WMF /me is back. Any updates on this one (even if just a rough timeline)? Anything we can help...
[14:17:31] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10akosiaris) @JMeybohm anything left to do here?
[14:22:14] 10serviceops, 10Prod-Kubernetes, 10SRE, 10Kubernetes: Convert helm releases to the new release naming schema - https://phabricator.wikimedia.org/T277849 (10JMeybohm) >>! In T277849#8178377, @akosiaris wrote: > @JMeybohm anything left to do here? Yeah. We did not do that during the helm3 migration. Maybe i...
[15:50:05] jnuche: sorry, just getting online now here in UTC-7 :) I want to make sure _joe_'s concerns are resolved but let me see if I can get things moving
[15:52:10] jnuche: ah, he's out of office one more day for a personal situation -- if he gets back to you tomorrow, will that work?
[15:53:10] rlz: of course, it's not urgent, just been waiting for a while now. Thanks!
[15:56:55] thanks for checking in!
[16:29:56] 10serviceops, 10Performance-Team, 10SRE: Clean up testwiki experiments (Aug 2022) - https://phabricator.wikimedia.org/T314750 (10Krinkle)
[16:31:14] 10serviceops, 10SRE, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10RLazarus) That sounds right to me; it would give us the same distribution as codfw, which is probably as much work as we need to do on this. I don't think it's worth investing time...
[16:43:49] While I'm already abusing yizzer collective charity, I have another review beg of significantly greater magnitude https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/823143
[19:38:21] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Liz) I have been told, probably repeatedly, that Special Pages are generated automatically every 3 days but the ones I regularly use https://en.wikipedia.org/wiki...
[19:40:12] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn) 05Open→03Resolved a:03Dzahn @Yaron_Koren https://phabricator.wiki...
[19:40:53] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10Dzahn)
[19:46:08] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) The automatic update is a systemd timer (formerly 'cron job') that is running on the mw maintenance server. `mediawiki_job_update_special_pages.service` i...
[19:47:08] is it normal that update_special_pages on all wikis takes more than 3 days?
[19:47:40] or does it mean the mw maint job crashed somehow if it's running since the 19th
[19:56:57] mutante: afaik yes
[19:57:03] It's normal
[19:57:47] yeah, I think it sometimes has slightly overrun, right now doing wikidata for the aug 19 run
[19:58:06] there are open patches to make it per-shard so it's more reliable for finishing on time
[19:58:29] thanks RhinosF1 and taavi
[20:00:15] I was commented on https://phabricator.wikimedia.org/T307314#8179490 that it's simply still running
[20:00:18] since the 19th
[20:00:24] commenting
[20:02:17] just..it seems to be a problem if it runs every 3 days and takes more than 3 days
[20:03:06] analyzing the systemd timer tells me it has 1d 9h left to complete
[20:04:54] "on the 1st of the month and every 3 days after that"
[20:06:30] that probably means that the next trigger is in 1d9h, not that it will finish then?
[20:06:49] but >4d is definitely unusual, I haven't seen that before
[20:06:52] yea, exactly
[20:07:07] I meant it has 1d and 9h left to be done before it will try to start it again
[20:07:13] the few times I looked into this before it was just slightly over 3d
[20:07:19] unless we reduce it to just run every 4 or 6 days
[20:07:46] # TODO: Instead of flock, make this unit run every N time units after it's finished.
[20:07:49] command => 'flock -n /var/lock/update-special-pages /usr/local/bin/foreachwiki updateSpecialPages.php',
[20:08:20] that TODO is a side-note
[20:10:51] that'd be problematic since people rely on the pages for being (at least somewhat) up-to-date
[20:10:53] from the status I can tell it's at wikidatawiki and since it's foreachwiki it still has everything after wikidatawiki in https://noc.wikimedia.org/conf/highlight.php?file=dblists/all.dblist
[20:11:36] gut feeling is that wikidata takes the longest to update
[20:12:02] we can give it 24 more hours though
[20:12:31] I wonder if Amir1's db config reloading patches have made it slower
[20:13:07] I feel like once it finally gets beyond wikidatawiki the rest are small and hopefully that isn't much then
[20:13:13] taavi: It's practically noop given that the config is not applied there
[20:13:31] and also it shouldn't have any calls to reconfigure() in those special pages
[20:15:27] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Ladsgroup) The underlying problem is that these scripts when getting updated do extremely expensive queries, like full table scans on tables with billions of rows...
[20:17:22] last thing in status log is: wikidatawiki: DisambiguationPageLinks [QueryPage] got 0 rows in 0.01s. and that was like half an hour ago
[20:18:42] mutante: if you can review and get this deployed, it'll help a bit https://gerrit.wikimedia.org/r/804788
[20:19:43] ok. the log is: root@mwmaint1002:/var/log/mediawiki# tail -f mediawiki_job_update_special_pages/syslog.log
[20:19:46] looking
[20:20:44] Amir1: on it
[20:24:53] thanks
[20:27:51] sharded periodic jobs..aah
[20:30:49] going to deploy the change
[20:31:20] also existing job is now at wikimania wikis and it's as expected.. once wikidata was over it's now moving fast.. as if it's going to be done in minutes
[20:34:21] xmfwiki.. yiwiki...yiwikisource.. yowiki...zawiki.. zh_min_nanwiki.
[20:48:05] waits for it to be finished to _then_ make the change.. zhwiki...
[21:03:53] 10serviceops, 10Diffusion, 10Gerrit, 10serviceops-collab, 10Patch-For-Review: Gerrit replication to codfw (gerrit-replica.wikimedia.org) stopped working after Gerrit 3.4.5 upgrade - https://phabricator.wikimedia.org/T315942 (10QChris) Great debugging @Dzahn! As there was talk about mitigating the MINA u...
[21:09:28] well, I merged the change that uses shared_periodic_jobs before shared_periodic_jobs existed.. but I will amend / bold / merge.. in a moment
[21:26:31] 10serviceops, 10DBA, 10WMF-General-or-Unknown, 10Patch-For-Review: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) @Liz the existing job is super close to being finished.. it's currently at zhwikisource going alphabetically. Also we are about to switch this to multiple...
[22:14:58] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) deployed first https://gerrit.wikimedia.org/r/804788 and then https://gerrit.wikimedia.org/r/c/operations/puppet/+/804800 on mwmaint2002/1002 by @Legoktm from puppet point of...
[22:42:02] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) @Liz and all: We now have one job per shard, the command lines are: ` [mwmaint1002:~] $ for shard in 1 2 3 4 5 6 7 8 11; do grep Exec /lib/systemd/system/mediawiki_job_update...
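For context on the exchange above: "on the 1st of the month and every 3 days after that" corresponds to a systemd calendar expression, and because the command is wrapped in non-blocking flock -n, an overlong run simply makes the next scheduled invocation exit immediately instead of a second copy starting. A minimal sketch, assuming a calendar spec of this shape and the lock path quoted above; the production unit may differ:

# Check when a calendar spec of this shape would next fire
# (1st of the month and every 3 days after, at 05:00)
systemd-analyze calendar '*-*-01/3 05:00:00'

# flock -n exits right away if the lock is still held, so a run that
# overshoots its 3-day window causes that cycle to be skipped rather
# than two copies of the job running at once
flock -n /var/lock/update-special-pages \
    /usr/local/bin/foreachwiki updateSpecialPages.php \
    || echo "previous update_special_pages run still in progress, skipping"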
[22:46:44] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Legoktm) And just to spell it out, enwiki is the only wiki in s1, so it should never miss a run because some other wiki has delayed it.
[22:56:31] 10serviceops, 10DBA, 10WMF-General-or-Unknown: Lag in updating Special Pages? - https://phabricator.wikimedia.org/T307314 (10Dzahn) old (flock) processes killed. now we are just waiting for the new jobs to run. the scheduled time for s1 (enwiki) is: `Trigger: Thu 2022-08-25 05:00:00 UTC; 1 day 6h left`
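A hedged sketch of what the per-shard split amounts to, based on the shard list quoted above (core DB sections 1-8 plus 11): each section gets its own timer, lock file and foreachwikiindblist invocation, so a slow section such as s8 (wikidatawiki) can no longer hold up s1 (enwiki). The unit layout, dblist argument form and lock path below are assumptions, not the exact production values:

# Hypothetical shape of a single shard's job; each of s1-s8 and s11
# gets an analogous systemd service and timer so the sections run and
# overrun independently of each other
flock -n /var/lock/update-special-pages-s1 \
    /usr/local/bin/foreachwikiindblist s1 updateSpecialPages.php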