[01:50:22] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Quiddity)
[02:25:16] <wikibugs>	 serviceops, MW-on-K8s, SRE, Traffic, Patch-For-Review: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Krinkle)
[06:51:14] <wikibugs>	 serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) Ope...
[07:47:58] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:23:54] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar)
[08:23:58] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar) Checking on the server, there is a mixup with recently uploaded documentation owned by...
[08:26:27] <hashar>	 good morning, the doc.wikimedia.org hosts require a manual recursive chown after a system user had its uid/gid changed to a reserved one.  https://phabricator.wikimedia.org/T333294#8732640
[08:26:53] <hashar>	 namely on `P:doc` hosts a root needs to `chown -R doc-uploader:doc-uploader /srv/doc`, that got forgotten yesterday
[08:30:46] <claime>	 hashar: on it
[08:32:07] <claime>	 hashar: I agree puppet should not do the chown on its own, but if there's any more drift after that one time fix, we should probably enforce.
[08:33:11] <hashar>	 claime: that would kill puppet for sure given there are millions of files under that directory :]
[08:33:40] <hashar>	 I think it is really a one time fix, I suspect the uid got reserved/fixed to a known value to work around an issue with our rsync modules 
[08:33:46] <claime>	 ack
[08:33:54] <claime>	 If it doesn't drift again, no problem
[08:34:09] <hashar>	 (namely rsync does not run as root and the modules have use_chroot=yes, so it is unable to do the name > uid mapping)
[08:34:26] <claime>	 If we end up having to do it again, probably a systemd.timer with a chown can be a good idea
[08:34:34] <claime>	 cumin run done
[08:34:39] <hashar>	 so having the uid hard set to be the same on all hosts ensures rsync "works" ;)
[08:34:45] <hashar>	 great
[08:35:06] <hashar>	 hopefully `P:doc` matched all four hosts :]
[08:35:12] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (Clement_Goubert) ` cgoubert@cumin1001:~$ sudo cumin 'P:doc' 'chown -R doc-uploader:doc-uploade...
[08:35:19] <claime>	 It did
[08:37:08] <wikibugs>	 serviceops, Data-Persistence, SRE, Traffic-Icebox: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (Marostegui)
[08:44:44] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar) Open→Resolved a:hashar That has fixed the publishing of the MediaWiki docu...
[08:45:04] <hashar>	 claime: fix confirmed. Merci beaucoup!
[08:45:14] <claime>	 hashar: Cool :)
[08:54:02] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Migrate charts away from deprecated typology annotations - https://phabricator.wikimedia.org/T325066 (JMeybohm) @ayounsi deployed the calico update to all clusters, thanks!
[09:23:47] <wikibugs>	 serviceops, Kubernetes: Set  scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Clement_Goubert) Open→In progress p:Triage→Medium
[09:27:31] <elukey>	 folks I am going to upgrade another kafka main node to bullseye via dist-upgrade (like yesterday)
[09:28:20] <wikibugs>	 serviceops, Thumbor, Kubernetes: Investigate whether configuring hardware P-states would help with performance on k8s - https://phabricator.wikimedia.org/T333317 (kamila)
[09:31:50] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Joe) Sorry, I'm getting confused; to my understanding,  WDQS...
[09:35:52] <wikibugs>	 serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (Jgiannelos)
[09:35:56] <wikibugs>	 serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (jijiki)
[09:37:47] <wikibugs>	 serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Detailed list of commands:  ` disable-puppet "elukey - prep for dist-upgrade"  sudo sed -e 's/debian-security buster\/updates/debian-security bullseye-security/g' /etc/apt/sources.list -i sudo sed -e 's/buster/bul...
[09:38:50] <wikibugs>	 serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (Jgiannelos)
[09:46:08] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (dcausse) >>! In T330507#8732991, @Joe wrote: > Sorry, I'm ge...
[09:51:56] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[09:55:01] <elukey>	 I created a script to automate the dist-upgrade: https://phabricator.wikimedia.org/T332013#8733165
[09:55:08] <elukey>	 will try it with the next node
[09:59:44] <hnowlan>	 akosiaris: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/899654/ does this line up with the stuff we discussed last week as regards increasing thumbor capacity/limits? 
[10:00:17] * akosiaris looking
[10:02:56] <akosiaris>	 wait, I think I just spotted a config error 
[10:07:02] <elukey>	 kafka-main1001 up and running with bullseye (also rebooted in the new kernel)
[10:07:11] <elukey>	 this time the interface renaming didn't bite me :)
[10:07:27] <akosiaris>	 hnowlan: leaving comments on the change, but I had to first submit this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903614
[10:11:01] <hnowlan>	 akosiaris: wow, oh dear. 
[10:11:06] <hnowlan>	 glad we always pooled eqiad 
[10:14:57] <wikibugs>	 serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, wdwb-tech: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE) @Joe, @dcausse, Thank you for your advice...
[10:23:39] <wikibugs>	 serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 2 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (dcausse) @ItamarWMDE once https://gerrit.wikimedia.org/...
[10:27:05] <hnowlan>	 akosiaris: just noticed on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902066 you moved the mw-web namespace change to mw-debug, I assume that was a typo
[10:28:44] <akosiaris>	 hnowlan: there's a followup fixing it
[10:28:52] <akosiaris>	 I noticed it late yesterday :-(
[10:29:56] <hnowlan>	 ahh ok
[10:42:10] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) @ayounsi thanks for the response.  Overall I've no objection so let's proceed.  I agree in terms of a...
[11:04:08] <wikibugs>	 serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar)
[11:28:12] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[11:30:34] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fnegri) I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/...
[11:49:40] <elukey>	 any objection if I complete the bullseye upgrade in kafka main eqiad?
[11:49:43] <elukey>	 one node left
[11:49:55] <elukey>	 (maintenance is in one hour, I should be done in less)
[11:55:48] * elukey proceeds
[12:15:01] <_joe_>	 oh noes!
[12:15:09] <_joe_>	 elukey: <3
[12:16:24] <elukey>	 :)
[12:16:47] <elukey>	 way easier with the script, I should be able to do the other 3 nodes by the end of week
[12:25:50] <elukey>	 done!
[12:25:56] <elukey>	 kafka-main eqiad is on bullseye
[12:27:19] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) Oh, I misunderstood, I thought that WDQS updater w...
[12:29:15] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[12:33:37] <wikibugs>	 serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Kafka main eqiad moved to bullseye, next steps:  * reimage kafka-main200[1-3] using the dist-upgrade procedure outlined above.
[12:37:30] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (dcausse) >>! In T330507#8734369, @Ottomata wrote: > Oh, I mi...
[12:50:45] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[12:58:29] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switch...
[13:02:39] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:06:03] <hnowlan>	 looks like k8s doesn't like the thumbor scale-up even with the lower request in place: "0/22 nodes are available: 16 Insufficient cpu"
[13:18:00] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switch...
[13:20:46] <akosiaris>	 hnowlan: Hmmm, we actually don't have a lower request now that I look at the change again
[13:21:33] <akosiaris>	 let me hotpatch and try something
[13:21:44] <akosiaris>	 thumbor page
[13:22:54] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[13:35:55] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:41:09] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[13:44:23] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (herron)
[13:46:11] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[13:49:45] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 h...
[13:54:07] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jbond)
[13:55:47] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MatthewVernon)
[13:56:35] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:59:35] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[14:02:02] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jbond)
[14:23:50] <hnowlan>	 lots of takeaways from this for thumbor-k8s, but main one that sticks out atm is that thumbor-k8s in codfw is currently serving as many qps as thumbor-metal was in codfw before the maintenance 
[14:24:19] <claime>	 hnowlan: that's surprisingly good
[14:25:07] <hnowlan>	 yeah, encouraging 
[14:26:18] <hnowlan>	 however, performance is terribad as expected https://grafana.wikimedia.org/goto/LLmziyf4z?orgId=1 vs https://grafana.wikimedia.org/goto/xssziyB4z?orgId=1 
[14:26:38] <hnowlan>	 but I'll take slow and no errors as a point to iterate from 
[14:26:57] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[14:27:42] <_joe_>	 ugh that's 4 to 5x slower for normal workloads
[14:27:46] <_joe_>	 it's indeed quite bad
[14:32:52] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches...
[14:46:09] <wikibugs>	 serviceops, Kubernetes: Set  scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Clement_Goubert) ` cgoubert@cumin1001:~$ sudo cumin 'P{C:cpufrequtils} and P{P:kubernetes::node} and P{F:is_virtual = false}' "cpufreq-info -p | awk '{print \$3}'" 62 hosts will...
[14:48:03] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches...
[14:50:30] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[14:54:06] <hnowlan>	 so current state of affairs is now that we're back to normal capacity in thumbor - thumbor and thumbor-k8s are serving traffic 50/50 in codfw
[14:54:18] <hnowlan>	 I'd like to try to get to that state in eqiad also 
[14:54:29] <hnowlan>	 not to leave overnight or anything but just to see how we do in eqiad also 
[14:54:37] <hnowlan>	 Scaling up gradually 
[14:55:17] <akosiaris>	 👍
[14:55:39] <claime>	 ack
[14:59:12] <wikibugs>	 serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) @Joe we discussed the use of page_content_change i...
[14:59:59] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) The switch upgrade itself went smoothly as well, like the other rows.  One issue was that gerrit1001 was missing from the list. This is bec...
[15:20:48] <hnowlan>	 we're moderately happily doing 50/50 in both datacentres
[15:21:02] <claime>	 hnowlan: how moderately?
[15:21:04] <claime>	 :P
[15:21:15] <hnowlan>	 think sad clown doing happy tricks 
[15:21:30] <hnowlan>	 for my next trick I'd like to briefly see if k8s can handle codfw on its own 
[15:21:40] <hnowlan>	 (I am fairly sure it can) 
[15:22:22] <claime>	 I'm wondering why processing times in k8s are about double in codfw compared to eqiad
[15:22:35] <claime>	 Although that may just be an artifact of when it was pooled, etc.
[15:23:00] <claime>	 wait no I'm reading it wrong
[15:24:06] <claime>	 yeah no, it depends on the type, which probably means too early to tell
[15:24:14] <hnowlan>	 yeah most likely 
[15:24:43] <hnowlan>	 the time distribution of requests for some of the wackier formats will look weird also
[15:37:05] <hnowlan>	 alright, gonna try doing k8s-only in codfw for a little bit and then I'll set all k8s thumbor workers to inactive
[15:49:38] <hnowlan>	 tbh over time those processing graphs don't look so bad when you consider the averages for things like imagemagick and ghostscript 
[15:49:52] <hnowlan>	 djvu is particularly bad though, gif isn't great either. They're the two most expensive though
[15:50:09] <hnowlan>	 anyway, experiment over, putting things back to metal
[16:35:18] <wikibugs>	 serviceops, MW-on-K8s, Scap: Error: failed to download "wmf-stable/mediawiki" when deploying to MW-on-K8s - https://phabricator.wikimedia.org/T333382 (jnuche) In case it's useful, list of arguments helmfile was called with: `   1: diff (4 bytes)   2: upgrade (7 bytes)   3: --reset-values (14 bytes)...
[17:10:44] <wikibugs>	 serviceops, MW-on-K8s, Scap: Error: failed to download "wmf-stable/mediawiki" when deploying to MW-on-K8s - https://phabricator.wikimedia.org/T333382 (dancy) The error message means that accessing https://helm-charts.wikimedia.org/stable/.... failed.
[20:53:37] <wikibugs>	 serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[21:43:10] <wikibugs>	 serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (dduvall) Thank you, @JMeybohm. That's very helpful.  It does seem like the `...