[01:50:22] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Quiddity)
[02:25:16] serviceops, MW-on-K8s, SRE, Traffic, Patch-For-Review: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Krinkle)
[06:51:14] serviceops, SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) Ope...
[07:47:58] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:23:54] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar)
[08:23:58] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar) Checking on the server, there is a mixup with recently uploaded documentation owned by...
[08:26:27] good morning, the doc.wikimedia.org hosts requires a manual recursively chown after a system user had its uid/gid changed to a reserved one. https://phabricator.wikimedia.org/T333294#8732640
[08:26:53] namely on `P:doc` hosts a root need to `chown -R doc-uploader:doc-uploader /srv/doc` , that got forgotten yesterday
[08:30:46] hashar: on it
[08:32:07] hashar: I agree puppet should not do the chown on its own, but if there's any more drift after that one time fix, we should probably enforce.
[08:33:11] claime: that would kill puppet for sure given there are millions of files under that directory :]
[08:33:40] I think it is really a one time fix, I suspect the uid got reserved/fixed to a known value to work around an issue with our rsync modules
[08:33:46] ack
[08:33:54] If it doesn't drift again, no problem
[08:34:09] (namely rsync do not run as root and the modules have use_chroot=yes, so it is unable to do the name > uid mapping)
[08:34:26] If we end up having to do it again, probably a systemd.timer with a chown can be a good idea
[08:34:34] cumin run done
[08:34:39] so having the uid hard set to be the same on all hosts ensure rsync "works" ;)
[08:34:45] great
[08:35:06] hopefully `P:doc` matched all four hosts :]
[08:35:12] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (Clement_Goubert) ` cgoubert@cumin1001:~$ sudo cumin 'P:doc' 'chown -R doc-uploader:doc-uploade...
[08:35:19] It did
[08:37:08] serviceops, Data-Persistence, SRE, Traffic-Icebox: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (Marostegui)
[08:44:44] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar) Open→Resolved a:hashar That has fixed the publishing of the MediaWiki docu...
[08:45:04] claime: fix confirmed. Merci beaucoup!
[08:45:14] hashar: Cool :)
[08:54:02] serviceops, Prod-Kubernetes, Kubernetes: Migrate charts away from deprecated typology annotations - https://phabricator.wikimedia.org/T325066 (JMeybohm) @ayounsi deployed the calico update to all clusters, thanks!
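On the /srv/doc ownership fix discussed above (08:26 to 08:45): before re-running a recursive chown or wiring it into a timer, it can help to check whether ownership has actually drifted again. The following is only a minimal sketch of such a check. The function name `find_ownership_drift` is hypothetical and is not part of cumin, puppet, or any tooling mentioned in the log; it only reports paths and changes nothing.

```python
import os


def find_ownership_drift(root, expected_uid, expected_gid):
    """Return paths under root whose uid or gid differs from the expected ids.

    Hypothetical helper for illustration only; it reads metadata with lstat
    and never modifies ownership.
    """
    drifted = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk does not descend into symlinked directories, so check the
        # symlinks themselves alongside the regular files.
        links = [d for d in dirnames if os.path.islink(os.path.join(dirpath, d))]
        candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames + links]
        for path in candidates:
            st = os.lstat(path)  # lstat: inspect the entry itself, not its target
            if st.st_uid != expected_uid or st.st_gid != expected_gid:
                drifted.append(path)
    return drifted
```

A caller could resolve the expected ids with `pwd.getpwnam("doc-uploader")` and alert only when the returned list is non-empty. With millions of files the walk is itself expensive, so this is something to run occasionally rather than on every puppet run.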
[09:23:47] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Clement_Goubert) Open→In progress p:Triage→Medium
[09:27:31] folks I am going to upgrade another kafka main node to bullseye via dist-upgrade (like yesterday)
[09:28:20] serviceops, Thumbor, Kubernetes: Investigate whether configuring hardware P-states would help with performance on k8s - https://phabricator.wikimedia.org/T333317 (kamila)
[09:31:50] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Joe) Sorry, I'm getting confused; to my understanding, WDQS...
[09:35:52] serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (Jgiannelos)
[09:35:56] serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (jijiki)
[09:37:47] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Detailed list of commands: ` disable-puppet "elukey - prep for dist-upgrade" sudo sed -e 's/debian-security buster\/updates/debian-security bullseye-security/g' /etc/apt/sources.list -i sudo sed -e 's/buster/bul...
[09:38:50] serviceops, Maps, Wikimedia-Hackathon-2023: Improve tile storage for maps.wikimedia.org - https://phabricator.wikimedia.org/T333318 (Jgiannelos)
[09:46:08] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (dcausse) >>! In T330507#8732991, @Joe wrote: > Sorry, I'm ge...
[09:51:56] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[09:55:01] I created a script to automate the dist-upgrade: https://phabricator.wikimedia.org/T332013#8733165
[09:55:08] will try it with the next node
[09:59:44] akosiaris: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/899654/ does this line up with the stuff we discussed last week as regards increasing thumbor capacity/limits?
[10:00:17] * akosiaris looking
[10:02:56] wait, I think I just spotted a config error
[10:07:02] kafka-main1001 up and running with bullseye (also rebooted in the new kernel)
[10:07:11] this time the interface renaming didn't bite me :)
[10:07:27] hnowlan: leaving comments on the change, but I had to first submit this https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/903614
[10:11:01] akosiaris: wow, oh dear.
[10:11:06] glad we always pooled eqiad
[10:14:57] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, wdwb-tech: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (ItamarWMDE) @Joe, @dcausse, Thank you for your advice...
[10:23:39] serviceops, Wikidata, Wikidata-Query-Service, Wikidata.org, and 2 others: Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off - https://phabricator.wikimedia.org/T331405 (dcausse) @ItamarWMDE once https://gerrit.wikimedia.org/...
[10:27:05] akosiaris: just noticed on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/902066 you moved the mw-web namespace change to mw-debug, I assume that was a typo
[10:28:44] hnowlan: there's a followup fixing it
[10:28:52] I noticed it late yesterday :-(
[10:29:56] ahh ok
[10:42:10] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) @ayounsi thanks for the response. Overall I've no objection so let's proceed. I agree in terms of a...
[11:04:08] serviceops, Continuous-Integration-Infrastructure, MediaWiki-Documentation, Release-Engineering-Team: MediaWiki periodic Doxygen Jenkins job fails to publish - https://phabricator.wikimedia.org/T333294 (hashar)
[11:28:12] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[11:30:34] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fnegri) I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/...
[11:49:40] any objection if I complete the bullseye upgrade in kafka main eqiad?
[11:49:43] one node left
[11:49:55] (maintenance is in one hour, I should be done in less)
[11:55:48] * elukey proceeds
[12:15:01] <_joe_> oh noes!
[12:15:09] <_joe_> elukey: <3
[12:16:24] :)
[12:16:47] way easier with the script, I should be able to do the other 3 nodes by the end of week
[12:25:50] done!
[12:25:56] kafka-main eqiad is on bullseye
[12:27:19] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) Oh, I misunderstood, I thought that WDQS updater w...
[12:29:15] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[12:33:37] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Kafka main eqiad moved to bullseye, next steps: * reimage kafka-main200[1-3] using the dist-upgrade procedure outlined above.
[12:37:30] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (dcausse) >>! In T330507#8734369, @Ottomata wrote: > Oh, I mi...
[12:50:45] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[12:58:29] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switch...
[13:02:39] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:06:03] looks like k8s doesn't like the thumbor scale-up even with the lower request in place: "0/22 nodes are available: 16 Insufficient cpu"
[13:18:00] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switch...
[13:20:46] hnowlan: Hmmm, we actually don't have a lower request now that I look at the change again
[13:21:33] let me hotpatch and try something
[13:21:44] thumbor page
[13:22:54] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[13:35:55] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:41:09] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[13:44:23] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (herron)
[13:46:11] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[13:49:45] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 h...
[13:54:07] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jbond)
[13:55:47] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MatthewVernon)
[13:56:35] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:59:35] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[14:02:02] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (jbond)
[14:23:50] lots of takeaways from this for thumbor-k8s, but main one that sticks out atm is that thumbor-k8s in codfw is currently serving as many qps as thumbor-metal was in codfw before the maintenance
[14:24:19] hnowlan: that's surprisingly good
[14:25:07] yeah, encouraging
[14:26:18] however, performance is terribad as expected https://grafana.wikimedia.org/goto/LLmziyf4z?orgId=1 vs https://grafana.wikimedia.org/goto/xssziyB4z?orgId=1
[14:26:38] but I'll take slow and no errors as a point to iterate from
[14:26:57] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[14:27:42] <_joe_> ugh that's 4 to 5x slower for normal workloads
[14:27:46] <_joe_> it's indeed quite bad
[14:32:52] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches...
[14:46:09] serviceops, Kubernetes: Set scaling_governor to performance for wikikube workers - https://phabricator.wikimedia.org/T332788 (Clement_Goubert) ` cgoubert@cumin1001:~$ sudo cumin 'P{C:cpufrequtils} and P{P:kubernetes::node} and P{F:is_virtual = false}' "cpufreq-info -p | awk '{print \$3}'" 62 hosts will...
[14:48:03] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches...
[14:50:30] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[14:54:06] so current state of affairs is now that we're back to normal capacity in thumbor - thumbor and thumbor-k8s are serving traffic 50/50 in codfw
[14:54:18] I'd like to try to get to that state in eqiad also
[14:54:29] not to leave overnight or anything but just to see how we do in eqiad also
[14:54:37] Scaling up gradually
[14:55:17] 👍
[14:55:39] ack
[14:59:12] serviceops, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10), Patch-For-Review, Service-deployment-requests: New Service Request mediawiki-page-content-change-enrichment - https://phabricator.wikimedia.org/T330507 (Ottomata) @Joe we discussed the use of page_content_change i...
[14:59:59] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi) The switch upgrade itself went smoothly as well, like the other rows. One issue was that gerrit1001 was missing from the list. This is bec...
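A note on the "0/22 nodes are available: 16 Insufficient cpu" message quoted at 13:06: the Kubernetes scheduler places pods by comparing their CPU requests against each node's allocatable CPU minus the requests already reserved there, not against live CPU usage. That is why the scale-up failed until the request was actually lowered. The sketch below illustrates only that one check. The `Node` class, the node names, and all numbers are hypothetical, and the real scheduler applies further filters (memory, taints, affinity) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_mcpu: int  # CPU the node offers to pods, in millicores
    requested_mcpu: int    # sum of CPU requests of pods already placed on it


def nodes_with_room(nodes, pod_request_mcpu):
    """Return names of nodes whose unreserved CPU covers the pod's CPU request."""
    return [
        n.name
        for n in nodes
        if n.allocatable_mcpu - n.requested_mcpu >= pod_request_mcpu
    ]


# Hypothetical cluster: one nearly full node, one mostly empty node.
example_nodes = [
    Node("node-a", allocatable_mcpu=4000, requested_mcpu=3500),
    Node("node-b", allocatable_mcpu=4000, requested_mcpu=1000),
]
```

With these made-up values, a pod requesting 1000m fits only on node-b, while a pod requesting 3500m fits nowhere, even if both nodes are idle in terms of actual usage.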
[15:20:48] we're moderately happily doing 50/50 in both datacentres
[15:21:02] hnowlan: how moderately?
[15:21:04] :P
[15:21:15] think sad clown doing happy tricks
[15:21:30] for my next trick I'd like to briefly see if k8s can handle codfw on its own
[15:21:40] (I am fairly sure it can)
[15:22:22] I'm wondering why processing times in k8s are about double in codfw compared to eqiad
[15:22:35] Although that may just be an artifact of when it was pooled, etc.
[15:23:00] wait no I'm reading it wrong
[15:24:06] yeah no, it depends on the type, which probably means too early to tell
[15:24:14] yeah most likely
[15:24:43] the time distribution of requests for some of the wackier formats will look weird also
[15:37:05] alright, gonna try doing k8s-only in codfw for a little bit and then i'll set all k8s thumbor workers to inactive
[15:49:38] tbh over time those processing graphs don't look so bad when you consider the averages for things like imagemagick and ghostscript
[15:49:52] djvu is particularly bad though, gif isn't great either. They're the two most expensive though
[15:50:09] anyway, experiment over, putting things back to metal
[16:35:18] serviceops, MW-on-K8s, Scap: Error: failed to download "wmf-stable/mediawiki" when deploying to MW-on-K8s - https://phabricator.wikimedia.org/T333382 (jnuche) In case it's useful, list of arguments helmfile was called with: ` 1: diff (4 bytes) 2: upgrade (7 bytes) 3: --reset-values (14 bytes)...
[17:10:44] serviceops, MW-on-K8s, Scap: Error: failed to download "wmf-stable/mediawiki" when deploying to MW-on-K8s - https://phabricator.wikimedia.org/T333382 (dancy) The error message means that accessing https://helm-charts.wikimedia.org/stable/.... failed.
[20:53:37] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[21:43:10] serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (dduvall) Thank you, @JMeybohm. That's very helpful. It does seem like the `...