[01:10:07] serviceops, collaboration-services, MW-on-K8s, SRE: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11462618 (Dzahn) Open→Resolved file transfers to and between releases servers are now encrypted
[10:22:56] serviceops, Wikifunctions, Abstract Wikipedia team (26Q2 (Oct–Dec)), Essential-Work: WF memcached service is dc-local but used for dc-global content - https://phabricator.wikimedia.org/T411807#11463249 (akosiaris) p:Unbreak!→High Lowering to high while the analysis and recommendation is b...
[12:04:01] serviceops, Patch-For-Review: wikikube-ctrl200[4-5] implementation tracking - https://phabricator.wikimedia.org/T390861#11463543 (JMeybohm) Two questions/suggestions in this regard: * I see that we also have wikikube-ctrl2006 racked (T406596), would it make sense to do all three at once? * Given we moved...
[12:09:46] serviceops, Page Content Service: Production error: worker died, restarting - https://phabricator.wikimedia.org/T394659#11463595 (MLechvien-WMF) Hi @Jgiannelos do you have any recent example of this happening? Can we mark this as closed if not?
[12:10:10] serviceops, Page Content Service: Production error: worker died, restarting - https://phabricator.wikimedia.org/T394659#11463597 (MLechvien-WMF) a:Jgiannelos
[12:21:14] serviceops, MW-on-K8s, Datacenter-Switchover, Patch-For-Review: Update DC switchover cookbooks to handle maintenance scripts on k8s - https://phabricator.wikimedia.org/T359130#11463632 (Clement_Goubert) @jasmine_ I don't think you ran into any actual issues regarding maintenance jobs during the l...
[12:25:15] serviceops, MediaWiki-Platform-Team, SRE Observability: Improve MediaWiki periodic job alerts - https://phabricator.wikimedia.org/T412799 (Clement_Goubert) NEW
[12:38:58] serviceops, Release-Engineering-Team, SRE Observability: Proof of Concept: Train Health Dashboard - https://phabricator.wikimedia.org/T412801 (jijiki) NEW
[12:55:46] serviceops, Observability-Logging, Patch-For-Review: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616#11463869 (JMeybohm) Open→Resolved Closing again because it seems to work fine mostly and we can't reproduce failures
[12:58:09] serviceops, MediaWiki-Platform-Team, SRE Observability: Improve MediaWiki periodic job alerts - https://phabricator.wikimedia.org/T412799#11463879 (Clement_Goubert) Open→In progress p:Triage→Medium
[13:01:30] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11463897 (JMeybohm) Open→Resolved a:JMeybohm With {T352245} resolved, this has now been completed.
[13:02:52] serviceops, MediaWiki-Platform-Team, SRE Observability, Patch-For-Review: Improve MediaWiki periodic job alerts - https://phabricator.wikimedia.org/T412799#11463908 (Clement_Goubert) Patch pushed that implements a better logstash link, as well as a command-line that dumps all logs for the alertin...
[13:07:05] serviceops, MinT, Prod-Kubernetes, SRE, and 3 others: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11463940 (Nikerabbit) See also {T386371} which mentions that one pod uses more memory than others.
[13:14:23] serviceops, MediaWiki-Platform-Team, SRE Observability: MediaWiki periodic job startupregistrystats-mediawikiwiki failed - https://phabricator.wikimedia.org/T410764#11464036 (Clement_Goubert) We should now keep the last failure for up to a week, subtask was created for alert improvements and waiting...
[13:15:02] serviceops, SRE Observability, MediaWiki-Platform-Team (Kanban Board), Patch-For-Review: Improve MediaWiki periodic job alerts - https://phabricator.wikimedia.org/T412799#11464038 (DAlangi_WMF)
[13:23:06] serviceops, Infrastructure-Foundations, Patch-For-Review: Improve release process of Spicerack and service catalog - https://phabricator.wikimedia.org/T412700#11464076 (MLechvien-WMF) Thanks for the response and the proposed patch! > It would be still nice to have a check in the Puppet repo to remi...
[13:23:29] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464079 (MoritzMuehlenhoff) Resolved→Open The various certs still need to be cleaned out, reopening
[13:40:08] serviceops, Commons, MediaWiki-File-management, Wikimedia-production-error: MediaWiki periodic job cleanup-upload-stash failed - https://phabricator.wikimedia.org/T412325#11464178 (Clement_Goubert) Logs in [[ https://logstash.wikimedia.org/goto/f12049c5bbfe9cbdf318d334c1f6a08d | logstash ]] indic...
[13:47:04] serviceops, MW-on-K8s: Add configuration for MESH_CHECK_SKIP in periodic job puppet definition - https://phabricator.wikimedia.org/T412818 (Clement_Goubert) NEW
[14:00:23] serviceops, Commons, MediaWiki-File-management, Wikimedia-production-error: MediaWiki periodic job cleanup-upload-stash failed - https://phabricator.wikimedia.org/T412325#11464294 (A_smart_kitten) FWIW that seems //similar// on its face to {T346971}, but with a slightly different message. (Though...
[14:02:00] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464299 (JMeybohm) a:JMeybohm→MoritzMuehlenhoff Thanks for volunteering to remove the remaining certs and cergen config during your January cleanup
[14:22:10] serviceops: Build envoy-build-tools image locally - https://phabricator.wikimedia.org/T265357#11464395 (JMeybohm) Open→Declined Since we package envoy binaries now, this is no longer required.
[14:53:00] serviceops, Patch-For-Review: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251#11464641 (akosiaris) >>! In T390251#11436436, @elukey wrote: > @akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could b...
[15:06:36] serviceops, Commons, MediaWiki-File-management, Wikimedia-production-error: MediaWiki periodic job cleanup-upload-stash failed - https://phabricator.wikimedia.org/T412325#11464750 (Clement_Goubert) Open→Resolved a:Clement_Goubert Yup, that's the same error. It's usually transient,...
[16:41:33] serviceops: Document where the docker images used by WMF are located - https://phabricator.wikimedia.org/T412787#11465158 (Pppery)
[20:24:03] anyone here knows about: "external-services, hive-analytics" and "external-services, mariadb-external-storage-codfw, Endpoints (v1)"?
[20:24:25] I am seeing merged but undeployed changes for these.
[20:24:45] which made me say NO to a diff for something unrelated
[20:42:10] mutante: external-services diffs are almost always safe to deploy
[20:51:00] cdanis: hmm, thanks! but it's also removing an IP address from mariadb-external-storage-codfw .. hrmmm
[20:51:13] probably a host that got turned down
[20:51:28] external-services is something like the k8s version of ferm rules, and they only get reconciled during deploys
[20:52:13] one IP is sretest2003, another IP does not exist in DNS
[20:53:17] my own change is not urgent or big at all. does not matter to me that it's quick
[22:33:43] serviceops, MW-on-K8s, SRE: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11466660 (jsn.sherman) This happened again in the UTC late backport window: https://sal.toolforge.org/log/5jwwKZsBvg159pQrFeSI https://spiderpig.wikimedia.org/j...