[00:00:05] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211112T0000).
[00:00:05] tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:04:45] o/
[00:05:50] PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:02] I'll do the deploy
[00:10:32] (CR) Gergő Tisza: [C: +2] Enable GrowthExperiments image recommendations on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/738284 (https://phabricator.wikimedia.org/T294878) (owner: Gergő Tisza)
[00:11:17] (Merged) jenkins-bot: Enable GrowthExperiments image recommendations on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/738284 (https://phabricator.wikimedia.org/T294878) (owner: Gergő Tisza)
[00:12:01] Grow experiments image: https://commons.wikimedia.org/wiki/File:ISS-06_Sprouts_on_the_Russian_plant_growth_experiment.jpg
[00:12:04] Growth*
[00:14:56] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:738284|Enable GrowthExperiments image recommendations on eswiki (T294878)]] (duration: 00m 56s)
[00:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:00] T294878: Add Image: Enable on pilot wikis in dark mode - https://phabricator.wikimedia.org/T294878
[00:15:05] !log UTC late deploys done
[00:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:18] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:30:20] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:49:12] (CR) Huji: [C: +1] Change votewiki language back to English [mediawiki-config] - https://gerrit.wikimedia.org/r/738222 (https://phabricator.wikimedia.org/T292685) (owner: 4nn1l2)
[02:00:48] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 8024.32 ms
[02:07:00] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 235.09 ms
[02:21:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:25:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:26:27] (KubernetesCalicoDown) firing: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[02:29:54] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 90, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:31:26] (KubernetesCalicoDown) resolved: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[02:54:04] (PS1) Seddon: Enable changes to mediasearch tab order [mediawiki-config] - https://gerrit.wikimedia.org/r/738293 (https://phabricator.wikimedia.org/T284208)
[02:54:49] (CR) jerkins-bot: [V: -1] Enable changes to mediasearch tab order [mediawiki-config] - https://gerrit.wikimedia.org/r/738293 (https://phabricator.wikimedia.org/T284208) (owner: Seddon)
[04:45:48] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[04:51:56] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 3.25 ms
[05:51:10] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 8.276e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:00:02] (PS1) Marostegui: Revert "db1104: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/738233
[06:26:00] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:40] PROBLEM - Host mr1-ulsfo.oob is DOWN: CRITICAL - Network Unreachable (198.24.47.102)
[06:26:42] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:30:16] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 33, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:45:30] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 74.15 ms
[06:45:32] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 76.90 ms
[06:46:02] (PS1) Urbanecm: uzwiki: Enable VisualEditor by default [mediawiki-config] - https://gerrit.wikimedia.org/r/738296 (https://phabricator.wikimedia.org/T294245)
[06:46:04] (PS1) Urbanecm: uzwiki: Enable Growth features in dark mode [mediawiki-config] - https://gerrit.wikimedia.org/r/738297 (https://phabricator.wikimedia.org/T294245)
[06:47:32] (CR) Elukey: [V: +1] "To keep archives happy - after a chat with John we agreed that some follow up is needed, the last use case to consider is when puppet trie" [puppet] - https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[06:52:01] mmm that BGP alert for ml-serve is weird
[07:00:42] (CR) Marostegui: [C: +2] Revert "db1104: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/738233 (owner: Marostegui)
[07:01:17] it seems that a pod died and got re-created
[07:01:34] calico-node-v7p9t, that was running on ml-serve1001
[07:01:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 5%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17728 and previous config saved to /var/cache/conftool/dbconfig/20211112-070141-root.json
[07:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add weight for db1104', diff saved to https://phabricator.wikimedia.org/P17729 and previous config saved to /var/cache/conftool/dbconfig/20211112-070236-marostegui.json
[07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:18] Nov 12 02:17:15 ml-serve1001 kubelet[31935]: I1112 02:17:15.207931 31935 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim ephemeral-storage
[07:13:21] :( :(
[07:17:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 10%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17730 and previous config saved to /var/cache/conftool/dbconfig/20211112-071752-root.json
[07:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:29:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:30:26] (KubernetesCalicoDown) firing: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[07:32:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 20%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17731 and previous config saved to /var/cache/conftool/dbconfig/20211112-073255-root.json
[07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:12] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 90, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:35:17] ahhh ok there is a pod that spams in the logs
[07:35:26] (KubernetesCalicoDown) resolved: ml-serve1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[07:47:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 25%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17732 and previous config saved to /var/cache/conftool/dbconfig/20211112-074759-root.json
[07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:39] (CR) Muehlenhoff: Add ownership annotations for additional Data Persistence services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211112T0800)
[08:01:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[08:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 40%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17733 and previous config saved to /var/cache/conftool/dbconfig/20211112-080302-root.json
[08:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:36] SRE, Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (MoritzMuehlenhoff)
[08:03:37] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2021-12-12 08:02:36 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/
[08:04:14] SRE, Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (MoritzMuehlenhoff)
[08:06:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[08:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 50%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17734 and previous config saved to /var/cache/conftool/dbconfig/20211112-081806-root.json
[08:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:31] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 1.501e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[08:27:59] !log imported openjdk-8 8u312-b07-1~deb11u1 to component/jdk8 for bullseye-wikimedia (rebuild of latest Java 8 security release for Bullseye)
[08:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 75%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17735 and previous config saved to /var/cache/conftool/dbconfig/20211112-083310-root.json
[08:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:03] (CR) Hashar: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174) (owner: Dzahn)
[08:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 100%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17736 and previous config saved to /var/cache/conftool/dbconfig/20211112-084813-root.json
[08:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:14] (CR) Hashar: "The report only shows that there are barely any hit so we can deploy to both the primary and replica." [puppet] - https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) (owner: Hashar)
[09:46:18] (CR) Hashar: zuul: use releng list rather than jenkins-bot for email (1 comment) [puppet] - https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: Hashar)
[09:49:08] (CR) Jbond: [C: +1] scripts: clean temporary code from PuppetDB import [software/netbox-extras] - https://gerrit.wikimedia.org/r/738274 (owner: Volans)
[09:55:59] (PS1) David Caro: worker: drop support for .config directory [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738352 (https://phabricator.wikimedia.org/T294541)
[09:56:32] (PS10) Ema: varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879)
[10:02:45] (PS11) Ema: varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879)
[10:04:40] (PS1) Jcrespo: Revert "install_server: manually setup db1139 and db2100" [puppet] - https://gerrit.wikimedia.org/r/738236
[10:05:02] (Abandoned) Jcrespo: Revert "install_server: manually setup db1139 and db2100" [puppet] - https://gerrit.wikimedia.org/r/738236 (owner: Jcrespo)
[10:05:08] (CR) Hashar: "Great thank you for the double check!" [puppet] - https://gerrit.wikimedia.org/r/727358 (owner: Hashar)
[10:07:54] (CR) Vgutierrez: [C: +1] varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[10:08:11] (PS1) Jcrespo: install_server: Wipe db1139, reimage keeping data db2100 [puppet] - https://gerrit.wikimedia.org/r/738354 (https://phabricator.wikimedia.org/T280979)
[10:08:33] (PS1) Elukey: kserve-inference: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738355 (https://phabricator.wikimedia.org/T289834)
[10:08:56] (CR) Jcrespo: [C: +2] install_server: Wipe db1139, reimage keeping data db2100 [puppet] - https://gerrit.wikimedia.org/r/738354 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo)
[10:09:39] (PS12) Ema: varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879)
[10:12:04] (CR) Vgutierrez: [C: +1] varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[10:13:59] (CR) Arturo Borrero Gonzalez: [C: +2] hieradata: cloud: eqiad1: relocate hiera key from common/ to eqiad/ [puppet] - https://gerrit.wikimedia.org/r/738271 (owner: Arturo Borrero Gonzalez)
[10:17:47] !log A:cp disable-puppet to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/737424 on cp4027 T293879
[10:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:51] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[10:18:30] (CR) Ema: [C: +2] varnish: add varnish::logging::mtail [puppet] - https://gerrit.wikimedia.org/r/737424 (https://phabricator.wikimedia.org/T293879) (owner: Ema)
[10:18:32] (PS4) Arturo Borrero Gonzalez: ceph::control: enable auth deploy on eqiad and remove unused vars [puppet] - https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: David Caro)
[10:21:57] (PS5) JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385)
[10:22:48] (CR) JMeybohm: admin_ng: Create Certificates for ingressgateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: JMeybohm)
[10:25:17] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host db1139.eqiad.wmnet with OS buster
[10:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:53] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (Marostegui)
[10:26:05] (CR) Jhernandez: [C: -1] "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/737503/comment/29a0b821_5e21bdfb/" [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:26:27] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Lower automatic query killing threshold to 55 seconds - https://phabricator.wikimedia.org/T293533 (Marostegui) Open→Declined I am going to close this as declined as I don't see much interest on doing this for now. If some...
[10:27:22] (PS7) Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:30:19] (CR) Alexandros Kosiaris: [C: -1] "nitpick, but otherwise LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: JMeybohm)
[10:35:53] !log A:cp re-enable puppet after successful testing of https://gerrit.wikimedia.org/r/c/operations/puppet/+/737424 on cp4027 T293879
[10:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:57] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[10:36:39] (PS6) JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385)
[10:37:04] (CR) JMeybohm: admin_ng: Create Certificates for ingressgateway (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: JMeybohm)
[10:39:41] (CR) Elukey: [C: +2] kserve-inference: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738355 (https://phabricator.wikimedia.org/T289834) (owner: Elukey)
[10:39:57] (PS1) Vgutierrez: haproxy: Allow disabling check_haproxy based monitoring [puppet] - https://gerrit.wikimedia.org/r/738356 (https://phabricator.wikimedia.org/T290005)
[10:41:10] (PS8) Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:41:55] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1139.eqiad.wmnet with OS buster
[10:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[10:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:15] (PS1) Vgutierrez: cache:haproxy: Disable check_haproxy based checks [puppet] - https://gerrit.wikimedia.org/r/738357 (https://phabricator.wikimedia.org/T290005)
[10:42:22] (CR) Jhernandez: "Patchset 7 removes the copypasted doc comments and fixes the answers keys to be the ones included in WikimediaMessages." [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:43:12] (CR) Jhernandez: Set up beta test environment for QuickSurveys (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:43:45] (CR) Jhernandez: [C: +1] "I've tested the config locally, changing the page ids to a couple of my local pages and everything works fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:43:51] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32384/console" [puppet] - https://gerrit.wikimedia.org/r/738356 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:45:03] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32385/console" [puppet] - https://gerrit.wikimedia.org/r/738357 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:45:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:01] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host db2100.codfw.wmnet with OS buster
[10:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:06] (CR) Alexandros Kosiaris: [C: +1] admin_ng: Create Certificates for ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: JMeybohm)
[10:50:12] (CR) Jhernandez: [C: +1] "Once deployed, after the frontend assets are refreshed -which may take a few minutes- the surveys should show up here:" [mediawiki-config] - https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: EllenR)
[10:53:12] (CR) JMeybohm: [V: +2 C: +2] Update copyright [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/737203 (owner: JMeybohm)
[10:53:19] (CR) JMeybohm: [V: +2 C: +2] Add simple-cfssl image for development and e2e tests [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736809 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[10:53:31] (PS1) Jcrespo: install_server: Revert db1139/db2100 to the regular install recipe [puppet] - https://gerrit.wikimedia.org/r/738358 (https://phabricator.wikimedia.org/T280979)
[10:53:59] (CR) JMeybohm: [V: +2 C: +2] Implement CFSSL API signer [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/736808 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[10:58:48] (PS1) Elukey: knative-serving: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738360 (https://phabricator.wikimedia.org/T289834)
[11:05:37] !log jynus@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2100.codfw.wmnet with OS buster
[11:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:09] (PS1) JMeybohm: Fix README link to sample-external-issuer [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/738364
[11:06:29] (CR) JMeybohm: [V: +2 C: +2] Fix README link to sample-external-issuer [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/738364 (owner: JMeybohm)
[11:08:50] (CR) JMeybohm: [V: +2 C: +2] Add cfssl-issuer docker image [docker-images/production-images] - https://gerrit.wikimedia.org/r/737329 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm)
[11:10:47] SRE, Infrastructure-Foundations, CAS-SSO: Cookbook to manage 2FA state for a user - https://phabricator.wikimedia.org/T295579 (MoritzMuehlenhoff)
[11:12:57] (CR) Arturo Borrero Gonzalez: P:openstack::base::cloudgw: drop unneeded profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737774 (owner: Jbond)
[11:13:02] SRE, SRE Observability (FY2021/2022-Q2): DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration - https://phabricator.wikimedia.org/T292603 (Volans) p:Medium→High The problem I see with the prolonging of this issue is that all recipients of those emails are most likely b...
[11:13:19] (03PS1) 10David Caro: CI: add style checks and formatting script [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) [11:13:49] (03PS1) 10JMeybohm: Vendor dependencies [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/738367 [11:14:02] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Vendor dependencies [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/738367 (owner: 10JMeybohm) [11:15:13] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32387/console" [puppet] - 10https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [11:16:02] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32386/console" [puppet] - 10https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:16:20] (03PS1) 10JMeybohm: cfssl-issuer: Fix path typo in dockerfile [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/738368 (https://phabricator.wikimedia.org/T294560) [11:16:37] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Fix path typo in dockerfile [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/738368 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:18:38] (03CR) 10Jcrespo: [C: 03+2] install_server: Revert db1139/db2100 to the regular install recipe [puppet] - 10https://gerrit.wikimedia.org/r/738358 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [11:19:13] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738352 (https://phabricator.wikimedia.org/T294541) (owner: 10David Caro) [11:19:48] (03PS7) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [11:20:39] 
(03CR) 10jerkins-bot: [V: 04-1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [11:22:05] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications for db1116, db1139, db2097, db2100 [puppet] - 10https://gerrit.wikimedia.org/r/737963 (owner: 10Jcrespo) [11:28:37] (03PS8) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [11:34:16] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32388/console" [puppet] - 10https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:34:37] (03CR) 10Jelto: [C: 03+1] admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [11:35:04] (03CR) 10David Caro: [V: 03+1 C: 03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:49:11] (03PS9) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [11:49:45] (03PS1) 10Hashar: zuul: send errors from git-daemon to client [puppet] - 10https://gerrit.wikimedia.org/r/738370 (https://phabricator.wikimedia.org/T187897) [11:56:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "merging together after lunch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [12:02:27] 10SRE, 10vm-requests: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) [12:03:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/738356 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:04:17] (03CR) 10Jcrespo: "After some interactions at T295312, I believe they call a database replica "backup". This is confusing, and not only not WMF SRE-speak, it" [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [12:04:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/738357 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:05:34] (03CR) 10Jcrespo: Add ownership annotations for additional Data Persistence services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [12:07:06] 10SRE, 10vm-requests: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) [12:09:26] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy: Allow disabling check_haproxy based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/738356 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:10:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache:haproxy: Disable check_haproxy based checks [puppet] - 10https://gerrit.wikimedia.org/r/738357 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:10:51] 10SRE, 10vm-requests: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) [12:14:16] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:14:40] 10SRE, 
10vm-requests: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) [12:23:47] (03PS1) 10Ayounsi: [WIP] Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) [12:24:22] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [12:40:00] (03PS6) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [12:41:22] 10SRE, 10vm-requests: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) [12:50:03] (03PS1) 10Arturo Borrero Gonzalez: cloud: introduce role for cloudbackup-dev [puppet] - 10https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584) [12:50:09] (03CR) 10Btullis: [C: 03+1] "I'm fine with this. 👍" [cookbooks] - 10https://gerrit.wikimedia.org/r/737706 (owner: 10Elukey) [12:50:54] (03CR) 10jerkins-bot: [V: 04-1] cloud: introduce role for cloudbackup-dev [puppet] - 10https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [13:17:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph::control: enable auth deploy on eqiad and remove unused vars [puppet] - 10https://gerrit.wikimedia.org/r/737936 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [13:21:34] (03CR) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:30:57] 10SRE-tools, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10media-backups, 10observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (10jcrespo) [13:32:45] 10SRE-tools, 
Data-Persistence-Backup, Infrastructure-Foundations, media-backups, observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo) @jbond do you know, for example, of any recent change applied to pki::get_cert... [13:51:35] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/738370 (https://phabricator.wikimedia.org/T187897) (owner: Hashar) [14:04:14] (PS1) Hashar: alertmanager: send releng alerts to both irc and mail [puppet] - https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) [14:06:43] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:11:38] (PS1) Arturo Borrero Gonzalez: cloud: eqiad1: deploy cinder ceph auth creds [puppet] - https://gerrit.wikimedia.org/r/738383 (https://phabricator.wikimedia.org/T293752) [14:21:09] SRE, Gerrit, observability, Patch-For-Review, Release-Engineering-Team (Radar): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (hashar) a:hashar→None [14:22:26] SRE, Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (Ottomata) https://wikitech.wikimedia.org/wiki/PKI/Cloud ? maybe we can make it work? [14:24:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (Ottomata) Yes, approved. 
[14:25:18] (PS7) Btullis: Add the first eventgate alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) [14:30:04] SRE, Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) >>! In T291905#7500211, @Ottomata wrote: > https://wikitech.wikimedia.org/wiki/PKI/Cloud ? maybe we ca... [14:36:54] (CR) Btullis: [C: +1] Declare airflow-dags scap for analytics instance (3 comments) [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [14:37:57] (CR) Btullis: [C: +1] cassandra: add stub values for new credentials format [labs/private] - https://gerrit.wikimedia.org/r/738272 (https://phabricator.wikimedia.org/T235299) (owner: Hnowlan) [14:38:35] !log installing 5.10.70 kernels on bullseye systems (just the update, no coordinated reboot) [14:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] (CR) Btullis: "Looks good. I'll wait until the labs/private change has been merged and pcc has been re-run." 
[puppet] - https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: Hnowlan) [14:48:03] (CR) Btullis: Add the first eventgate alert to Alertmanager (2 comments) [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [14:58:26] (CR) Btullis: Add checks for druid datasources to alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [14:59:07] (PS1) David Caro: openstakc:codfw1: remove cinder keyring [puppet] - https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) [14:59:09] (PS1) David Caro: openstack::codfw: enable cinder key generation [puppet] - https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) [14:59:11] (PS1) David Caro: openstack::eqiad: Remove cinder key generation from cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) [14:59:13] (PS1) David Caro: openstack::eqiad: enable cinder keyring generation on control nodes [puppet] - https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) [15:05:04] (CR) Ottomata: [C: +1] Add checks for druid datasources to alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:08:34] (CR) Btullis: [C: +2] Add checks for druid datasources to alertmanager (1 comment) [alerts] - https://gerrit.wikimedia.org/r/736279 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:10:10] (CR) Btullis: [C: +2] Remove prometheus based Druid checks [puppet] - https://gerrit.wikimedia.org/r/736280 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [15:10:12] (PS10) Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 
https://gerrit.wikimedia.org/r/738261 [15:11:33] (CR) jerkins-bot: [V: -1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 (owner: Jbond) [15:11:54] SRE, Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (MoritzMuehlenhoff) [15:12:31] SRE, Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (MoritzMuehlenhoff) [15:14:35] (PS3) Btullis: Add more alerts to the data-engineering team [alerts] - https://gerrit.wikimedia.org/r/735669 (https://phabricator.wikimedia.org/T293399) [15:17:18] SRE, serviceops, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Jelto) a:Jelto Most deployment charts where updated in [736227](https://gerrit.wikimedia.org/r/736227). The services were re-deployed with com... 
[15:17:32] (PS11) Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 [15:19:59] (CR) jerkins-bot: [V: -1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 (owner: Jbond) [15:20:01] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32389/console" [puppet] - https://gerrit.wikimedia.org/r/738193 (owner: Giuseppe Lavagetto) [15:24:14] (PS12) Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 [15:24:56] (CR) jerkins-bot: [V: -1] Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 (owner: Jbond) [15:26:12] (CR) David Caro: [C: +2] worker: drop support for .config directory [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738352 (https://phabricator.wikimedia.org/T294541) (owner: David Caro) [15:26:44] SRE, serviceops, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Ottomata) Hiya! Yes, I would love to upgrade. I tried to do this once but got confused and filed {T291848}. Joe answered many of my questions,... 
[15:27:02] (Merged) jenkins-bot: worker: drop support for .config directory [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738352 (https://phabricator.wikimedia.org/T294541) (owner: David Caro) [15:27:57] (CR) Ottomata: [C: +2] pontoon - Don't use http proxy for apt [puppet] - https://gerrit.wikimedia.org/r/735654 (owner: Ottomata) [15:28:01] (PS2) Ottomata: pontoon - Don't use http proxy for apt [puppet] - https://gerrit.wikimedia.org/r/735654 [15:28:05] (CR) Ottomata: [V: +2 C: +2] pontoon - Don't use http proxy for apt [puppet] - https://gerrit.wikimedia.org/r/735654 (owner: Ottomata) [15:28:52] (PS2) Ottomata: Include base::puppet in profile::puppetmaster::pontoon [puppet] - https://gerrit.wikimedia.org/r/735650 [15:29:05] (CR) Ottomata: [V: +2 C: +2] Include base::puppet in profile::puppetmaster::pontoon [puppet] - https://gerrit.wikimedia.org/r/735650 (owner: Ottomata) [15:29:18] (PS1) Muehlenhoff: Add ownership annotations for more o11y services [puppet] - https://gerrit.wikimedia.org/r/738416 (https://phabricator.wikimedia.org/T216088) [15:30:44] (CR) Ottomata: [V: +1] Declare airflow-dags scap for analytics instance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:31:13] (PS23) Ottomata: Declare airflow-dags scap for analytics instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:33:15] (PS24) Ottomata: Declare airflow-dags scap for analytics-test instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:35:27] (PS25) Ottomata: Declare airflow-dags scap for analytics-test instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:35:33] (PS15) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions 
[puppet] - https://gerrit.wikimedia.org/r/737929 [15:35:35] (PS8) Giuseppe Lavagetto: deployment-prep: install php 7.4 on a mw appserver [puppet] - https://gerrit.wikimedia.org/r/738194 [15:36:07] (PS26) Ottomata: Declare airflow-dags scap for analytics-test instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) [15:40:33] (CR) Btullis: [C: +1] Declare airflow-dags scap for analytics-test instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:41:56] (CR) Ottomata: [C: +2] Declare airflow-dags scap for analytics-test instance [puppet] - https://gerrit.wikimedia.org/r/737770 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [15:42:48] (PS13) Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 [15:45:54] (PS7) JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [15:47:01] (PS2) David Caro: CI: add style checks and formatting script [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) [15:47:03] (PS1) David Caro: controller: consider failure if any host fails [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) [15:47:58] (CR) jerkins-bot: [V: -1] controller: consider failure if any host fails [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: David Caro) [15:48:03] PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [15:49:34] this is me ^ [15:49:49] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:25] (PS1) Vgutierrez: cache:haproxy: Enable request logging to a ring buffer [puppet] - https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:53:30] (PS2) BBlack: lvs recnds: remove last remaining revdns comments [dns] - https://gerrit.wikimedia.org/r/556230 (https://phabricator.wikimedia.org/T239993) [15:53:51] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:02] (PS14) Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 [15:54:14] (CR) Jbond: "ready for review" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 (owner: Jbond) [15:54:22] (PS3) BBlack: lvs recnds: remove last remaining revdns comments [dns] - https://gerrit.wikimedia.org/r/556230 (https://phabricator.wikimedia.org/T239993) [15:56:19] (CR) BBlack: [C: +2] lvs recnds: remove last remaining revdns comments [dns] - https://gerrit.wikimedia.org/r/556230 (https://phabricator.wikimedia.org/T239993) (owner: BBlack) [16:00:11] RECOVERY - Keyholder SSH agent on deploy1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [16:00:11] (PS1) Muehlenhoff: Add ownership annotations for more Service SRE services [puppet] - https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) [16:05:03] SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BBlack) >>! 
In T239993#7467564, @ayounsi wrote: > @BBlack > Looking at router config I found: > ` > /* Temporary for T239993 */ > route 208.80.153.254/32 { > next-hop 208.80.153.111; > readve... [16:06:28] (CR) Giuseppe Lavagetto: [V: +1 C: +2] noc: move template used only by NOC to it [puppet] - https://gerrit.wikimedia.org/r/738193 (owner: Giuseppe Lavagetto) [16:11:16] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [16:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:27] (PS1) Muehlenhoff: Add ownership annotations for more Search Platform services [puppet] - https://gerrit.wikimedia.org/r/738432 (https://phabricator.wikimedia.org/T216088) [16:15:03] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:46] (CR) Ebernhardson: [C: +1] Add ownership annotations for more Search Platform services [puppet] - https://gerrit.wikimedia.org/r/738432 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff) [16:26:33] (PS2) Elukey: knative-serving: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738360 (https://phabricator.wikimedia.org/T289834) [16:34:41] (CR) Cwhite: [C: +2] Add ownership annotations for more o11y services [puppet] - https://gerrit.wikimedia.org/r/738416 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff) [16:35:44] (CR) Cwhite: "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/738416 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff) [16:36:54] !log otto@deploy1002 Started deploy [airflow-dags/analytics@093f067] (hadoop-test): (no justification provided) [16:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:06] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@093f067] (hadoop-test): (no justification provided) (duration: 01m 12s) [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:41] (CR) Elukey: [C: +2] knative-serving: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738360 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [16:42:43] RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:46] SRE, Infrastructure-Foundations: Integrate Buster 10.11 point update - https://phabricator.wikimedia.org/T292838 (MoritzMuehlenhoff) [16:45:02] (CR) Cwhite: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [16:48:44] (Merged) jenkins-bot: knative-serving: improve network policies [deployment-charts] - https://gerrit.wikimedia.org/r/738360 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [16:51:18] (PS1) Ottomata: scap::target - add $manage_ssh_key parameter [puppet] - https://gerrit.wikimedia.org/r/738436 (https://phabricator.wikimedia.org/T295380) [16:52:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[16:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:24] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32390/console" [puppet] - https://gerrit.wikimedia.org/r/738436 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [16:57:31] (CR) Ottomata: [V: +1] "PCC is a no-op where appropriate." [puppet] - https://gerrit.wikimedia.org/r/738436 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [16:57:33] (CR) Ottomata: [V: +1 C: +2] scap::target - add $manage_ssh_key parameter [puppet] - https://gerrit.wikimedia.org/r/738436 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [16:59:27] !log otto@deploy1002 Started deploy [airflow-dags/analytics@093f067] (hadoop-test): (no justification provided) [16:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:32] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@093f067] (hadoop-test): (no justification provided) (duration: 00m 04s) [16:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:53] (PS1) Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738437 [17:06:31] (PS8) Btullis: Add the first eventgate alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) [17:06:55] (CR) Jbond: [C: +1] "lgtm" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738366 (https://phabricator.wikimedia.org/T295063) (owner: David Caro) [17:10:19] (PS2) Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738437 [17:13:31] (PS1) Elukey: knative-serving: allow networking-istio to contact the k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/738438 (https://phabricator.wikimedia.org/T289834) [17:15:17] !log restarting and arming keyholder on deploy1002 - T295380 [17:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:20] T295380: [Airflow] Set up scap deployment - https://phabricator.wikimedia.org/T295380 [17:17:46] (CR) Btullis: [C: +2] Add the first eventgate alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [17:20:01] (Merged) jenkins-bot: Add the first eventgate alert to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/736490 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [17:20:41] (CR) Elukey: [C: +2] knative-serving: allow networking-istio to contact the k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/738438 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [17:22:02] (PS1) Jbond: P:mediabackup::storage: update mino daemon to use the chained certificate [puppet] - https://gerrit.wikimedia.org/r/738439 (https://phabricator.wikimedia.org/T295594) [17:22:39] 
SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, and 2 others: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jbond) @jcrespo This looks like it is caused because the daemon is only sending it leaf certi... [17:23:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:18] (PS1) Ottomata: Use airflow-dags/analytics on an-launcher [puppet] - https://gerrit.wikimedia.org/r/738440 (https://phabricator.wikimedia.org/T295380) [17:25:56] (Merged) jenkins-bot: knative-serving: allow networking-istio to contact the k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/738438 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [17:28:02] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32392/console" [puppet] - https://gerrit.wikimedia.org/r/738440 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [17:28:34] SRE, DBA, cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (Cmjohnson) @Marostegui 15 Nov 1000 Local 1500GMT ? [17:33:40] (CR) Ottomata: [V: +1 C: +2] Use airflow-dags/analytics on an-launcher [puppet] - https://gerrit.wikimedia.org/r/738440 (https://phabricator.wikimedia.org/T295380) (owner: Ottomata) [17:33:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[17:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:18] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32393/console" [puppet] - https://gerrit.wikimedia.org/r/738439 (https://phabricator.wikimedia.org/T295594) (owner: Jbond) [17:41:05] (CR) Jbond: [V: +1] "could be tested by applying the following manually (this is a symlink)" [puppet] - https://gerrit.wikimedia.org/r/738439 (https://phabricator.wikimedia.org/T295594) (owner: Jbond) [17:45:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] SRE, SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (dr0ptp4kt) Approved [17:52:10] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (dr0ptp4kt) Approved [17:58:03] (PS1) Elukey: knative-serving: allow webhook pods to contact the k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/738446 (https://phabricator.wikimedia.org/T289834) [17:59:20] (PS3) Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738437 [18:07:01] (CR) Elukey: [C: +2] knative-serving: allow webhook pods to contact the k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/738446 (https://phabricator.wikimedia.org/T289834) (owner: Elukey) [18:08:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [18:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:21] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[18:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:44] (PS1) Jbond: WIP: add typing [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 [18:37:24] (CR) jerkins-bot: [V: -1] WIP: add typing [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond) [18:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [18:49:52] (CR) Jbond: [C: -1] P:openstack::base::cloudgw: drop unneeded profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/737774 (owner: Jbond) [18:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:07:54] (PS1) Jgreen: Add host/service notes_url to frack icinga passive check service definition. 
[puppet] - https://gerrit.wikimedia.org/r/738458 (https://phabricator.wikimedia.org/T295383) [19:13:41] (CR) Jgreen: "I'm fairly confident this will only effect fundraising alerts, but would appreciate feedback from folks who are more knowledgable about ou" [puppet] - https://gerrit.wikimedia.org/r/738458 (https://phabricator.wikimedia.org/T295383) (owner: Jgreen) [19:29:18] (PS1) Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group [puppet] - https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) [19:29:57] (CR) jerkins-bot: [V: -1] mediawiki: Ensure mwdeploy user is a member of the www-data group [puppet] - https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: Ahmon Dancy) [19:31:02] SRE, wikitech.wikimedia.org, cloud-services-team (Kanban): wikitech-static down - https://phabricator.wikimedia.org/T295266 (RLazarus) >>! In T295266#7488306, @Marostegui wrote: > https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/wikitech-static-jessie.wikimedia.org i this needs updating, bu... 
[19:32:17] (PS2) Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group [puppet] - https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) [19:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:07:40] SRE, serviceops: Package php 7.4 for wikimedia production - https://phabricator.wikimedia.org/T293449 (Legoktm) The tricks I used to build these differently named PHP 7.4 packages is now documented at https://wikitech.wikimedia.org/wiki/PHP_packaging [20:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [20:50:28] SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BBlack) FTR - I did a quick 1-hour capture on these today just to see whether there was any sign of remaining cases, and there still are some. Probably a 24+ hour capture would get more of them, but... 
[20:57:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:15] SRE, API Platform, Desktop Improvements, MediaWiki-REST-API, and 10 others: Private wikis with new vector return autocomplete search results - https://phabricator.wikimedia.org/T292763 (sbassett) >>! In T292763#7498439, @daniel wrote: > Looks like this is fixed Yes. This might get a mention wit... [21:23:19] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:25] SRE, API Platform, Desktop Improvements, MediaWiki-REST-API, and 10 others: Private wikis with new vector return autocomplete search results - https://phabricator.wikimedia.org/T292763 (sbassett) [21:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [23:25:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1202.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:26:45] (PS1) Jdlrobson: MobileWebUIActions tracks init event [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738399 (https://phabricator.wikimedia.org/T294738) 
[23:43:23] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:49:35] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:50:53] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:56:57] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 224.79 ms