[00:15:55] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:17:06] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:02] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:31:32] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:10] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[01:53:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:00:08] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:24] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_task_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:22] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:55] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed -
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:59:32] (CR) Giuseppe Lavagetto: varnish/tests: improve UX, refactor run.py (2 comments) [puppet] - https://gerrit.wikimedia.org/r/771863 (owner: Giuseppe Lavagetto)
[05:04:38] (PS4) Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995)
[05:30:34] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Open→Resolved Thank you John, I just started mysql. Closing this for now. I will reopen if this crashes again.
[05:38:48] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:40:06] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:40:37] (CR) Giuseppe Lavagetto: [C: +2] Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto)
[05:41:25] (PS5) Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736)
[05:44:06] (Merged) jenkins-bot: Add cookbook to easily route mediawiki traffic [cookbooks] - https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: Giuseppe Lavagetto)
[05:46:02] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:00:30]
RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:00:50] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:09:46] (PS1) Urbanecm: eswiki: Enable structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/834891 (https://phabricator.wikimedia.org/T310905)
[06:20:44] (CR) Urbanecm: [C: +2] eswiki: Enable structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/834891 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[06:22:14] (Merged) jenkins-bot: eswiki: Enable structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/834891 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[06:29:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:30:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:30:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:32:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d2d2c08fc6e0dd5c0c85fbe31f85201721871aa9: eswiki: Enable structured mentor list (T310905) (duration: 04m 30s)
[06:32:32] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[06:36:32] !log clean up my old home dir on matomo1002, ran `apt-get clean` + some other clean up steps on matomo1002 to free space on the root partition
[06:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:41:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:41:26] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:45:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:45:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:48:01] (PS7) Elukey: knative: backport patch to tune pod DNS settings from version 1.5 [docker-images/production-images] - https://gerrit.wikimedia.org/r/834553 (https://phabricator.wikimedia.org/T313915)
[06:48:22] RECOVERY - Disk space on matomo1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=matomo1002&var-datasource=eqiad+prometheus/ops
[06:49:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:50:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:39] (CR) Elukey: [V: +2 C: +2] knative: backport patch to tune pod DNS settings from version 1.5 [docker-images/production-images] - https://gerrit.wikimedia.org/r/834553
(https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[07:00:04] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T0700).
[07:00:05] Jhs and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:18] <_joe_> o/
[07:00:29] o/
[07:00:34] hah, i completely forgot. but i'm here
[07:00:44] i can deploy today!
[07:00:55] (PS2) Jon Harald Søby: Add wordmark and tagline for mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834653 (https://phabricator.wikimedia.org/T318478)
[07:01:06] <_joe_> urbanecm: ping me when you're done with Jhs's patch, I can do my stuff ofc
[07:01:12] _joe_: sure thing
[07:01:16] (CR) Urbanecm: [C: +2] Add wordmark and tagline for mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834653 (https://phabricator.wikimedia.org/T318478) (owner: Jon Harald Søby)
[07:01:55] <_joe_> oh I see we now have scap backport! cool!
[07:01:58] (Merged) jenkins-bot: Add wordmark and tagline for mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834653 (https://phabricator.wikimedia.org/T318478) (owner: Jon Harald Søby)
[07:02:18] Jhs: pulled to mwdebug1001
[07:02:26] <_joe_> kudos jnuche dancy
[07:02:49] * urbanecm admits he continues to do backports in the old way
[07:02:51] urbanecm, confirmed 👍
[07:02:55] Jhs: great, syncing
[07:04:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:05:11] urbanecm, is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/831203 something that should be done in a window like this too?
(Like maybe this one, hehe)
[07:07:23] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: 81f66621e923cd2ee3aac6f8b5be0ba2e85fb51d: Add wordmark and tagline for mnwiki (T318478; 1/2) (duration: 03m 40s)
[07:07:26] T318478: Deploy new translated logos in Mongolian to vector-2022 and mn.m.wikipedia.org - https://phabricator.wikimedia.org/T318478
[07:08:25] Jhs: yeah :)
[07:08:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:08:52] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:08:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:09:05] urbanecm, aight, i'll add it to the schedule
[07:11:09] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 81f66621e923cd2ee3aac6f8b5be0ba2e85fb51d: Add wordmark and tagline for mnwiki (T318478) (duration: 03m 46s)
[07:11:29] (PS1) Elukey: knative: build new images for the net-istio namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/835068 (https://phabricator.wikimedia.org/T313915)
[07:11:31] And, first patch deployed
[07:12:06] (PS3) Urbanecm: Add ami, bjn, blk, dag, guw, ig, kcg, lmo, pcm, pwn, and shi to InterwikiSortOrders [mediawiki-config] - https://gerrit.wikimedia.org/r/831203 (owner: Jon Harald Søby)
[07:12:09] (CR) Urbanecm: [C: +2] Add ami, bjn, blk, dag, guw, ig, kcg, lmo, pcm, pwn, and shi to InterwikiSortOrders [mediawiki-config] - https://gerrit.wikimedia.org/r/831203 (owner: Jon Harald Søby)
[07:12:54] (Merged) jenkins-bot: Add ami, bjn, blk, dag, guw, ig, kcg, lmo, pcm, pwn, and shi to InterwikiSortOrders [mediawiki-config] - https://gerrit.wikimedia.org/r/831203 (owner: Jon Harald Søby)
[07:12:58] !log mwdebug-deploy@deploy1002 helmfile
[codfw] DONE helmfile.d/services/mwdebug: apply
[07:18:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:18:20] Jhs: pulled to mwdebug1001 if you can test this
[07:19:11] urbanecm, 👍 confirmed
[07:19:56] Jhs: great, syncing
[07:20:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:20:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:22:48] (PS6) Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736)
[07:23:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:23:58] !log urbanecm@deploy1002 Synchronized wmf-config/InterwikiSortOrders.php: 620bb80e3534c812d7f4de25547d92104b8609a0: Add ami, bjn, blk, dag, guw, ig, kcg, lmo, pcm, pwn, and shi to InterwikiSortOrders (duration: 03m 40s)
[07:24:04] Jhs: and live!
[07:24:10] _joe_: i believe you can go ahead now :)
[07:24:28] <_joe_> yeah waiting for CI
[07:24:35] (CR) Giuseppe Lavagetto: [C: +2] Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:24:38] (y)
[07:25:25] (Merged) jenkins-bot: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:25:53] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:26:08] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:823681|Move 100% of cookie-accepting clients to php 7.4 (T271736)]]
[07:26:12] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736
[07:26:31] !log oblivian@deploy1002 oblivian and oblivian: Backport for [[gerrit:823681|Move 100% of cookie-accepting clients to php 7.4 (T271736)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:26:35] <_joe_> urbanecm: do you use scap backport?
[07:26:46] <_joe_> the output was a bit strange tbh
[07:26:50] nope, i do it old-school
[07:26:55] i've used it a few times
[07:28:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:29:20] (CR) Filippo Giunchedi: [C: +1] hieradata: remove ms-be20[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/833007 (https://phabricator.wikimedia.org/T294549) (owner: MVernon)
[07:29:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:29:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:29:59] (PS1) Elukey: admin_ng: update knative serving images for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835070 (https://phabricator.wikimedia.org/T313915)
[07:30:39] (CR) Elukey: [V: +2 C: +2] knative: build new images for the net-istio namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/835068 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[07:30:41] (CR) Filippo Giunchedi: [C: +1] Prometheus: Remove ATS gauge periods [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[07:31:18] (CR) Filippo Giunchedi: [C: -1] "Actually we should be changing the metric name in HELP and TYPE too.
I recommend introducing another variable or re-using the same after t" [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[07:31:40] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:823681|Move 100% of cookie-accepting clients to php 7.4 (T271736)]] (duration: 05m 31s)
[07:31:44] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736
[07:31:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:33:08] (PS2) Elukey: admin_ng: update knative serving images for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835070 (https://phabricator.wikimedia.org/T313915)
[07:34:16] (CR) Filippo Giunchedi: [C: -1] "At the very least you'll have to force ipv4 and use a proxy, similarly to profile::wikifunctions::beta" [puppet] - https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: Samtar)
[07:34:55] (CR) Filippo Giunchedi: [C: +1] swift: remove ms-be10[28-39] from the rings [puppet] - https://gerrit.wikimedia.org/r/834503 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[07:39:27] (CR) Klausman: [C: +1] admin_ng: update knative serving images for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835070 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[07:42:54] (CR) Klausman: [C: +1] knative: backport patch to tune pod DNS settings from version 1.5 [docker-images/production-images] - https://gerrit.wikimedia.org/r/834553 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[07:42:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:52:18]
PROBLEM - Disk space on prometheus2006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/analytics 385 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops
[07:52:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:01:45] (PS8) Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595)
[08:04:16] !log add 20G to prometheus/analytics in codfw
[08:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:35] !log upgrade grafana to 8.5.13
[08:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:24] (CR) MVernon: [C: +2] hieradata: remove ms-be20[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/833007 (https://phabricator.wikimedia.org/T294549) (owner: MVernon)
[08:11:39] (CR) MVernon: [C: +2] swift: remove ms-be10[28-39] from the rings [puppet] - https://gerrit.wikimedia.org/r/834503 (https://phabricator.wikimedia.org/T294550) (owner: MVernon)
[08:13:36] RECOVERY - Disk space on prometheus2006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus2006&var-datasource=codfw+prometheus/ops
[08:15:55] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency -
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:34:07] (CR) Elukey: [C: +2] admin_ng: update knative serving images for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835070 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[08:35:20] (PS1) Urbanecm: arwiki: Grant enrollasmentor to editor [mediawiki-config] - https://gerrit.wikimedia.org/r/835081 (https://phabricator.wikimedia.org/T310905)
[08:35:41] (PS1) Jelto: gitlab: fix ssh listen address for gitlab test instance [puppet] - https://gerrit.wikimedia.org/r/835082 (https://phabricator.wikimedia.org/T297411)
[08:38:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:38:25] (CR) Urbanecm: [C: +2] arwiki: Grant enrollasmentor to editor [mediawiki-config] - https://gerrit.wikimedia.org/r/835081 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[08:39:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:40:14] (Merged) jenkins-bot: arwiki: Grant enrollasmentor to editor [mediawiki-config] - https://gerrit.wikimedia.org/r/835081 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[08:42:08] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:36] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37349/console" [puppet] - https://gerrit.wikimedia.org/r/835082 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[08:44:02] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2057 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:47:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0a5486780a0543d7fb1c637d2abe48855e753d13: arwiki: Grant enrollasmentor to editor (T310905) (duration: 03m 40s)
[08:47:12] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[08:47:15] i/me done
[08:47:17] * urbanecm done
[08:47:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:48:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:48:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:49:33] ...not really
[08:49:39] (PS1) Urbanecm: arwiki: Properly grant enrollasmentor to editor [mediawiki-config] - https://gerrit.wikimedia.org/r/835083 (https://phabricator.wikimedia.org/T310905)
[08:49:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:49:55] (CR) Urbanecm: [C: +2] arwiki: Properly grant enrollasmentor to editor [mediawiki-config] -
https://gerrit.wikimedia.org/r/835083 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[08:51:35] (Merged) jenkins-bot: arwiki: Properly grant enrollasmentor to editor [mediawiki-config] - https://gerrit.wikimedia.org/r/835083 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[08:53:44] (CR) Jelto: [V: +1 C: +2] gitlab: fix ssh listen address for gitlab test instance [puppet] - https://gerrit.wikimedia.org/r/835082 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[08:54:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:55:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:55:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:56:26] !log adding 80GB of virtual disk to matomo1002
[08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:58:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 033ab75917932a6b6e1cda8cc26f5f069448e3b9: arwiki: Properly grant enrollasmentor to editor (T310905) (duration: 03m 46s)
[08:58:37] T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[09:00:30] (CR) Clément Goubert: [C: +1] mediawiki::api: fix kernel parameter name ip_local_port_range (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: Dzahn)
[09:05:55] (CR) Jcrespo: "I'm sorry, but I have absolutely no context or understanding of this patch, I've never handled beta, including its disks.
I wonder if some" [puppet] - https://gerrit.wikimedia.org/r/833130 (owner: Zabe)
[09:06:02] (PS1) Elukey: knative-serving: add support for config-features [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915)
[09:07:02] (PS1) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: run black -l100 [puppet] - https://gerrit.wikimedia.org/r/835087
[09:07:04] (PS1) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047)
[09:08:40] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:48] (PS2) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047)
[09:10:04] (CR) CI reject: [V: -1] wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047) (owner: Arturo Borrero Gonzalez)
[09:10:57] (PS2) Elukey: knative-serving: add support for config-features [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915)
[09:11:25] (CR) CI reject: [V: -1] wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047) (owner: Arturo Borrero Gonzalez)
[09:11:44] (PS3) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047)
[09:11:59] (CR) CI reject: [V: -1] knative-serving: add support for config-features [deployment-charts] -
https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915) (owner: Elukey)
[09:14:02] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/835087 (owner: Arturo Borrero Gonzalez)
[09:15:26] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2057 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:16:45] (CR) David Caro: "LGTM, once it passes the tests" [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047) (owner: Arturo Borrero Gonzalez)
[09:17:13] (PS1) Jelto: gitlab: set ssh listen address for gitlab test instance [puppet] - https://gerrit.wikimedia.org/r/835089 (https://phabricator.wikimedia.org/T297411)
[09:17:55] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: maintain_dbusers: run black -l100 [puppet] - https://gerrit.wikimedia.org/r/835087 (owner: Arturo Borrero Gonzalez)
[09:20:10] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37353/console" [puppet] - https://gerrit.wikimedia.org/r/835089 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[09:25:38] (PS3) Elukey: knative-serving: add support for config-features [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915)
[09:25:40] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) (owner: JMeybohm)
[09:26:14] (CR) Jelto: [V: +1 C: +2] gitlab: set ssh listen address for gitlab test instance [puppet] - https://gerrit.wikimedia.org/r/835089 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[09:39:17] !log btullis@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM matomo1002.eqiad.wmnet
[09:39:47] (CR) Btullis: [C: +2] Deploy Spark 3 on the whole
production cluster [puppet] - https://gerrit.wikimedia.org/r/834500 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[09:41:34] (PS1) Elukey: admin_ng: enable dnsConfig option for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835093 (https://phabricator.wikimedia.org/T313915)
[09:41:42] (PS4) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047)
[09:42:16] (CR) Arturo Borrero Gonzalez: wmcs: maintain_dbusers: don't halt account population loop on errors (2 comments) [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047) (owner: Arturo Borrero Gonzalez)
[09:44:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[09:44:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[09:44:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[09:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34903 and previous config saved to /var/cache/conftool/dbconfig/20220926-094502-ladsgroup.json
[09:45:06] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:45:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[09:45:54] PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@matomo1002.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL
server on matomo1002.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:47:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:48:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34904 and previous config saved to /var/cache/conftool/dbconfig/20220926-094812-ladsgroup.json [09:48:16] (CR) Klausman: knative-serving: add support for config-features (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [09:53:52] (CR) Ladsgroup: snapshot: Add linktarget (1 comment) [puppet] - https://gerrit.wikimedia.org/r/822631 (https://phabricator.wikimedia.org/T315063) (owner: Ladsgroup) [10:00:34] RECOVERY - MariaDB Replica IO: matomo on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:01:16] (CR) Elukey: knative-serving: add support for config-features (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [10:01:36] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:32] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM matomo1002.eqiad.wmnet [10:09:33] (CR) 
Klausman: [C: +1] knative-serving: add support for config-features (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [10:11:48] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: maintain_dbusers: don't halt account population loop on errors [puppet] - https://gerrit.wikimedia.org/r/835088 (https://phabricator.wikimedia.org/T318047) (owner: Arturo Borrero Gonzalez) [10:15:16] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:17:08] ACKNOWLEDGEMENT - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1108) taken on 2022-09-20 02:57:27 is 551 MiB, but the previous one was 231 MiB, a change of +138.8 % Btullis This increase is due to T315613 which indicates an increase in usage. Its a good thing! https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:19:50] (CR) Elukey: [C: +2] knative-serving: add support for config-features [deployment-charts] - https://gerrit.wikimedia.org/r/835086 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [10:20:04] (CR) Elukey: [C: +2] admin_ng: enable dnsConfig option for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/835093 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [10:22:55] SRE, RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (Aklapper) a:hnowlan→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022... [10:24:46] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:25:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:30:11] SRE, WMF-Legal, SEO: (Automate) adding wikinews language versions to the Google Publisher Center / Google News - https://phabricator.wikimedia.org/T254437 (Aklapper) a:ovasileva→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the... [10:30:59] SRE, User-MoritzMuehlenhoff, User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (Aklapper) a:MoritzMuehlenhoff→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to t... [10:31:29] SRE, User-MoritzMuehlenhoff: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (Aklapper) a:MoritzMuehlenhoff→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 20... [10:32:07] SRE, SRE-OnFire, Observability-Metrics: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (Aklapper) a:CDanis→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on Au... [10:32:15] SRE: puppet-merge lockout/tagout - https://phabricator.wikimedia.org/T248872 (Aklapper) a:CDanis→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022. Please assign this task to yourself... [10:32:29] SRE, Traffic-Icebox: Track WMF owned non-canonical domains - https://phabricator.wikimedia.org/T247618 (Aklapper) a:Vgutierrez→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022. P... 
[10:34:21] SRE, User-MoritzMuehlenhoff: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (Aklapper) a:MoritzMuehlenhoff→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See... [10:34:41] SRE, DC-Ops: determine/process/document bios firmware tracking/updating policies - https://phabricator.wikimedia.org/T141128 (Aklapper) a:wiki_willy→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee o... [10:34:53] SRE, Analytics-Radar, Privacy Engineering, Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (Aklapper) a:ssingh→None Removing task assignee due to inactivity as this open task has been assigned for mo... [10:36:57] SRE, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (Aklapper) a:herron→None Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email... 
[10:43:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:43:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:46:26] (PS1) MVernon: hieradata: remove ms-be10[28-39] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/835106 (https://phabricator.wikimedia.org/T294550) [10:46:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:51:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:53:51] (PS1) Jbond: sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 [10:57:01] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 (owner: Jbond) [11:02:32] (CR) Btullis: [C: +2] Failback hive to the primary server [dns] - https://gerrit.wikimedia.org/r/832294 (owner: Btullis) [11:04:05] SRE-swift-storage: Decom ms-be20[28-39] - https://phabricator.wikimedia.org/T294549 (MatthewVernon) [11:08:09] (PS2) Jbond: sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 [11:08:16] (Abandoned) Daniel Kinzler: Demo: load a config variable from JSON file in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/739336 (owner: Ppchelko) [11:11:29] (CR) CI reject: [V: -1] 
sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 (owner: Jbond) [11:12:25] (PS1) Ladsgroup: mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 [11:14:54] jouncebot: nowandnext [11:14:54] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [11:14:54] In 1 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1300) [11:15:02] awesome [11:15:56] (CR) CI reject: [V: -1] mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 (owner: Ladsgroup) [11:18:20] (CR) Pikne: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/835127 (https://phabricator.wikimedia.org/T318530) (owner: Pikne) [11:23:01] (PS2) Ladsgroup: mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 [11:24:46] (CR) CI reject: [V: -1] mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 (owner: Ladsgroup) [11:25:54] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Resolved→Open @Jclark-ctr the host went down again. Now DIMM A6: ` ------------------------------------------------------------------------------- Record: 85 Date/Time: 09/26/2022 07:13:3... 
[11:26:21] (PS1) Muehlenhoff: Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/835149 [11:29:06] (CR) Muehlenhoff: [C: +2] Extend access for sannita [puppet] - https://gerrit.wikimedia.org/r/835149 (owner: Muehlenhoff) [11:29:22] (PS3) Ladsgroup: mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 [11:32:04] (PS1) Ladsgroup: wwwportals: Make sure portal assets are also visible in wikiversity vhost [puppet] - https://gerrit.wikimedia.org/r/835151 (https://phabricator.wikimedia.org/T273179) [11:32:52] (CR) Ladsgroup: [V: +2 C: +2] wwwportals: Make sure portal assets are also visible in wikiversity vhost [puppet] - https://gerrit.wikimedia.org/r/835151 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup) [11:47:14] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (taavi) [11:49:32] (CR) Filippo Giunchedi: [C: +1] "Neato!" 
[alerts] - https://gerrit.wikimedia.org/r/835117 (owner: Ladsgroup) [11:57:42] (CR) Ladsgroup: [C: +2] mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 (owner: Ladsgroup) [11:59:20] (Merged) jenkins-bot: mysql-replication-lag: Only alert on core dbs [alerts] - https://gerrit.wikimedia.org/r/835117 (owner: Ladsgroup) [12:10:04] (PS1) Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 [12:11:41] (PS3) Jbond: sre.hardware.upgrade-firmware: drop firmware-file flag [cookbooks] - https://gerrit.wikimedia.org/r/835108 [12:14:15] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond) [12:14:34] (PS2) Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 [12:15:14] (PS1) Jon Harald Søby: Set default sortkey for prefixed pages [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835130 (https://phabricator.wikimedia.org/T315551) [12:15:38] SRE, serviceops: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff) [12:15:45] SRE, serviceops: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff) p:Triage→Medium [12:18:03] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond) [12:20:44] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (Papaul) @Volans if you you have time to check this today that will be great so i can proceed with the task. 
@Marostegui thanks for the partman recipe thanks [12:23:35] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/835108 (owner: Jbond) [12:24:42] (PS3) Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 [12:25:47] !log installing unzip security updates [12:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:10] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond) [12:28:55] (PS1) Jelto: gitlab_runner: enable unprivileged_userns_clone in WMCS [puppet] - https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) [12:38:06] jouncebot: nowandnext [12:38:06] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:06] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1300) [12:38:08] (PS1) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [12:38:11] cool [12:41:31] (PS3) Vlad.shapik: Update the logic to run test coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [12:41:56] (PS1) Ladsgroup: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/835169 (https://phabricator.wikimedia.org/T273179) [12:42:18] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [12:42:48] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/835169 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup) [12:44:39] 
(Merged) jenkins-bot: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/835169 (https://phabricator.wikimedia.org/T273179) (owner: Ladsgroup) [12:44:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:835169|Bump portals to HEAD (T273179)]] [12:45:00] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [12:45:18] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:835169|Bump portals to HEAD (T273179)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [12:51:01] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:835169|Bump portals to HEAD (T273179)]] (duration: 06m 05s) [12:51:06] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [12:51:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:51:34] !log installing bind9 security updates on Bullseye [12:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:14] (PS4) Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - https://gerrit.wikimedia.org/r/835157 [12:52:18] (PS2) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [12:53:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:53:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:54:48] (PS3) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [12:54:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:56:27] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 
https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [12:56:33] !log awight@deploy1002 Started deploy [kartotherian/deploy@d1bd7dc]: Enable geopoints on production [12:57:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:58:20] (CR) CI reject: [V: -1] sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [12:59:13] !log awight@deploy1002 Finished deploy [kartotherian/deploy@d1bd7dc]: Enable geopoints on production (duration: 02m 40s) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1300). [13:00:04] Pikne and Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:32] o/ [13:00:37] I can deploy! 
[13:00:51] 👋 [13:02:04] brb [13:02:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:05:08] (PS4) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [13:12:11] (PS2) JMeybohm: Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) [13:12:45] (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/835127 (https://phabricator.wikimedia.org/T318530) (owner: Pikne) [13:13:32] (CR) CI reject: [V: -1] Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) (owner: JMeybohm) [13:15:45] (CR) Andrew Bogott: [C: +2] cloudbackups: run nfs backups from labstore1004 rather than 1005 [puppet] - https://gerrit.wikimedia.org/r/834720 (https://phabricator.wikimedia.org/T317643) (owner: Andrew Bogott) [13:21:28] alright, I’m back [13:21:43] Pikne: are you around for the backport+config window? [13:21:53] Yup [13:22:01] (CR) Lucas Werkmeister (WMDE): [C: +2] "Maintenance script run scheduled for tomorrow." 
[extensions/WikimediaIncubator] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835130 (https://phabricator.wikimedia.org/T315551) (owner: Jon Harald Søby) [13:22:04] ok [13:23:07] (PS3) Lucas Werkmeister (WMDE): Enable wgCiteResponsiveReferences on etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835127 (https://phabricator.wikimedia.org/T318530) (owner: Pikne) [13:23:23] (CR) Lucas Werkmeister (WMDE): [C: +2] Enable wgCiteResponsiveReferences on etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835127 (https://phabricator.wikimedia.org/T318530) (owner: Pikne) [13:23:26] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:24:20] (Merged) jenkins-bot: Set default sortkey for prefixed pages [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.2) - https://gerrit.wikimedia.org/r/835130 (https://phabricator.wikimedia.org/T315551) (owner: Jon Harald Søby) [13:24:41] let’s do the config change first and then the incubator backport [13:24:55] it should be merged in a few seconds too [13:24:56] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/835127 (https://phabricator.wikimedia.org/T318530) (owner: Pikne) [13:24:59] tada [13:25:24] Pikne: the etwiki change is on mwdebug1001, can you test it? [13:26:16] Yes, works as intended. [13:26:24] ok, thanks! [13:26:48] syncing [13:28:46] Jhs: while we wait for the sync to finish – do you know how to test changes on mwdebug? 
[13:28:50] (PS3) JMeybohm: Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) [13:28:53] Lucas_WMDE, yup [13:28:56] ok [13:29:24] Lucas_WMDE, this is my third deploy today, haha XD [13:29:31] :D [13:29:51] I should’ve scrolled up ^^ [13:30:10] (CR) CI reject: [V: -1] Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) (owner: JMeybohm) [13:30:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:39] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:835127|Enable wgCiteResponsiveReferences on etwiki (T318530)]] (duration: 03m 53s) [13:30:42] T318530: Enable wgCiteResponsiveReferences on etwiki - https://phabricator.wikimedia.org/T318530 [13:31:15] Jhs: okay, the incubator change should be on mwdebug1001, please test [13:31:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:32:09] Lucas_WMDE, it works (i refreshed a category, and all pages were immediately sorted correctly – wasn't expecting that!) [13:32:22] (PS4) JMeybohm: Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) [13:32:27] \o/ [13:32:37] interesting, I wouldn’t expect that eithe [13:32:38] *either [13:32:41] but let’s sync [13:32:53] WikimediaIncubator.php first, then extension.json, then there shouldn’t be any errors I think [13:33:04] Lucas_WMDE: thanks! 
[13:33:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:34:17] (PS1) Elukey: kserve-inference: add ndots dnsConfig Pod spec to all isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [13:35:56] (PS5) JMeybohm: Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) [13:35:58] (CR) CI reject: [V: -1] kserve-inference: add ndots dnsConfig Pod spec to all isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [13:37:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/WikimediaIncubator/includes/WikimediaIncubator.php: Backport: [[gerrit:835130|Set default sortkey for prefixed pages (T315551)]] (1/2) (duration: 03m 51s) [13:37:21] T315551: Sort pages in the Wikimedia Incubator by real page title (without prefix) - https://phabricator.wikimedia.org/T315551 [13:37:34] syncing extension.json now [13:38:03] (CR) JMeybohm: [C: +2] Add tanuja to ldap_only_users for wmde access [puppet] - https://gerrit.wikimedia.org/r/834512 (https://phabricator.wikimedia.org/T317613) (owner: JMeybohm) [13:38:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:39:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:40:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:41:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.2/extensions/WikimediaIncubator/extension.json: Backport: [[gerrit:835130|Set default sortkey for prefixed pages (T315551)]] (2/2) (duration: 03m 39s) [13:42:17] SRE, LDAP-Access-Requests, 
Patch-For-Review: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (JMeybohm) In progress→Resolved Added `tanuja` to `wmde` LDAP group as well as #wmf-nda [13:42:42] SRE, Data Engineering Planning, Data Pipelines, Foundational Technology Requests, User-fgiunchedi: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (EChetty) [13:43:00] ooh yup now updateCollation.php --dry-run reports a lot more pages to fix :) [13:44:13] yeah, i noticed one place it has a visible side-effect – on categories with pagination (200+ pages). should be fine until tomorrow though :) [13:44:44] except for maintenance categories, there aren't many categories like that anyways [13:44:50] ok [13:45:43] !log UTC afternoon backport+config window done [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:16] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:46:58] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@a69b031]: Make Airflow jobs use Spark 3 on anlytics [airflow-dags@a69b031] [13:47:08] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@a69b031]: Make Airflow jobs use Spark 3 on anlytics [airflow-dags@a69b031] (duration: 00m 10s) [13:51:14] (PS2) Elukey: kserve-inference: add ndots dnsConfig Pod spec to all isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [13:52:13] (CR) CI reject: [V: -1] kserve-inference: add ndots dnsConfig Pod spec to all isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: Elukey) [13:56:58] !log installing mako security updates [13:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:41] (PS2) Andrew Bogott: 
Dumps: stop mounting the old labstore100x servers on VMs [puppet] - https://gerrit.wikimedia.org/r/828103 (https://phabricator.wikimedia.org/T309346) [13:57:43] (PS1) Andrew Bogott: Dumps: switch to using clouddumps hosts rather than the old labstores. [puppet] - https://gerrit.wikimedia.org/r/835192 (https://phabricator.wikimedia.org/T309346) [13:59:25] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@a69b031]: Make Airflow jobs use Spark 3 on anlytics_test [airflow-dags@a69b031] [13:59:34] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@a69b031]: Make Airflow jobs use Spark 3 on anlytics_test [airflow-dags@a69b031] (duration: 00m 09s) [14:00:30] (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond) [14:00:37] (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [14:05:08] (PS1) Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/835195 [14:07:00] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/835162 (https://phabricator.wikimedia.org/T307810) (owner: Jelto) [14:13:41] (CR) Volans: [C: +1] "LGTM, couple of optional nits inline" [cookbooks] - https://gerrit.wikimedia.org/r/835157 (owner: Jbond) [14:16:35] (CR) Giuseppe Lavagetto: [C: +2] vopsbot: join #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/825346 (owner: Clément Goubert) [14:19:30] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:22:16] (CR) Volans: [C: +1] "One typo and a possible alternative approach inline. LGTM otherwise. Apart the typo all comments are optional." 
[cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [14:26:11] <_joe_> (this will be changed soon) [14:39:42] (CR) JMeybohm: [C: -1] Add golang 1.18 image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah) [14:41:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:26] (PS2) Majavah: Add golang 1.18 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 [14:41:51] (CR) Majavah: Add golang 1.18 image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah) [14:44:23] (PS3) Elukey: kserve-inference: add ndots dnsConfig Pod spec to all isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [14:46:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:39] (CR) JMeybohm: Add golang 1.18 image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah) [14:47:14] (PS3) Majavah: Add golang 1.18 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 [14:48:06] (CR) Majavah: Add golang 1.18 image (1 
comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/833792 (owner: Majavah) [14:48:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [14:49:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [14:49:01] (CR) Hashar: [C: +1] aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:49:56] (PS2) Filippo Giunchedi: customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) [14:49:59] (CR) Majavah: [C: -1] "modules/aptrepo/files/distributions-wikimedia also needs to be updated" [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:52:10] (CR) Hashar: [C: +1] "Note contint* machines (which run the service-pipeline jobs) are on Buster and currently have docker-ce 5:20.10.12~3-0~debian-buster" [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [14:53:37] (PS3) Sbailey: Enable Linter write of namespace tag and template fields on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) [14:54:37] (PS1) Giuseppe Lavagetto: services_proxy: add a keepalive timeout for image-suggestion [puppet] - https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) [14:54:45] (CR) Filippo Giunchedi: "Thank you for the feedback and apologies for the long hiatus on this, please see inline." 
[software/netbox-extras] - https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi) [14:56:29] (CR) JMeybohm: [C: +1] services_proxy: add a keepalive timeout for image-suggestion [puppet] - https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) (owner: Giuseppe Lavagetto) [14:57:02] (PS1) DLynch: Disable MobileFrontend default editor a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/835206 (https://phabricator.wikimedia.org/T302356) [14:57:24] (PS4) Sbailey: Enable Linter write of namespace tag and template fields on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) [14:58:51] (CR) Giuseppe Lavagetto: [C: +2] services_proxy: add a keepalive timeout for image-suggestion [puppet] - https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) (owner: Giuseppe Lavagetto) [15:00:40] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:26] (PS5) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [15:02:28] (PS1) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [15:06:09] (PS4) Elukey: kserve-inference: refactor rendering of isvc configs and add dnsConfig [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [15:07:39] (PS5) Elukey: kserve-inference: refactor rendering of isvc configs and add dnsConfig [deployment-charts] - https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [15:11:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:07] (03CR) 10CI reject: [V: 04-1] kserve-inference: refactor rendering of isvc configs and add dnsConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [15:13:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.498 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:14:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48683 bytes in 1.407 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:07] (03PS1) 10Majavah: P:toolforge::prometheus: disable k8s label map [puppet] - 10https://gerrit.wikimedia.org/r/835213 [15:19:02] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [15:20:46] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:40] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: disable k8s label map [puppet] - 10https://gerrit.wikimedia.org/r/835213 (owner: 10Majavah) [15:25:45] (03CR) 10Kosta Harlan: "thanks! Is this automatically deployed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/835205 (https://phabricator.wikimedia.org/T313973) (owner: 10Giuseppe Lavagetto) [15:26:26] (03PS4) 10Vlad.shapik: Update the logic to run test coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [15:26:28] (03PS6) 10Elukey: kserve-inference: refactor rendering of isvc configs and add dnsConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) [15:30:04] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1530). nyaa~ [15:32:55] jouncebot: nowandnext [15:32:55] For the next 0 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1530) [15:32:55] In 1 hour(s) and 27 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1700) [15:34:09] (03CR) 10Filippo Giunchedi: [C: 03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/835195 (owner: 10Muehlenhoff) [15:36:18] 10SRE, 10Wikimedia-Mailing-lists: Archive coolest-tool-academy mailing list - https://phabricator.wikimedia.org/T317185 (10Aklapper) 05Open→03Declined Big thanks Legoktm for making us look again into this. Declining this request. There may be small logistical advantages (e.g. adding one group email addres... 
[15:36:24] (03CR) 10Klausman: kserve-inference: refactor rendering of isvc configs and add dnsConfig (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [15:37:20] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Migrate wikiversity.org to the modern portals (duration: 03m 49s) [15:37:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10Papaul) [15:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Maint Done', diff saved to https://phabricator.wikimedia.org/P34906 and previous config saved to /var/cache/conftool/dbconfig/20220926-153807-ladsgroup.json [15:38:49] (03CR) 10Elukey: kserve-inference: refactor rendering of isvc configs and add dnsConfig (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [15:40:57] !log ladsgroup@deploy1002 Synchronized portals: Migrate wikiversity.org to the modern portals (duration: 03m 36s) [15:43:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10RobH) p:05Medium→03High @bblack, Now that we're back from our SRE summit, I'd like to pick this back up! We have the eqsin shipment pending, so getting this tested is now urgent. Can you... 
[15:43:55] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [15:44:18] (03CR) 10Klausman: [C: 03+1] kserve-inference: refactor rendering of isvc configs and add dnsConfig (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [15:44:45] (03CR) 10Elukey: [C: 03+2] kserve-inference: refactor rendering of isvc configs and add dnsConfig [deployment-charts] - 10https://gerrit.wikimedia.org/r/835186 (https://phabricator.wikimedia.org/T313915) (owner: 10Elukey) [15:47:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [15:51:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:52:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:52:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Maint Done', diff saved to https://phabricator.wikimedia.org/P34907 and previous config saved to /var/cache/conftool/dbconfig/20220926-155312-ladsgroup.json [15:53:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:54:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[15:55:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:57:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:57:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:58:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:01:41] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10Papaul) [16:03:43] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:04:06] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:07:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[16:07:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Maint Done', diff saved to https://phabricator.wikimedia.org/P34908 and previous config saved to /var/cache/conftool/dbconfig/20220926-160817-ladsgroup.json [16:10:03] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [16:10:22] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:14:30] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:14:44] (03PS2) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [16:14:46] (03PS3) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [16:14:48] (03PS3) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [16:15:28] 
(KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:43] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:44] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:15:58] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [16:16:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34909 and previous config saved to /var/cache/conftool/dbconfig/20220926-161632-ladsgroup.json [16:16:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [16:16:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:17:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:20:28] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:22:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:22:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:23:07] (03CR) 10Majavah: [C: 04-1] aptrepo: add docker packages to thirdparty/ci for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [16:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Maint Done', diff saved to https://phabricator.wikimedia.org/P34910 and previous config saved to /var/cache/conftool/dbconfig/20220926-162322-ladsgroup.json [16:25:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:25:35] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host db2184.mgmt.codfw.wmnet with reboot policy FORCED [16:26:36] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2184.mgmt.codfw.wmnet with reboot policy 
FORCED [16:31:37] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 (owner: 10Jbond) [16:31:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P34911 and previous config saved to /var/cache/conftool/dbconfig/20220926-163138-ladsgroup.json [16:32:12] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: Present user with a list of current files [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 [16:34:52] (03PS6) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [16:35:05] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [16:35:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2183.mgmt.codfw.wmnet with reboot policy FORCED [16:35:37] 10SRE-OnFire, 10Discovery-Search, 10Observability-Alerting, 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10bking) [16:35:50] 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10bking) [16:37:20] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:37:33] (03PS1) 10Zabe: deployment-prep: use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/835234 [16:43:22] (03PS3) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [16:43:24] (03PS4) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 
https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [16:43:26] (PS4) Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [16:45:12] (CR) Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) (owner: Dduvall) [16:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P34912 and previous config saved to /var/cache/conftool/dbconfig/20220926-164645-ladsgroup.json [16:52:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2184.mgmt.codfw.wmnet with reboot policy FORCED [16:53:55] (PS7) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [16:54:35] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [16:55:18] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2031 [16:55:35] (PS4) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [16:55:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2031 [16:56:04] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2032 [16:56:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2032 [16:57:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:00:05] ryankemper: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T1700).
Please do the needful. [17:00:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34913 and previous config saved to /var/cache/conftool/dbconfig/20220926-170151-ladsgroup.json [17:01:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:01:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:02:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [17:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34914 and previous config saved to /var/cache/conftool/dbconfig/20220926-170213-ladsgroup.json [17:05:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2184.mgmt.codfw.wmnet with reboot policy FORCED [17:07:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2031.mgmt.codfw.wmnet with reboot policy FORCED [17:07:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:07:52] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2036 [17:08:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logstash2036 [17:08:50] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logstash2037 [17:09:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2031.mgmt.codfw.wmnet with reboot policy FORCED [17:10:08] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host logstash2037 [17:10:25] (03CR) 10Ladsgroup: "Generally looks good to me, it'd be better for someone from the team to also review this before merging it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [17:15:11] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2183'] [17:15:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2183'] [17:16:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2184'] [17:16:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2184'] [17:17:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10Papaul) [17:26:33] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2036.mgmt.codfw.wmnet with reboot policy FORCED [17:27:13] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash2036.mgmt.codfw.wmnet with reboot policy FORCED [17:27:42] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [17:28:28] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logstash2037.mgmt.codfw.wmnet with reboot policy FORCED [17:29:19] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2031.mgmt.codfw.wmnet with reboot policy FORCED [17:29:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host logstash2036.mgmt.codfw.wmnet with reboot policy FORCED [17:30:14] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2031.mgmt.codfw.wmnet with reboot policy 
FORCED [17:30:35] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2032.mgmt.codfw.wmnet with reboot policy FORCED [17:31:11] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2032.mgmt.codfw.wmnet with reboot policy FORCED [17:35:52] (03PS1) 10Papaul: Add db218[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835241 (https://phabricator.wikimedia.org/T313979) [17:36:41] (03CR) 10CI reject: [V: 04-1] Add db218[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835241 (https://phabricator.wikimedia.org/T313979) (owner: 10Papaul) [17:36:59] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/835157 (owner: 10Jbond) [17:37:01] (03PS5) 10Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - 10https://gerrit.wikimedia.org/r/835212 [17:39:20] (03CR) 10Sbailey: "Hi can someone take a quick look at this patch to enable Linter write to 3 new fields, dark launched code, so it runs on the labs cloud. 
" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [17:39:55] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Krinkle) [17:40:17] (03PS2) 10Papaul: Add db218[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835241 (https://phabricator.wikimedia.org/T313979) [17:40:31] Jouncebot is down [17:41:38] (03CR) 10Papaul: [C: 03+2] Add db218[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835241 (https://phabricator.wikimedia.org/T313979) (owner: 10Papaul) [17:42:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2183.codfw.wmnet with OS bullseye [17:42:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2183.codfw.wmnet with OS bullseye [17:42:40] (03CR) 10Volans: [C: 03+1] "Did a quick pass and LGTM, I can do a full pass tomorrow." 
[cookbooks] - https://gerrit.wikimedia.org/r/835168 (owner: Jbond) [17:44:00] (PS8) Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers [cookbooks] - https://gerrit.wikimedia.org/r/835168 [17:44:01] (PS6) Jbond: sre.hardware.upgrade-firmware: Add support for driver updates [cookbooks] - https://gerrit.wikimedia.org/r/835212 [17:53:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logstash2036.mgmt.codfw.wmnet with reboot policy FORCED [17:56:59] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (Papaul) [17:57:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2031.mgmt.codfw.wmnet with reboot policy FORCED [18:03:25] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (Papaul) p:Medium→Lowest [18:06:05] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul) a:Papaul→None [18:06:17] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul) [18:10:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2183.codfw.wmnet with reason: host reimage [18:13:50] PROBLEM - MegaRAID on an-worker1085 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:13:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2183.codfw.wmnet with reason: host reimage [18:14:01] (PS9) Jbond: sre.hardware.upgrade-firmware: add a cache for
firmware answers [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 [18:14:52] (03PS1) 10Bartosz Dziewoński: wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835245 (https://phabricator.wikimedia.org/T317070) [18:15:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10Papaul) [18:16:07] (03PS1) 10Jdlrobson: Web team config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835246 (https://phabricator.wikimedia.org/T316568) [18:17:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2031.mgmt.codfw.wmnet with reboot policy FORCED [18:18:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2032.mgmt.codfw.wmnet with reboot policy FORCED [18:21:52] PROBLEM - Host sretest1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:25:50] (03CR) 10Jbond: sre.hardware.upgrade-firmware: add a cache for firmware answers (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/835168 (owner: 10Jbond) [18:27:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2183.codfw.wmnet with OS bullseye [18:27:36] RECOVERY - Host sretest1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [18:27:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2183.codfw.wmnet with OS bullseye completed: - db2183... 
[18:29:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2184.codfw.wmnet with OS bullseye [18:29:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2184.codfw.wmnet with OS bullseye [18:44:57] RECOVERY - MegaRAID on an-worker1085 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:47:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [18:49:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2032.mgmt.codfw.wmnet with reboot policy FORCED [18:51:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2184.codfw.wmnet with reason: host reimage [19:00:25] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) @Marostegui Thank you so i did typo previous comment i had swapped with A6 I have pulled another TSR report and submitted to dell. Thank you for your assistance [19:04:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2184.codfw.wmnet with OS bullseye [19:04:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2184.codfw.wmnet with OS bullseye completed: - db2184... 
[19:06:25] (PS1) Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) [19:07:20] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (Papaul) [19:08:15] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q1:rack/setup/install db218[34] - https://phabricator.wikimedia.org/T313979 (Papaul) Open→Resolved @Marostegui all your's [19:08:33] (CR) Samtar: "Per the stalled-looking T265726, and what appears to be satisfactory responses/resolutions to the concerns raised. Boldly making a patch t" [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: Samtar) [19:10:08] SRE, RESTBase-API, Traffic, Performance-Team (Radar): Text cluster is being hit with an average of 1.8k PURGE requests per second per host - https://phabricator.wikimedia.org/T318349 (Krinkle) [19:11:47] (CR) Jdlrobson: wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/835245 (https://phabricator.wikimedia.org/T317070) (owner: Bartosz Dziewoński) [19:13:37] (PS1) Cmjohnson: Adding new kafka-logging servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/835254 (https://phabricator.wikimedia.org/T313960) [19:16:50] (PS1) Bartosz Dziewoński: Fix VisualEditor on wikis where RESTBase was never set up [mediawiki-config] - https://gerrit.wikimedia.org/r/835255 (https://phabricator.wikimedia.org/T318325) [19:26:42] (PS3) BCornwall: Prometheus: Remove ATS gauge periods [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) [19:26:58] (CR) BCornwall: Prometheus: Remove ATS gauge periods (1 comment) [puppet] -
10https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [19:27:46] (03PS4) 10BCornwall: Prometheus: Remove ATS gauge periods [puppet] - 10https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) [19:31:24] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10BBlack) CP Nodes' mapping for replacements and where NVMEs go: | cp nodes | Current | Replacement | Disks | text | 21-26, 33, 34 | 37-44 | Single NVME | upload | 27-32, 35, 36 | 45-52 | Dual N... [19:32:13] (03CR) 10Bartosz Dziewoński: wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835245 (https://phabricator.wikimedia.org/T317070) (owner: 10Bartosz Dziewoński) [19:32:44] (03CR) 10Cmjohnson: [C: 03+2] Adding new kafka-logging servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835254 (https://phabricator.wikimedia.org/T313960) (owner: 10Cmjohnson) [19:40:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye [19:40:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye [19:40:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-logging1004.eqiad.wmnet with OS bullseye [19:40:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed... 
[19:42:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-logging1004.eqiad.wmnet with OS bullseye [19:42:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye [19:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:50:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:50:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [19:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T314041)', diff saved to https://phabricator.wikimedia.org/P34918 and previous config saved to /var/cache/conftool/dbconfig/20220926-195019-ladsgroup.json [19:50:25] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:52:40] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T2000). [20:00:04] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] hey [20:00:18] Hi! [20:00:22] * TheresNoTime can deploy! [20:01:45] MatmaRex: going to start with 835245 :) [20:02:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835245 (https://phabricator.wikimedia.org/T317070) (owner: 10Bartosz Dziewoński) [20:02:26] thanks [20:03:35] (03Merged) 10jenkins-bot: wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835245 (https://phabricator.wikimedia.org/T317070) (owner: 10Bartosz Dziewoński) [20:03:37] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2031'] [20:03:49] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]] [20:03:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti2031'] [20:03:52] T317070: MobileFormatter has quadratic performance - https://phabricator.wikimedia.org/T317070 [20:04:11] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:04:18] MatmaRex: live on 1002 :) [20:04:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2031'] [20:04:58] !log pt1979@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti2031'] [20:05:07] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2032'] [20:05:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti2032'] [20:05:29] TheresNoTime: looks good, i see collapsible sections on https://en.m.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh [20:05:29] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2032'] [20:05:38] syncin' [20:05:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti2032'] [20:06:18] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2036'] [20:06:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash2036'] [20:06:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash2036'] [20:06:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash2036'] [20:06:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:00] (03PS2) 10Samtar: Fix VisualEditor on wikis where RESTBase was never set up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835255 (https://phabricator.wikimedia.org/T318325) (owner: 10Bartosz Dziewoński) [20:07:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10Papaul) [20:07:49] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install logstash203[67] - https://phabricator.wikimedia.org/T313848 (10Papaul) [20:07:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [20:07:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:17] (03CR) 10BCornwall: varnish/tests: improve UX, refactor run.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771863 (owner: 10Giuseppe Lavagetto) [20:09:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:03] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835245|wgMFMobileFormatterOptions: Set maxImages and maxHeadings to very high values (T317070)]] (duration: 06m 13s) [20:10:07] T317070: MobileFormatter has quadratic performance - https://phabricator.wikimedia.org/T317070 [20:10:13] Cool, now doing 835255 :) [20:10:22] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:10:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835255 (https://phabricator.wikimedia.org/T318325) (owner: 10Bartosz Dziewoński) [20:11:16] (03Merged) 10jenkins-bot: Fix VisualEditor on wikis where RESTBase was never set up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835255 (https://phabricator.wikimedia.org/T318325) (owner: 10Bartosz Dziewoński) [20:11:32] !log samtar@deploy1002 Started scap: Backport for [[gerrit:835255|Fix VisualEditor on wikis where RESTBase was never set up (T318325)]] [20:11:36] T318325: VisualEditor throws "Error contacting the Parsoid/RESTBase server (HTTP 404): (no message)" on affiliate wiki - https://phabricator.wikimedia.org/T318325 [20:11:52] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:835255|Fix VisualEditor on wikis where RESTBase was never set up (T318325)]] synced to the testservers: mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [20:12:07] MatmaRex: and that one is live on 1002, is it testable? [20:12:53] (ah looks to be) [20:13:05] looking [20:13:30] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-logging1004.eqiad.wmnet with OS bullseye [20:13:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, and 2 others: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-logging1004.eqiad.wmnet with OS bullseye executed... [20:14:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:07] TheresNoTime: also looks good. https://romd.wikimedia.org/wiki/Pagina_principal%C4%83?veaction=edit loads [20:14:23] Awesome, syncing [20:15:06] (03CR) 10Zabe: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [20:15:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:16:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:25] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:835255|Fix VisualEditor on wikis where RESTBase was never set up (T318325)]] (duration: 06m 52s) [20:18:30] T318325: VisualEditor throws "Error contacting the Parsoid/RESTBase server (HTTP 404): (no message)" on affiliate wiki - https://phabricator.wikimedia.org/T318325 [20:18:34] MatmaRex: everything sync'd :) [20:19:08] thanks! [20:19:18] You're welcome! [20:19:48] * TheresNoTime will still be around for a little longer if anyone has patches to deploy? 
[20:22:56] 10SRE, 10Traffic, 10serviceops, 10Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (10Krinkle) a:05Krinkle→03Joe Given the rollout of the PHP74 cookie campaign, I assume this has since been resolved. [20:23:29] 10SRE, 10Traffic, 10serviceops, 10Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (10Krinkle) 05Open→03Resolved a:03Krinkle [20:30:59] Last call for any patches to deploy ^^ [20:31:51] !log closing UTC late backport window [20:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:19] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:33] wasn't me.. [20:32:39] (03CR) 10BCornwall: lvs: Convert ::lvs::configuration to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [20:33:04] looking [20:33:39] (03CR) 10BCornwall: lvs: Convert ::lvs::configuration to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [20:33:45] * volans around too if needed, but is starting to be a bit late [20:33:48] rzl: need a hand? 
[20:33:53] here as well [20:34:09] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10SDunlap) [20:34:25] volans: sure, if you're offering :) don't stay later than you're comfortable though [20:36:34] from https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=1664220977646&to=1664224577647 the probes started getting upset around 20:29, and time-correlated with shellbox-syntaxhighlight [20:37:08] and yep there it is on https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox-syntaxhighlight&var-release=main [20:37:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:37:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:37] indeed [20:39:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:21] and yeah it looks like shellbox-syntaxhighlight is getting CPU throttled https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s&var-namespace=shellbox-syntaxhighlight&var-pod=shellbox-main-7795bf4db7-5rhdj&var-container=All [20:41:10] it looks like we'd need to triple the resources to keep up, I might try that at least as an emergency measure [20:41:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host 
centrallog1002.mgmt.eqiad.wmnet with reboot policy FORCED [20:42:09] rzl: do we have any indication that those requests are legit [20:42:09] ? [20:42:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:25] (the alert cleared but probes are still failing, just not 100% of them) [20:42:28] (haha never mind) [20:42:52] volans: I don't have any indications either way -- I can check that next, but when we've seen this in the past it's been legit due to spiky editing patterns [20:43:01] got it [20:43:46] e.g. with lilypond there was a visualeditor bug where it would send the thing to shellbox to generate previews every time you stopped typing for a sec -- so we got a lot of them at once, but only because those were the times somebody was editing scores [20:44:33] wouldn't be surprised if there's a similar pattern with syntaxhighlight except with no need for that render-on-pause behavior [20:45:09] ack, I've checked some other random graphs and I didn't find any particular smoking gun so far [20:45:57] I assume the logs for syntaxhighlight are somewhere in logstash? 
[20:46:18] yeah just looking for them now [20:47:13] https://wikitech.wikimedia.org/wiki/Shellbox#Logs says it should be https://logstash.wikimedia.org/goto/5cc6c66d2a04810e9adc8e33b22616ec but that's coming up empty for me, still fiddling [20:47:18] (ProbeDown) resolved: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:32] oh, the traffic stopped [20:48:08] classic "boa constrictor digesting an elephant" graph: https://grafana.wikimedia.org/goto/hMI7Jg44k [20:48:26] shellbox recovered somehow [20:48:48] does look like quite the elephant [20:49:18] yeah [20:49:24] (doh, ignore that logstash link, I got my tabs mixed up and started from the wrong dashboard) [20:51:06] (03PS1) 10Bking: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) [20:53:07] https://logstash.wikimedia.org/goto/32dbd4933dd7ab85a734e1b947387c65 if I read this right, shellbox got a spike in traffic all at once, and then took that long to churn through it? [20:53:31] possible I'm misinterpreting that [20:54:25] either way I'm inclined to leave this alone -- if it keeps happening we can look into provisioning for the traffic and/or figuring out where in the frontend it's coming from (like we did with shellbox-lilypond) [20:54:47] rzl: do the logs show the query? I don't see it in logstash, or am I doing something wrong? 
[20:55:49] here are some shellbox logs https://logstash.wikimedia.org/goto/913eae91e2daea91327dddc1565ba13a (besides using kubectl logs) [20:56:00] I'm not sure -- wikitech says we drop httpd access logs for 200s just because there are so many of them and they're mostly not helpful [20:56:36] (03CR) 10Urbanecm: [C: 04-2] "This change needs WMF legal signoff." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [20:56:42] we should have better data via MW logs though, if anyone wants to dig around for it [20:56:54] I'm not super inclined unless this recurs, but I could see going either way [20:58:24] (03PS1) 10Papaul: Add new ganeti node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835271 (https://phabricator.wikimedia.org/T313856) [20:59:04] jelto: thanks, I didn't see anything interesting, did kubectl logs show anything? [20:59:45] (03CR) 10Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [21:00:05] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220926T2100). [21:04:13] jhathaway: same here, I did not find anything in logstash. In kubectl logs there are a lot of POSTs to http://localhost:6027/shell/syntaxhighlight-pygments - multipart/mixed. But I have absolutely no idea if that's "normal" [21:04:58] jelto: thanks [21:05:11] it seems POST requests now are way smaller than during the incident [21:05:14] jelto: do you notice a sharp increase in how *many* requests there are, starting around 20:29? 
[21:05:23] ah yeah, or a decrease after will work too :) [21:05:28] that's a good sign you're looking at the right traffic [21:05:51] doesn't get us the POST bodies I guess, we can still go to MW logs for that [21:06:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host centrallog1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:06:50] (03CR) 10Papaul: [C: 03+2] Add new ganeti node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835271 (https://phabricator.wikimedia.org/T313856) (owner: 10Papaul) [21:09:48] rzl: yes there are a lot more of these requests starting at 20:27 [21:10:37] I'm not totally sure how to parse the httpd access logs, but I guess the POST requests are also bigger (or take longer, depending on what that column stands for) [21:11:41] 👍 [21:12:06] they at least took longer for sure, wouldn't be shocked if they're also bigger [21:12:42] you can also try that using something like kubectl logs -n shellbox-syntaxhighlight shellbox-main-7795bf4db7-5rhdj -c shellbox-main-httpd | less on the deploy host [21:15:45] ah thanks! 
see https://wikitech.wikimedia.org/wiki/Apache_log_format -- it looks like the request time went way up but the response sizes were pretty reasonable [21:17:12] (03CR) 10Urbanecm: [C: 04-2] InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [21:18:57] (03CR) 10Urbanecm: [C: 04-2] InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [21:22:36] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:39:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2031.codfw.wmnet with OS bullseye [21:39:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2031.codfw.wmnet with OS bullseye [21:40:18] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:57] rzl: I'm off. I tried some jq grep action on centrallog but was not able to find anything shellbox/syntaxhighlight/api related which looked abnormal. Feel free to double check that if it happens again, I'm still a bit stiff with jq [21:46:29] sounds good, thanks for looking! 
have a good night [21:49:32] 10SRE, 10Data Engineering Planning, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10nshahquinn-wmf) Just for the record: as @mpopov said above, Inuka is using `wprov` as the main source of data on Wikipedia Preview, so it would be essential to cons... [21:55:18] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:14:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage [22:18:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2031.codfw.wmnet with reason: host reimage [22:22:34] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:33:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2031.codfw.wmnet with OS bullseye [22:33:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2031.codfw.wmnet with OS bullseye completed: - ganeti2031 (**PAS... 
[22:37:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2032.codfw.wmnet with OS bullseye [22:37:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2032.codfw.wmnet with OS bullseye [22:56:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage [22:57:12] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:59:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2032.codfw.wmnet with reason: host reimage [23:00:11] (03PS2) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [23:01:41] (03PS3) 10Ryan Kemper: elastic: remove elastic10[48-52] from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/835269 (https://phabricator.wikimedia.org/T316728) (owner: 10Bking) [23:14:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2032.codfw.wmnet with OS bullseye [23:14:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti203[12] - https://phabricator.wikimedia.org/T313856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2032.codfw.wmnet with OS bullseye completed: - ganeti2032 (**PAS... 
[23:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T314041)', diff saved to https://phabricator.wikimedia.org/P34919 and previous config saved to /var/cache/conftool/dbconfig/20220926-231915-ladsgroup.json [23:19:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:21:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [23:34:11] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices1004.wikimedia.org [23:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P34920 and previous config saved to /var/cache/conftool/dbconfig/20220926-233422-ladsgroup.json [23:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P34921 and previous config saved to /var/cache/conftool/dbconfig/20220926-234928-ladsgroup.json [23:56:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1005.wikimedia.org [23:58:13] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook