[00:06:14] (03CR) 10Krinkle: [C: 03+1] "Scheduled for tomorrow https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - 10https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: 10Reedy) [00:46:43] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:45] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:17] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247 [00:48:17] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:20] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [00:49:49] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:13:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:33:41] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) p:05Triage→03Medium [01:37:46] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:46] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:07] !log krinkle@deploy1002 Started deploy [integration/docroot@f59119c]: (no justification provided) [01:50:21] !log krinkle@deploy1002 Finished deploy [integration/docroot@f59119c]: (no justification provided) (duration: 00m 14s) [01:57:46] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:38] (03CR) 10Eevans: "I think what @joe was alluding to was that if you used a name other than `profile::swift::accounts_keys` for the Hash[String Hash] structu" [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [02:08:40] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at 
a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [02:08:43] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [02:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:17:08] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) [02:17:46] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:40:43] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:41:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [02:41:11] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [02:42:15] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring 
[02:46:49] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [02:46:52] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0300) [03:01:59] PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581) [03:07:54] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [03:12:10] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [03:12:13] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [03:24:33] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [03:25:21] RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[03:31:11] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10Eileenmcnaughton) 05Open→03Resolved OK - I think this is resolved - my understanding from https://phabricator.wikimedia.org/T321494 is that 'done' looks like 'I can access ht... [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0400) [04:04:47] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:40] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Slaporte) Thanks for resolving this while I was out. There are no legal concer... 
[05:27:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:30:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39018/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [05:32:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:37] !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync idm-test1001 - slyngshede@cumin1001" [05:39:32] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync idm-test1001 - slyngshede@cumin1001" [05:40:15] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39019/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [05:40:50] (03PS1) 10KartikMistry: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278) [05:44:21] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39020/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [06:01:37] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:02:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:22:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39021/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [06:22:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133 [06:22:31] T326133: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T326133 [06:22:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133 [06:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 T326133', diff saved to https://phabricator.wikimedia.org/P42938 and previous config saved to /var/cache/conftool/dbconfig/20230110-062309-ladsgroup.json [06:42:14] (03PS2) 10Ladsgroup: mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/874826 (https://phabricator.wikimedia.org/T326133) (owner: 10Gerrit maintenance bot) [06:42:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1100 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/874826 (https://phabricator.wikimedia.org/T326133) (owner: 10Gerrit maintenance bot) [06:51:39] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10ayounsi) 
a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Could you work with @Marostegui to get this SFP-T replaced? see the errors on https://librenms.wikimedia.org/device/device=160/tab=port/port=15307/ [06:52:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:57:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0700) [07:00:05] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0700). 
[07:00:14] o/ [07:01:42] !log Starting s5 eqiad failover from db1130 to db1100 - T326133 [07:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:45] T326133: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T326133 [07:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T326133', diff saved to https://phabricator.wikimedia.org/P42939 and previous config saved to /var/cache/conftool/dbconfig/20230110-070152-ladsgroup.json [07:01:56] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [07:02:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T326133', diff saved to https://phabricator.wikimedia.org/P42940 and previous config saved to /var/cache/conftool/dbconfig/20230110-070223-ladsgroup.json [07:03:10] !log remove static routes for legacy dns-rec-lb IPs - T239993 [07:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:13] T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 [07:03:56] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [07:05:03] (03PS2) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/874827 (https://phabricator.wikimedia.org/T326133) (owner: 10Gerrit maintenance bot) [07:05:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/874827 (https://phabricator.wikimedia.org/T326133) (owner: 10Gerrit maintenance bot) [07:06:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 T326133', diff saved to https://phabricator.wikimedia.org/P42941 and previous config saved to /var/cache/conftool/dbconfig/20230110-070628-ladsgroup.json [07:10:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:10:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [07:11:48] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:14:59] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if dns update is needed after change of rec-dns-lb IPs status - ayounsi@cumin1001" [07:16:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if dns update is needed after change of rec-dns-lb IPs status - ayounsi@cumin1001" [07:16:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:16:30] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10ayounsi) a:05ayounsi→03BCornwall Static routes 
removed! Next step is to remove the IPs from the servers: That means removing everything related to "legacy_vip" in Puppet https://github.com/wikime... [07:19:14] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) asw2-b-eqiad:fpc1:1/1 is still showing errors... Next step will be to replace the fiber between the two (already replaced) optics. @Jclark-ctr let me know when woul... [07:22:08] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [07:22:22] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2031.codfw.wmnet [07:22:49] (03PS1) 10Ayounsi: Depool ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532) [07:23:07] (03PS2) 10Ayounsi: Depool ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532) [07:27:38] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [07:28:30] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [07:28:54] !log depool ulsfo for network maintenance - T316532 [07:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:58] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [07:32:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:33:21] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [07:33:53] !log jiji@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reboot-single (exit_code=1) for host mc2044.codfw.wmnet [07:36:04] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [07:37:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [07:37:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:37:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2031.codfw.wmnet [07:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:42:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:45:27] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2032.codfw.wmnet [07:52:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:52:57] (03PS3) 10KartikMistry: ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721) [07:55:28] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:28] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:01:33] * kart_ is around and will go for deployment.. 
[08:02:05] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [08:02:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721) (owner: 10KartikMistry) [08:02:55] (03Merged) 10jenkins-bot: ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721) (owner: 10KartikMistry) [08:03:34] !log kartik@deploy1002 Started scap: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]] [08:03:39] T324721: Modify Machine Translation in Czech Wikipedia by 20% or more to publish a translation - https://phabricator.wikimedia.org/T324721 [08:05:28] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:06:34] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39022/console" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [08:07:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:08:24] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:09:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:09:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [08:10:28] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:11:46] (03PS7) 10Slyngshede: role:IDM assign IDM role to test VM. [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [08:15:00] 10SRE, 10Discovery-Search, 10Elasticsearch, 10Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (10Peachey88) p:05Unbreak!→03High Changing from Unbreak to High because it was resolved this morning. [08:17:17] Anything with mw1418 particular? Got this: "8:11:41 Check 'Logstash Error rate for mw1418.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.02, After: 2.00, Threshold: 1.00)" [08:17:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [08:18:54] mw1418 is a canary, but nothing else is special with it [08:19:49] zabe: yeah. [08:20:55] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]] (duration: 17m 21s) [08:20:59] T324721: Modify Machine Translation in Czech Wikipedia by 20% or more to publish a translation - https://phabricator.wikimedia.org/T324721 [08:22:26] Moving to the next patch. 
[08:22:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [08:22:58] 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) 05Open→03Resolved [08:26:35] (03PS1) 10Muehlenhoff: Fix up package list after ldapsupportlib removal [puppet] - 10https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) [08:27:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (10Jelto) [08:36:52] (03Merged) 10jenkins-bot: CX: Fix usage of categories translation unit as array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [08:37:07] !log kartik@deploy1002 Started scap: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]] [08:37:15] T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278 [08:38:56] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:48:31] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: 10Jelto) [08:49:15] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]] (duration: 12m 08s) [08:49:18] T326278: Error: Cannot use object of type 
ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278 [08:49:20] (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.5 [puppet] - 10https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616) [08:50:12] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:50:27] (03CR) 10Jelto: [C: 03+2] admin: add zabe to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: 10Jelto) [08:51:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [08:51:41] (It seems backport deployment will stretch a bit or maybe be a byte!) [08:52:45] (03PS1) 10KartikMistry: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278) [08:53:16] (03CR) 10Slyngshede: role:IDM assign IDM role to test VM. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: 10Slyngshede) [08:54:03] !log upgrade thanos to 0.30.1 on prometheus2006 - T303154 [08:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:09] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [08:56:54] !log upgrade thanos to 0.30.1 on thanos-fe1001 - T303154 [08:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:01] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, 10observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) @Slaporte - I am thinking of modifying the script to check that the i... [08:58:06] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [09:03:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (10Jelto) 05Open→03Resolved a:03Jelto @Zabe you should have access to `deployment` group now. Happy to have you on board! I'm closing this task. Feel free to re-open i... 
[09:04:29] (03PS2) 10Jelto: sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) [09:05:15] (03Merged) 10jenkins-bot: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278) (owner: 10KartikMistry) [09:05:32] !log kartik@deploy1002 Started scap: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] [09:05:35] T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278 [09:06:38] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:06:38] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:07:17] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:08:19] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [09:08:48] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:10:20] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:11:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616) (owner: Jelto) [09:13:39] (PS1) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 [09:13:41] (CR) Muehlenhoff: "Looks good, but you also need to update the Admin::UID Variant for the new names." [puppet] - https://gerrit.wikimedia.org/r/877274 (owner: Dzahn) [09:14:26] (PS2) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 [09:14:52] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] (duration: 09m 20s) [09:14:53] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) Same issue with `rcp: /var/run/./vjunos-install.sh: Read-only file system` and then `mount: /dev/ad0s1a : Resource temporarily unavailable`, which... [09:14:55] T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278 [09:15:08] (CR) Muehlenhoff: "Better set these in hieradata/role/common/idm.yaml, then they apply to all future test hosts as well."
[labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede) [09:15:38] (PS1) Ayounsi: Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/877224 (https://phabricator.wikimedia.org/T316532) [09:15:40] !log Done: UTC morning backport window [09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:05] (CR) Jelto: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 15.5 [puppet] - https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616) (owner: Jelto) [09:17:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [09:17:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:17:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2032.codfw.wmnet [09:18:20] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) Note that removing ` [edit system] - internet-options { - tcp-drop-synfin-set; - no-tcp-reset drop-all-tcp; - } ` Is needed otherwi...
[09:18:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [09:19:03] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2033.codfw.wmnet [09:22:02] (PS3) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 [09:22:23] !log added zabe to wmf-deployment gerrit group T326327 [09:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [09:24:26] (CR) Ayounsi: [C: +2] Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/877224 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi) [09:25:46] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff) [09:25:48] !log repool ulsfo (maintenance cancelled) - T316532 [09:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:51] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [09:33:04] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:18] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9568478]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@9568478] [09:34:25] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [09:34:30] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9568478]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@9568478] (duration: 00m 11s) [09:34:30] (CR) Muehlenhoff: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede) [09:34:55] (CR) Slyngshede: [V: +1] idm-test: Stub out secrets for PCC [labs/private] -
https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede) [09:35:01] (CR) Slyngshede: [V: +2] idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede) [09:35:10] (CR) Slyngshede: [V: +2 C: +2] idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede) [09:42:46] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:43:00] !log upgrade thanos to 0.30.1 on thanos-fe100[2-3] - T303154 [09:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:04] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [09:45:24] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9568478]: Fix bug fix in HDFS usage pipeline [airflow-dags@9568478] [09:45:38] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9568478]: Fix bug fix in HDFS usage pipeline [airflow-dags@9568478] (duration: 00m 13s) [09:46:44] (PS8) Slyngshede: role:IDM assign IDM role to test VM.
[puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [09:47:41] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39027/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede) [09:49:43] (CR) Giuseppe Lavagetto: [C: +1] Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - https://gerrit.wikimedia.org/r/877207 (https://phabricator.wikimedia.org/T326119) (owner: Clément Goubert) [09:52:23] (CR) Clément Goubert: [C: +2] Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - https://gerrit.wikimedia.org/r/877207 (https://phabricator.wikimedia.org/T326119) (owner: Clément Goubert) [09:52:27] SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) The pre-upgrade went fine on asw1-eqsin, so I guess the ulsfo issue is a corrupted storage. The last step for eqsin is a reboot, so I'll maintain... [09:52:58] (PS1) JMeybohm: PKI: Default expiry of 3 days for wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/877961 [09:53:26] (PS9) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) [09:53:53] !log installing systemd bugfix updates from Bullseye point release [09:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:58] slyngs: There's an idm.yaml change pending on puppetmaster, should I merge it or do I leave it for you?
[09:54:17] If you're there them please just merge [09:54:24] there [09:54:27] slyngs: all done [09:54:28] Aarg [09:54:30] Thanks [09:54:33] (CR) Arturo Borrero Gonzalez: [C: +1] Fix up package list after ldapsupportlib removal [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff) [09:54:42] (PS4) Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903 [09:54:44] (PS3) Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 [09:55:19] !log upgrade thanos to 0.30.1 on prometheus hosts - T303154 [09:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:23] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [09:56:09] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39028/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede) [09:56:09] SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (RhinosF1) @mutante: adding as IC, can you please let people know when the incident report from last night is ready? @multichill: I’ve ad...
[09:57:46] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:57:58] ^ expected, gitlab replica [09:57:58] SRE, Traffic-Icebox: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 (Vgutierrez) Open→Resolved a:Vgutierrez [09:59:34] !log cgoubert@cumin1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1002.eqiad.wmnet [10:02:00] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [10:02:46] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:27] (CR) Muehlenhoff: [C: +2] Fix up package list after ldapsupportlib removal [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff) [10:06:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [10:06:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2033.codfw.wmnet [10:07:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet [10:09:42] (PS1)
JMeybohm: k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) [10:10:50] PROBLEM - Memcached on mc2034 is CRITICAL: connect to address 10.192.48.78 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [10:11:36] (PS1) Majavah: ldap: move ssh-key-ldap-lookup directly to ssh module [puppet] - https://gerrit.wikimedia.org/r/877964 [10:12:28] (PS1) Muehlenhoff: profile::openldap::client: Stop including ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/877965 [10:13:14] RECOVERY - Memcached on mc2034 is OK: TCP OK - 0.033 second response time on 10.192.48.78 port 11214 https://wikitech.wikimedia.org/wiki/Memcached [10:13:43] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39031/console" [puppet] - https://gerrit.wikimedia.org/r/877964 (owner: Majavah) [10:13:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet [10:13:44] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet [10:13:51] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39032/console" [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:14:35] !log repooled parse1002.eqiad.wmnet - T326119 [10:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:37] T326119: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 [10:14:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [10:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability
Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:18:02] (CR) Btullis: [C: +2] Detect the correct disks for the O/S on the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: Btullis) [10:18:20] (CR) Btullis: [C: +2] Detect the correct disks for the O/S on the cephosd servers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: Btullis) [10:18:54] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1002.eqiad.wmnet with OS bullseye [10:19:18] jouncebot: nowandnext [10:19:18] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [10:19:19] In 0 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100) [10:21:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:21:45] !log Starting rolling reboot of eqiad jobrunners [10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [10:24:49] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye [10:25:42] (CR) Muehlenhoff: "I was thinking of rather just moving the content of ldap::client::utils to profile::base::labs (and then axing the "utils" check from ldap" [puppet] - https://gerrit.wikimedia.org/r/877964 (owner: Majavah) [10:28:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:29:44] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:52] PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-ask-password-console.path,systemd-ask-password-wall.path https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:56] !log upgrade thanos to 0.30.1 on thanos-fe2* - T303154 [10:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:59] T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154 [10:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [10:42:44] (PS1) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) [10:43:05] (CR) CI reject: [V: -1] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm) [10:44:36] (PS2) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) [10:45:26] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff) [10:45:45] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5
point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [10:46:01] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39033/console" [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm) [10:48:15] (CR) Giuseppe Lavagetto: [C: +2] Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903 (owner: Giuseppe Lavagetto) [10:52:50] (Merged) jenkins-bot: Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903 (owner: Giuseppe Lavagetto) [10:54:43] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede) [10:56:46] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39037/console" [puppet] - https://gerrit.wikimedia.org/r/877961 (owner: JMeybohm) [10:59:01] (CR) Jbond: [V: +1 C: +2] systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond) [10:59:15] (PS16) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365 [10:59:43] (PS4) Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 [10:59:52] PROBLEM - Host an-worker1080 is DOWN: PING CRITICAL - Packet loss = 100% [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100) [11:00:43] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2034.codfw.wmnet [11:00:55] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet
[11:02:50] (PS14) Jbond: httpd: add flag to wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: Hashar) [11:02:59] (PS6) Jbond: gerrit: make Apache wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: Hashar) [11:04:00] SRE, ops-eqiad, DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (jcrespo) an-worker1080 downtime alerting expired. No issue on our side, just a friendly ping in case you want to extend it. [11:04:04] (CR) Jbond: [C: +2] httpd: add flag to wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: Hashar) [11:05:24] (CR) Giuseppe Lavagetto: [C: +1] k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [11:06:06] (CR) Arturo Borrero Gonzalez: [C: +1] profile::openldap::client: Stop including ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/877965 (owner: Muehlenhoff) [11:06:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [11:08:42] (CR) JMeybohm: [C: +2] PKI: Default expiry of 3 days for wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/877961 (owner: JMeybohm) [11:08:45] (CR) JMeybohm: [V: +1 C: +2] k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [11:10:56] (CR) Giuseppe Lavagetto: [C: +1] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner:
JMeybohm) [11:12:02] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:39] (PS3) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) [11:13:30] SRE, Wikimedia-Etherpad, serviceops-collab: Upgrade etherpad.wikimedia.org to 1.8.18 - https://phabricator.wikimedia.org/T316421 (LSobanski) [11:16:17] SRE, Wikimedia-Etherpad, serviceops-collab: Upgrade etherpad.wikimedia.org to 1.8.18 - https://phabricator.wikimedia.org/T316421 (LSobanski) I updated the description to focus this task on Etherpad version upgrade as suggested previously. Please create tasks for specific plugins so that they can be e... [11:29:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:32:11] (PS1) Muehlenhoff: Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994 [11:33:17] (CR) Arturo Borrero Gonzalez: [C: +1] Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994 (owner: Muehlenhoff) [11:33:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [11:35:47] (CR) Muehlenhoff: [C: +2] Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994 (owner: Muehlenhoff) [11:35:57] !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [11:39:48] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:39:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:44:27] effie: the uncommitted DNS changes seems related to your decom of mc2034 [11:44:41] yes hangon [11:44:52] sorry mybad [11:45:10] k, no prob [11:46:14] <_joe_> jouncebot: nowandnext [11:46:14] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100) [11:46:14] In 2 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [11:46:14] In 2 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [11:46:32] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 (owner: Giuseppe Lavagetto) [11:46:50] <_joe_> I should be able to finish my changes before the backport window [11:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:48:35]
!log jiji@cumin1001 START - Cookbook sre.dns.netbox [11:51:03] (Merged) jenkins-bot: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 (owner: Giuseppe Lavagetto) [11:51:44] <_joe_> ok here we go [11:52:37] (PS2) Muehlenhoff: Rename alias [puppet] - https://gerrit.wikimedia.org/r/875971 [11:52:42] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:53:11] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:54:54] (CR) Muehlenhoff: [C: +2] Rename alias [puppet] - https://gerrit.wikimedia.org/r/875971 (owner: Muehlenhoff) [11:55:10] <_joe_> oof, sigh [11:55:41] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/869777 (owner: Muehlenhoff) [11:56:37] (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/870558 (owner: Muehlenhoff) [11:56:51] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:56:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:57:26] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:57:59] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:58:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:59:29] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [12:01:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:02:08] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [12:04:07] (CR) Jbond: phabricator: change phd home dir to /var/lib/phd (2 comments) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [12:04:50] (CR) Jbond: [C: +1] Add SPDX headers to various base/IF profiles [puppet] - https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [12:05:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:05:29] (CR) Jbond: [C: +1] Add SPDX headers for various IF profiles [puppet] - https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [12:06:22] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:06:44] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:07:43] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:07:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:11:43] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:12:07] !log Finished rolling reboot of eqiad jobrunners [12:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:17:20] (PS1) Muehlenhoff: Decom puppetdb-test2001 [puppet] - https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) [12:17:39] (CR) CI reject: [V: -1] Decom puppetdb-test2001 [puppet] - https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [12:17:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:18:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [12:18:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:18:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2034.codfw.wmnet [12:19:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [12:22:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:22:49] SRE, ConfirmEdit
(CAPTCHA extension), Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (LSobanski) @Reedy @MoritzMuehlenhoff is there anything else left to do here or can the task be resolved? [12:22:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:25:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [12:25:48] (CR) Jbond: [C: +1] "lgtm, optional nit to make more dry" [puppet] - https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: Muehlenhoff) [12:27:54] SRE, ConfirmEdit (CAPTCHA extension), Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is done, the ConfirmEdit extension as deployed in production uses Python 3 and then Puppe...
[12:27:59] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:28:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:08] (03PS1) 10Giuseppe Lavagetto: mediawiki::mcrouter::yaml_defs: adapt to new values structure [puppet] - 10https://gerrit.wikimedia.org/r/878004 [12:30:32] (03CR) 10Muehlenhoff: [C: 03+2] Decom puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:31:05] !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [12:31:05] !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [12:31:28] !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [12:31:28] !log oblivian@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:32:13] 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10LSobanski) The task's original intent was to cover planning "over the next 3 years" starting in 2019. @ArielGlenn is the task still relevant, can... 
[12:32:41] (03PS1) 10Btullis: Correct the units for the cephosd volumes [puppet] - 10https://gerrit.wikimedia.org/r/878005 (https://phabricator.wikimedia.org/T324670) [12:33:14] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39038/console" [puppet] - 10https://gerrit.wikimedia.org/r/878004 (owner: 10Giuseppe Lavagetto) [12:34:41] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1002.eqiad.wmnet with OS bullseye [12:34:51] (03CR) 10Btullis: [C: 03+2] Correct the units for the cephosd volumes [puppet] - 10https://gerrit.wikimedia.org/r/878005 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [12:35:00] 10SRE: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving based on the most recent comment. Please reopen if appropriate. 
[12:36:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye [12:36:30] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::mcrouter::yaml_defs: adapt to new values structure [puppet] - 10https://gerrit.wikimedia.org/r/878004 (owner: 10Giuseppe Lavagetto) [12:39:50] 10SRE, 10serviceops, 10User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 (10LSobanski) [12:40:43] (03PS2) 10Jbond: admin: split system user data type into local and global [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn) [12:41:13] (03CR) 10Jbond: admin: split system user data type into local and global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn) [12:41:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn) [12:44:14] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (10LSobanski) [12:45:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn) [12:46:12] 10SRE, 10User-MoritzMuehlenhoff, 10User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving based on the recent comments, follow up work should be happening in T261196. 
[12:47:07] jouncebot: nowandnext [12:47:07] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [12:47:07] In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [12:47:07] In 1 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [12:47:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [12:49:57] !log Starting rolling reboot of eqiad appservers [12:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:00] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetdb-test2001.codfw.wmnet [12:50:09] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:50:11] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10LSobanski) @hashar As the original requester (T307349#7895775), could you help clarify what's needed here? 
[12:50:14] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [12:50:28] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [12:53:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:56:34] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:56:45] (03PS1) 10Btullis: Reduce the size of the partitions on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/878007 (https://phabricator.wikimedia.org/T324670) [12:58:30] (03CR) 10Btullis: [C: 03+2] Reduce the size of the partitions on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/878007 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [12:59:27] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1002.eqiad.wmnet with OS bullseye [12:59:45] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye [13:00:10] (03CR) 10Jbond: Detect the correct disks for the O/S on the cephosd servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [13:05:15] 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) >>! In T226093#8512308, @LSobanski wrote: > The task's original intent was to cover planning "over the next 3 years" starting in 2019... 
[13:08:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:08:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:08:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetdb-test2001.codfw.wmnet [13:09:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetdb-test2001.codfw.wmnet` - puppetdb-test2001.codfw.wmnet... [13:10:31] (03PS1) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:10:46] (03PS2) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:11:04] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:14] PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:25] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:11:27] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:13:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff) [13:14:02] The two mw hosts down is me, they don't seem to be coming back 
and the cookbook downtime expired [13:16:02] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [13:18:37] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [13:19:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [13:19:46] (03CR) 10Muehlenhoff: [C: 03+2] profile::openldap::client: Stop including ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/877965 (owner: 10Muehlenhoff) [13:19:54] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:19:59] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:20:18] PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:20] RECOVERY - Check systemd state on mw1351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:46] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) So the above is my proposal for the check: rather than checking exact word... 
[13:22:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2001.wikimedia.org [13:22:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:18] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:24:40] RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 2.48 ms [13:24:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:25:30] PROBLEM - Check systemd state on mw1352 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:40] RECOVERY - Check systemd state on mw1352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2001.wikimedia.org [13:29:58] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:50] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387) [13:31:32] (03PS3) 10Jcrespo: 
check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:31:56] (03PS4) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:32:30] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:32:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:45] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:35:56] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [13:36:18] (03PS5) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:36:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [13:36:50] (03PS6) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:36:52] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor 
check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:37:14] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2035.codfw.wmnet [13:37:23] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [13:40:13] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:43:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [13:43:21] (03PS1) 10Muehlenhoff: Failover irc CNAME to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/878017 [13:44:41] (03CR) 10Muehlenhoff: [C: 03+2] Failover irc CNAME to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/878017 (owner: 10Muehlenhoff) [13:44:57] !log delete grafana dashboards from "sre dashboards for deletion" folder - T178690 [13:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:00] T178690: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 [13:45:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:07] (03PS7) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [13:46:46] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [13:46:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:46:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bullseye [13:49:46] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2035.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [14:00:05] eigyan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400) [14:00:06] Greetings All 0/ [14:00:31] o/ [14:00:39] For deployers' information, I am rebooting appservers in eqiad, which may cause some scap failures [14:00:48] heya heya heya, I would like to try stuff out [14:00:55] Just ping me with the machine failing and I'll tell you if it's me or not [14:01:02] I'm not touching mwdebug [14:01:13] claime: will you ensure those will then be updated with the latest config afterwards? [14:01:24] taavi: if you get a failure yes [14:01:29] if not, they scap pull at boot [14:02:03] zabe: sure! I'm around too and happy to help if you have any problems [14:02:24] nice, thanks [14:02:27] meanwhile.. eigyan: why are fawiki and enwiki added with the + syntax while the others aren't? [14:02:36] I’m in a meeting at the moment, I could do some backports later if someone pings me to remind me (after :20 or so) [14:03:37] zabe: I'm around for the first 45 minutes, in case you need me. 
[14:03:43] taavi +syntax ensures a merge with beta-config will not overwrite its values [14:03:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2035.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [14:03:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:03:52] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2035.codfw.wmnet [14:04:18] (03CR) 10Filippo Giunchedi: "There's a merge conflict now though LGTM overall (not tested yet)" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:04:24] zabe: TLDR it's ssh to deploy1002 and run `scap backport 877268` those days, it'll practically guide you itself :) [14:04:59] ok :) [14:06:08] (03PS2) 10Zabe: [config]: GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [14:06:55] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize staging-codfw with k8s 1.23 [14:06:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [14:06:59] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage1001.eqiad.wmnet [14:07:04] PROBLEM - Memcached on mc2036 is CRITICAL: connect to address 10.192.48.80 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [14:07:05] <_joe_> claime: uhm actually that has been removed in some refactor (scap pull on reboot), sigh [14:07:12] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize 
staging-codfw with k8s 1.23 [14:07:13] _joe_: *sigh* [14:07:15] ok [14:07:17] I'll stop it then [14:07:28] <_joe_> I mean, it's not the end of the world [14:07:28] Gimme 5 minutes to cleanup [14:07:41] (03Merged) 10jenkins-bot: [config]: GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan) [14:07:49] <_joe_> we sill just need to run a scap full sync at the end I guess [14:07:55] !log zabe@deploy1002 Started scap: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]] [14:07:57] T325136: Deploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - week of January 9, 2023 - https://phabricator.wikimedia.org/T325136 [14:08:08] There's quite a few needing a racadm kick to the butt to actually reboot :/ [14:08:08] _joe_: scap backport runs full sync every time those days, so that should not be an issue [14:08:26] <_joe_> urbanecm: right, but if some machine fail on the last scap [14:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:09:06] ah, then yes, you're right. [14:09:47] that should be solvably just by not starting the full sync before c.laime is done with pausing the script? 
[14:09:47] !log zabe@deploy1002 zabe and essexigyan: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:10:00] yeah, I will wait [14:10:13] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [14:10:24] eigyan, you can test in the meantime :) [14:10:42] will do zabe [14:11:14] (03PS8) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [14:11:37] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2036.codfw.wmnet [14:11:45] mw1371/mw1372 need a hard reset [14:11:52] That's a few in a row :/ [14:11:58] (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [14:12:12] zabe all surveys are display as expected woo hoo! [14:12:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:13:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:14:17] claime, Can you let me know when I can sync? 
[14:14:20] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apifeatureusage1001.eqiad.wmnet [14:14:48] zabe: yep, just a few minutes to perform CPR on that host, I'll give you the go ahead [14:14:51] (03PS13) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) [14:14:55] ok, thanks [14:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:16:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39040/console" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [14:17:22] (03PS9) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [14:18:14] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw[1369-1372].eqiad.wmnet [14:18:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1369-1372].eqiad.wmnet [14:18:53] zabe: you can go ahead [14:18:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:04] thanks, syncing [14:19:27] !log Pausing reboots of eqiad appservers for deployments [14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 
2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) This is an example of its execution in verbose mode, you can see it is abl... [14:21:16] (03CR) 10Jcrespo: "This is an example of its execution in verbose mode, you can see it is able to find and crawl the referred texts (only the last line would" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [14:21:25] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage2001.codfw.wmnet [14:22:46] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:23:42] (03CR) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [14:23:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:37] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]] (duration: 17m 42s) [14:25:47] T325136: Deploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - week of January 9, 2023 - https://phabricator.wikimedia.org/T325136 [14:26:08] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [14:26:20] eigyan, should be live :) [14:26:47] Excellent, thank you zabe [14:27:09] thanks to all who made this deployment possible :) [14:27:31] 
(03PS2) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [14:27:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:22] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye [14:28:36] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye [14:28:36] (03CR) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:28:36] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2036.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [14:28:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:29:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:32:01] (03PS1) 10Zabe: Start reading from cul_actor on remaining test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) [14:33:30] !log jiji@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2036.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [14:33:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2036.codfw.wmnet [14:33:35] (03PS10) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [14:34:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:34:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host apifeatureusage2001.codfw.wmnet [14:34:57] (03PS2) 10Zabe: Start reading from cul_actor on remaining test wikis and group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) [14:35:37] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2037.codfw.wmnet [14:35:47] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [14:35:53] (03PS11) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) [14:36:37] (03CR) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [14:36:53] !log run populateCulActor on group0 wikis # T325484 [14:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:56] T325484: Run 
PopulateCulActor on all wikis - https://phabricator.wikimedia.org/T325484 [14:37:10] (03CR) 10Lucas Werkmeister (WMDE): "I don’t understand the phpstan error in CI…" [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE)) [14:37:34] 10SRE, 10Traffic: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10ssingh) [14:37:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:38:24] (03Merged) 10jenkins-bot: Start reading from cul_actor on remaining test wikis and group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:38:37] !log zabe@deploy1002 Started scap: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]] [14:38:43] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:39:13] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:40:24] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:44:46] (03CR) 10Eevans: [C: 03+1] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387) (owner: 10Muehlenhoff) [14:46:15] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [14:46:59] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in 
U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @ayounsi @cmooney I have 2 questions 1- I have a total of 17 switches received so 1 is going to be used as the cloudsw in r... [14:47:36] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]] (duration: 08m 59s) [14:47:39] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:47:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:21] !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [14:48:45] (03PS1) 10JMeybohm: install_server: Update kubestagetcd2* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878047 (https://phabricator.wikimedia.org/T326340) [14:49:14] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:14] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader1001.eqiad.wmnet [14:49:38] !log UTC afternoon deploys done [14:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:43] (03PS1) 10Ssingh: Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) [14:50:16] claime, you can continue with your reboots if you like [14:50:35] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache 
httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) We have now the logs in kafka, and thus should also be ingested in logstash, and create a dashboard. Once that's done, we should reduce also the retention time of... [14:51:30] (03CR) 10Filippo Giunchedi: "LGTM, though see inline. I got a an AttributeError when testing" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [14:51:54] (03CR) 10JMeybohm: [C: 03+2] install_server: Update kubestagetcd2* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878047 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [14:52:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:07] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1001.eqiad.wmnet [14:53:35] 10SRE, 10Thumbor, 10serviceops: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (10LSobanski) [14:54:13] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:55:29] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye [14:55:38] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye [14:55:58] PROBLEM - Host mc2050 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:02] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Ottomata) If we did {T291645} and {T276972}, these 
logs could be mirrored to Kafka jumbo and available in Hive and Turnilo too. [14:56:48] !log start VC link maintenance in eqiad - T325803 [14:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:55] T325803: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 [15:01:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001" [15:01:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2037.codfw.wmnet [15:02:21] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye [15:02:55] (03PS1) 10Jbond: docker::baseimages: inject no_proxy config to rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) [15:04:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39041/console" [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: 10Jbond) [15:04:44] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) >>! In T265876#8512693, @Ottomata wrote: > If we did {T291645} and {T276972}, these logs could be mirrored to Kafka jumbo and available in Hive and Turnilo too. Wh... [15:04:52] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:04:55] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10Jclark-ctr) asw2-b-eqiad: fpc1:1/1 Cleaned fiber and replaced optic [15:05:36] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi) 1/ 1 ToR per rack = 8x2 + 1 spare = 17, so indeed 1 dedicated to WMCS 2/ A1 and B1 would make sens, and would match eqiad... [15:06:03] RECOVERY - Host mc2050 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [15:09:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [15:11:21] (03CR) 10Ottomata: [C: 03+2] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:11:24] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2001.codfw.wmnet with reason: host reimage [15:13:40] (03CR) 10CI reject: [V: 04-1] Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [15:14:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2001.codfw.wmnet with reason: host reimage [15:16:17] (03Merged) 10jenkins-bot: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:17:05] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr let me know if you need me to depool this host (db1107). It can be easily be done, it is just a replica. 
[15:17:23] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader2001.codfw.wmnet [15:18:56] 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Marostegui) I think even if they grow a lot, with the new set of servers, we still have 6.6TB free (76% free disk space)...I'd be surprised if we... [15:21:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2001.codfw.wmnet [15:22:27] (03PS1) 10Marostegui: Revert "db1143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/877973 [15:22:52] (03CR) 10Marostegui: [C: 03+2] Revert "db1143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/877973 (owner: 10Marostegui) [15:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After the incident', diff saved to https://phabricator.wikimedia.org/P42944 and previous config saved to /var/cache/conftool/dbconfig/20230110-152336-root.json [15:25:32] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:25:36] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:27:46] (JobUnavailable) firing: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:05] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [15:29:54] !log Restarting rolling reboots of eqiad appservers [15:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:02] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:30:59] (03CR) 10Ssingh: "We have a successful build on build2001, this is just the regular test failing." 
[debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh) [15:32:08] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:35:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [15:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After the incident', diff saved to https://phabricator.wikimedia.org/P42945 and previous config saved to /var/cache/conftool/dbconfig/20230110-153841-root.json [15:43:03] (03PS3) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [15:44:01] (03PS4) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) [15:44:06] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:45:40] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:48:50] (03PS1) 10Ottomata: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) [15:49:47] (03PS2) 10Ottomata: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) [15:49:49] (03PS1) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/878117 [15:50:16] (03CR) 10JMeybohm: [C: 03+1] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:50:49] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:51:33] (03CR) 10Btullis: [C: 03+1] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:52:00] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) >>! In T317169#8508073, @jcrespo wrote: > Thank you, while I understand why... 
[15:52:48] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2002.codfw.wmnet with OS bullseye [15:53:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After the incident', diff saved to https://phabricator.wikimedia.org/P42946 and previous config saved to /var/cache/conftool/dbconfig/20230110-155346-root.json [15:54:40] (03PS2) 10BCornwall: prometheus: Add Varnish thread percent usage rule [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) [15:55:20] (03CR) 10BCornwall: prometheus: Add Varnish thread percent usage rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [15:55:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw[1373,1384-1385,1387].eqiad.wmnet [15:55:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1373,1384-1385,1387].eqiad.wmnet [15:55:31] (03Merged) 10jenkins-bot: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:56:28] 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) [15:56:45] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) >>! In T317169#8512445, @jcrespo wrote: > I would like first a technical rev... 
[15:57:07] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) a:05Dzahn→03None [15:57:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:57:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:58:29] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:58:52] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:58:56] (03PS1) 10Volans: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878124 [15:59:03] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:59:17] !log reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10. [15:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:42] (03Abandoned) 10Volans: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878124 (owner: 10Volans) [16:00:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [16:00:55] (03Abandoned) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/878117 (owner: 10Ayounsi) [16:01:25] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2002.codfw.wmnet with reason: host reimage [16:01:34] (03PS1) 10Ottomata: admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576) [16:02:24] (03PS1) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878127 [16:03:40] (03CR) 10Ottomata: [C: 03+2] admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:03:50] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [16:03:55] (03CR) 10Btullis: [C: 03+1] admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:04:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [16:04:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2002.codfw.wmnet with reason: host reimage [16:04:48] 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) [16:04:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 
10https://gerrit.wikimedia.org/r/878127 (owner: 10Ayounsi) [16:05:08] (03PS1) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 and connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 [16:05:27] (03CR) 10CI reject: [V: 04-1] Update analytics_text conf compatibility with airflow2.3.4 and connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (owner: 10Stevemunene) [16:06:38] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878127 (owner: 10Ayounsi) [16:08:06] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2003.codfw.wmnet with OS bullseye [16:08:25] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:08:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After the incident', diff saved to https://phabricator.wikimedia.org/P42947 and previous config saved to /var/cache/conftool/dbconfig/20230110-160851-root.json [16:08:57] (03PS2) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [16:09:24] (03CR) 10CI reject: [V: 04-1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:09:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:10:19] (03CR) 10Filippo Giunchedi: "Idea generally LGTM, pending Legal's requirement re: words/phrases we should be looking for" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) 
[16:10:30] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:11:16] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [16:12:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:52] (03CR) 10Herron: [C: 03+2] update role_contacts for thanos (front|back)end (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [16:14:45] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2002.codfw.wmnet with OS bullseye [16:15:27] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) Changed kafka topic retention time to 2 days instead of the default 7. ` cgoubert@kafka-logging1001:~$ kafka topic... 
[16:18:46] (03CR) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE)) [16:19:47] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [16:20:05] 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) [16:20:26] (03PS3) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [16:21:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2003.codfw.wmnet with reason: host reimage [16:23:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2003.codfw.wmnet with reason: host reimage [16:23:54] (03PS4) 10MVernon: hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) [16:23:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After the incident', diff saved to https://phabricator.wikimedia.org/P42948 and previous config saved to /var/cache/conftool/dbconfig/20230110-162356-root.json [16:24:03] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye [16:25:38] 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) [16:26:11] (03CR) 10MVernon: hiera: move swift credentials into common (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/868718 
(https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [16:26:41] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39045/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:27:55] (03PS3) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 [16:28:02] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [16:29:28] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:29:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:29:45] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron) [16:32:58] (03PS7) 10MVernon: swift: move accounts_keys to common hiera global_account_keys [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) [16:33:43] (03CR) 10MVernon: "Changed the name of the hiera entry, to make the transition easier." [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [16:36:05] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2003.codfw.wmnet with OS bullseye [16:36:19] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I agree that the updated name in hiera will make the transition easier." [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [16:36:21] 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Ladsgroup) Commons is now the biggest section and by far. It used to be so much worse that wikidata dwarfed in comparison. The thing is: It has a... [16:37:50] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:38:10] 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) @wiki_willy Jennifer needs access to the management network to be about to ssh into servers to access BIOS/IDRAC to troubleshoot and pull TSR report if needed. Can... 
[16:39:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After the incident', diff saved to https://phabricator.wikimedia.org/P42949 and previous config saved to /var/cache/conftool/dbconfig/20230110-163901-root.json [16:40:33] (03CR) 10Btullis: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [16:41:58] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) p:05Triage→03Medium [16:43:10] (03PS1) 10Lucas Werkmeister (WMDE): Fix test constructing HTMLFormField without parent [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621) [16:43:29] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Hey @Jelto, I've been working with Scott Bassett on trying to gain access. Unfortunately, I am not able to login... [16:43:30] (03PS4) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 [16:44:08] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42950 and previous config saved to /var/cache/conftool/dbconfig/20230110-164447-ladsgroup.json [16:45:26] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) >>! 
In T326649#8513203, @Papaul wrote: > @wiki_willy Jennifer needs access to the management network to be about to ssh into servers to access BIOS/IDRAC to troubleshoot... [16:45:44] PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:56] PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:58] PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:02] PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:04] PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:07] PROBLEM - Host cr2-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100% [16:46:07] PROBLEM - Host durum5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:21] PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100% [16:46:24] PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:24] PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:26] PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:26] PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:29] is this expected? [16:46:32] what on earth [16:46:35] surely not [16:46:36] hmmm let's depool eqsin [16:46:52] getting "Error: 502, Broken pipe" on my end (Philippines) [16:47:01] yes, let's depool eqsin [16:47:09] chlod: that’s the alerts [16:47:21] wow yeah [16:47:26] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:47:32] (03PS1) 10BBlack: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 [16:47:39] ^ Looking at it. 
[16:47:50] (03CR) 10Vgutierrez: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack) [16:47:52] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:47:52] if you need help, do shout [16:47:54] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:47:54] (03CR) 10Ssingh: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack) [16:47:57] (03CR) 10BBlack: [C: 03+2] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack) [16:48:02] (03CR) 10BBlack: [V: 03+2 C: 03+2] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack) [16:48:25] !log depooling eqsin from DNS [16:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] looks like both transport links to eqsin are down [16:49:11] maintenance? [16:49:16] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10Jelto) >>! In T323943#8513231, @KHurd-WMF wrote: > Hey @Jelto, I've been working with Scott Bassett on trying to gain acces... 
[16:49:26] XioNoX jynus Yes, I think it has to do with the depooling of eqsin [16:49:41] I 'll update status page [16:49:56] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:56] oh, I thought it was depooled [16:49:57] planned maintenance on the only working link [16:50:11] see PWIC225900 [16:50:16] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:22] PROBLEM - Host cr3-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:25] jynus: no, b.black has just depooled it as a response to all the p.ages [16:50:28] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:33] the other link has been on maintenance for a bit https://phabricator.wikimedia.org/T322529 [16:50:35] XioNoX: /o\ [16:50:40] (03PS5) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 [16:50:45] well, that should do it [16:51:01] can someone put up a statuspage update? [16:51:01] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10StephaneRebai) [16:51:08] taavi: already done [16:51:14] taavi: a.kosiaris is on it [16:51:19] (and types quicker than me, damnit) [16:51:32] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) @Jelto thanks for the reply i have already her SSH-key and I will personally be adding her to the group once I have the approval from Willy. Thanks [16:51:34] (03CR) 10JMeybohm: sre.ganeti.reimage: add new cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [16:52:00] missing eqsin from this graph? 
https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1 [16:52:24] (03PS10) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [16:52:30] jynus: looks like it [16:52:45] but also you won't have data if the links are down [16:52:56] maybe that's why [16:53:00] that's a bug of the dashboard... no data is being collected from eqsin [16:53:06] no you should have historical [16:53:30] and the site variable gets filled dynamically [16:53:48] vgutierrez: as in, a potential actionable, or something expected when no metrics are arriving? [16:53:55] (03PS11) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [16:54:05] jynus: actionable IMHO [16:54:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After the incident', diff saved to https://phabricator.wikimedia.org/P42951 and previous config saved to /var/cache/conftool/dbconfig/20230110-165406-root.json [16:54:11] vgutierrez: good [16:54:30] nel looks good after a spike [16:54:49] but we may be missing metrics [16:54:54] not for NEL [16:55:03] NEL sends to the next best DC [16:55:24] basically I am trying to see impact [16:55:28] RECOVERY - Host netflow5002 is UP: PING OK - Packet loss = 0%, RTA = 247.25 ms [16:55:28] RECOVERY - Host durum5002 is UP: PING OK - Packet loss = 0%, RTA = 238.90 ms [16:55:29] I know dns lags a bit [16:55:30] RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 244.81 ms [16:55:30] RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 242.79 ms [16:55:30] RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 232.47 ms [16:55:30] RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 233.62 ms [16:55:30] RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 250.70 ms [16:55:32] RECOVERY - Host ncredir5002 is UP: PING OK - 
Packet loss = 0%, RTA = 231.49 ms [16:55:33] RECOVERY - Host cr2-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 225.39 ms [16:55:34] RECOVERY - Host cr3-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 245.89 ms [16:55:34] RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 253.59 ms [16:55:34] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.01 ms [16:55:36] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 231.29 ms [16:55:46] let's keep it depooled until the end of the maintenance at least [16:55:50] +1 [16:55:52] +! [16:55:54] +1 [16:55:54] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 245.15 ms [16:55:55] sigh... [16:56:00] RECOVERY - Host cr3-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 243.03 ms [16:56:02] RECOVERY - Host bast5002 is UP: PING OK - Packet loss = 0%, RTA = 254.35 ms [16:56:08] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 251.84 ms [16:56:14] the other link should come back in 2 days, but its ETA has been pushed multiple times [16:56:16] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 237.02 ms [16:57:00] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:08] (03PS12) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [16:57:30] ok, interesting question, should I just switch the incident in the status page to minor and degraded performance for reading? [16:57:45] I guess that reflects the current reality better? 
[16:57:48] akosiaris: if it's depooled yeah [16:58:02] degraded perf in the apac region [16:58:08] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:20] or we could go all glass-half-full and say we have an incidental editing performance improvement for users in australia :) [16:58:38] (03CR) 10MVernon: "Updated to take review comments on board (thanks!) and changed name of the hiera item." [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [16:58:44] (because they don't bounce backwards latency-wise through the cache site to reach the core) [16:58:54] (03PS1) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) [16:59:51] XioNoX: akosiaris: I am here, anything left to help with? Sorry, I failed to see it in meeting [16:59:53] akosiaris: I wouldn't even call eqsin being depooled an incident tbh [16:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42952 and previous config saved to /var/cache/conftool/dbconfig/20230110-165952-ladsgroup.json [17:00:03] (from a status page perspective) [17:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700). [17:00:05] Krinkle: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:12] cdanis: it's depooled because of the incident though [17:00:15] it was briefly user-impacting though, ~5-10 minutes or so [17:00:23] yeah, that ^ [17:00:26] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:00:34] eqsin going down unexpectedly is an incident [17:00:49] bblack: maybe with this new cable we could have australia and NZ sent to ulsfo - https://www.submarinecablemap.com/submarine-cable/southern-cross-next [17:00:54] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [17:01:00] I noted it on the list of incidents as "2023-01-10 eqsin network outage" [17:01:04] but once we depool eqsin, that's resolved imo [17:01:10] (03PS2) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) [17:01:13] but feel free to update the title if it is not great [17:01:43] +1 with cdanis, otherwise it would mean we had to create an incident each time we depool a site for maintenance [17:01:48] (to be consistent) [17:01:56] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10MarkTraceur) Approve as manager! 
[17:02:02] XioNoX: maybe, can re-measure and see [17:02:10] (03PS3) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) [17:02:15] what I think is at least tracking incidents is nice that page and are not false positives [17:02:32] on the sheet (or otherwise, when there is a better place) [17:02:38] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 531 days) https://wikitech.wikimedia.org/wiki/Logs [17:02:44] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:02:45] jynus: i'm mostly talking from a public status page perspective [17:03:04] sure [17:03:22] (ProbeDown) firing: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:28] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [17:03:35] !log ayounsi@deploy1002 deploy aborted: netbox-next to 3.2.9 (duration: 00m 07s) [17:03:41] that also works for tracking, which is what I am interested in [17:03:49] (03CR) 10Btullis: [C: 03+1] flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:04:23] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:05:21] When I access the 'jinxer-wm' link it only shows a coffee mug, does that mean that the 
incident is resolved? [17:07:27] fwiw sites like cloudflare will log a status of "re-routed" which we could consider borrowing [17:08:08] cloudflare's role and user base is quite different from ours IMO :) [17:09:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:41] denisse: please take no offense, but I think IRC messages and alertmanager/paging has degraded but not being useful in some contexts [17:09:49] by* [17:09:54] denisse: did jinxer-wm (aka alertmanager) page for the eqsin outage? I don't think it did, I think it was only icinga [17:09:56] that is normal, it is "new" [17:10:18] but I hope we can do better having more meaningful alert test and links [17:10:26] (03CR) 10JMeybohm: [C: 03+2] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [17:10:32] (ProbeDown) resolved: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:46] denisse: I sent the ACK via SMS just now to get it out of "active" state [17:10:47] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:11:00] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:11:02] mutante: oh, did it not auto-resolve? [17:11:40] cdanis: the "packet loss 100% to cr2-eqsin" did not [17:11:46] mutante: yeah, but the links should be to something that doesn't disappear when clicked after (e.g. 
"state is now green/acked") [17:11:47] 🙃 [17:12:05] again, this is nitpicking, please don't take my complains to seriously [17:12:27] (03CR) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE)) [17:12:55] I just generated a doc with ~200 complains from several people so I am lately very nitpicky [17:13:01] jynus: on the contrary, it helps to see where we can improve. :) [17:13:02] I agree with jynus that the IRC alerting isn't working as it used to anymore [17:13:39] I mean, let's be real- it never was great, but new tooling is forcing us to work harder :-D [17:13:48] it just needs time [17:14:21] I think in some cases it is the aggregation- which worked nicely to reduce spam [17:14:40] but in some aspects loosed specificity [17:14:45] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42953 and previous config saved to /var/cache/conftool/dbconfig/20230110-171457-ladsgroup.json [17:15:07] it's a lose-lose situation, nothing will be perfect :-) [17:26:57] jynus: https://en.wikipedia.org/wiki/Kobayashi_Maru [17:27:39] nah, we can actually do better, it is just finding the time to improve stuff [17:28:16] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: help [17:28:17] !log ayounsi@deploy1002 deploy aborted: help (duration: 00m 01s) [17:28:18] Although now that I have you here, let me ask you something you may be able to help with making alerting better (I pm you) [17:28:53] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [17:29:04] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 11s) [17:29:05] 10ops-eqiad, 10DBA: Inbound interface 
errors - https://phabricator.wikimedia.org/T325652 (10Jclark-ctr) @marostegui sorry yes I will need it depooled I did have to run out of data center how long can it be depooled? If you depooled it I can swap it today and bring it up tomorrow? [17:29:51] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr yeah, it can be depooled for 24h without any problem. I will get it ready now for you. [17:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42954 and previous config saved to /var/cache/conftool/dbconfig/20230110-173002-ladsgroup.json [17:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T325652', diff saved to https://phabricator.wikimedia.org/P42955 and previous config saved to /var/cache/conftool/dbconfig/20230110-173027-marostegui.json [17:30:30] T325652: Inbound interface errors - https://phabricator.wikimedia.org/T325652 [17:31:35] (03PS1) 10Marostegui: db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878144 [17:32:00] (03CR) 10Marostegui: [C: 03+2] db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878144 (owner: 10Marostegui) [17:32:53] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr the host is ready for you to work on it anytime. I have left it ON, but the service is stopped, so if you need to power it off, you can do it anytime. Thanks! 
[17:36:16] RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:37:32] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 maint', diff saved to https://phabricator.wikimedia.org/P42956 and previous config saved to /var/cache/conftool/dbconfig/20230110-173807-ladsgroup.json [17:39:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:39:08] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10wiki_willy) Approved from my end. Thanks! [17:39:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:42:49] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH) [17:42:57] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH) [17:44:27] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Ah, thank you @Jelto, that's what I needed to know. That username allowed me to login. @Ottomata can I have yo... 
[17:48:20] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1229 - https://phabricator.wikimedia.org/T326661 (10Marostegui) [17:48:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [17:48:36] !log Finished rolling reboots of eqiad appservers [17:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:45] !log run populateCulActor on all wikis # T325484 [17:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] T325484: Run PopulateCulActor on all wikis - https://phabricator.wikimedia.org/T325484 [17:55:21] (03PS2) 10Jbond: puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 [17:55:26] !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagemaster2001.codfw.wmnet with OS bullseye [17:57:32] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [17:59:30] 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) Thank you. 
[17:59:37] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:59:39] (03CR) 10CI reject: [V: 04-1] puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond) [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1800) [18:00:53] (03PS1) 10BCornwall: varnish: Alert on high thread count [alerts] - 10https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723) [18:01:38] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bullseye [18:01:41] !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye [18:01:49] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10Ottomata) Done, you should have an email at khurd@wikimedia.org with instructions. 
[18:06:28] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2001.codfw.wmnet with reason: host reimage [18:06:46] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:07:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:07:02] that's me [18:07:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:07:58] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:08:24] (03PS1) 10Dzahn: netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 [18:09:21] jayme: thanks, ACK [18:09:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2001.codfw.wmnet with reason: host reimage [18:10:19] (03CR) 10CI reject: [V: 04-1] netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn) [18:11:11] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 
(10jcrespo) @Slaporte I got the technical ok to deploy the new version check. Here is... [18:11:26] (03CR) 10Dzahn: "Duplicate declaration: File[/var/lib/scap] is already declared ... duuuh :/" [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn) [18:12:16] (03CR) 10Dzahn: [C: 04-1] netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn) [18:12:34] (03PS2) 10Dzahn: netbox: add scap::target to allow scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 [18:12:46] (JobUnavailable) firing: (7) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:13:47] (03Abandoned) 10Dzahn: netbox: add scap::target to allow scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn) [18:15:13] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH) [18:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:15:27] mutante: I missed the puppet window I think? Or maybe it's not every week? [18:16:13] Krinkle: hmm. there is one on the deployment calendar but I rarely ever see those being used. what do you have? [18:16:30] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [18:16:32] I see.. eh.. 
looking [18:16:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [18:17:07] jouncebot: refresh [18:17:07] I refreshed my knowledge about deployments. [18:17:34] I am not sure if the bot failed to ping..but I am looking at the first patch [18:18:40] jbo.nd and r.zl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700). [18:18:47] hmm yea.. so.. it does not actually have reviews [18:19:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage [18:19:43] while I am willing to do that.. puppet window would be for stuff that is already +1 [18:19:57] or just the regular gerrit process without having to be in any window [18:20:56] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc [18:21:08] (03PS1) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714) [18:21:10] (03PS1) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878169 (https://phabricator.wikimedia.org/T314714) [18:21:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage [18:22:44] (03PS4) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [18:23:21] mutante: ah, I see, I didn't realize. Okay, thanks! [18:23:34] Krinkle: I looked at the 3 patches but I am not comfortable merging those. they don't have +1 and they touch mediawiki core lib, site-wide apache config and security.
sorry [18:23:39] !log jayme@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc [18:23:44] though the doc one I might be talked into .. [18:23:56] !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagemaster2001.codfw.wmnet with OS bullseye [18:24:09] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [18:24:20] me again, sorry [18:24:30] Krinkle: I would do the "relax CSP rules for taint demo" if you need it though? [18:25:32] I'm reimaging the staging-codfw k8s cluster - and I should probably have said that before, sorry d.enisse|m.utante [18:27:53] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal [18:28:05] (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:28:43] mutante: yeah, that'd be nice to close out the Phan work [18:29:09] well, you left a pretty detailed explanation, so yea [18:29:18] (03CR) 10Dzahn: [C: 03+2] doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [18:29:25] also since that is doc hosts and not global [18:29:27] doing it [18:29:36] (03PS1) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [18:29:38] (03CR) 10MVernon: "CI run with this and the associated puppet change -" [labs/private] - 10https://gerrit.wikimedia.org/r/868718
(https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [18:29:47] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc [18:29:58] (03CR) 10MVernon: "CI run of this and the labs/private change - https://puppet-compiler.wmflabs.org/output/868721/39048/" [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [18:30:17] (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [18:32:13] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Jclark-ctr) @marostegui sfp-t has been replaced Let me know if you still see errors [18:33:06] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:34:14] Krinkle: it has been deployed on doc1002 and doc2001. puppet did refresh (but not hard restart) apache [18:34:33] go ahead and test if you want [18:34:34] (03PS1) 10Jbond: dhcp: disable no-member check [software/spicerack] - 10https://gerrit.wikimedia.org/r/878172 [18:34:37] thx [18:34:38] checking [18:35:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bullseye [18:35:53] mutante: confirmed, the new header is coming through [18:36:12] Krinkle: cool, good [18:36:22] 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) Unfortunately, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406/ has caused logspam every ten minutes in /var/log/messages. ` 03:27 brett: BTW..... 
[18:38:05] (ConfdResourceFailed) resolved: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:38:10] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye [18:38:55] (03PS2) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [18:39:03] (03PS3) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [18:39:40] (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [18:43:56] (03PS4) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [18:44:33] (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [18:44:38] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet [18:49:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet [18:56:19] (03PS5) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [18:58:05] (03PS2) 10JMeybohm: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) [18:58:07] (03PS1) 10JMeybohm: cert-manager: Fix chart 
version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176 [18:59:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Dzahn) @Papaul please still add the key here on the ticket [18:59:37] (03CR) 10Dzahn: "fixed date format which made CI downvote, confirmed UID in LDAP, has approval, can't check SSH key though, but otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:00:04] jeena and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1900). [19:00:11] (03CR) 10Majavah: [C: 04-1] admin: Add Jennifer Hancock to the datacenter-ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:01:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:02:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [19:02:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42958 and previous config saved to /var/cache/conftool/dbconfig/20230110-190235-ladsgroup.json [19:02:42] train is delayed on some patches and will resume once they have applied successfully [19:02:47] (03CR) 10CI reject: [V: 04-1] Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [19:02:52] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39049/console" [puppet] -
10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [19:03:12] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176 (owner: 10JMeybohm) [19:07:16] (03CR) 10JMeybohm: sre.ganeti.reimage: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:08:19] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet [19:08:41] (03Merged) 10jenkins-bot: cert-manager: Fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176 (owner: 10JMeybohm) [19:09:23] (03PS3) 10JMeybohm: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) [19:10:54] (03PS1) 10Effie Mouzeli: site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733) [19:12:53] (03PS6) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [19:15:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet [19:16:31] (03CR) 10Dzahn: [C: 03+2] "thanks for fixing my oversight and reviews" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn) [19:17:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42962 and previous config saved to /var/cache/conftool/dbconfig/20230110-191740-ladsgroup.json [19:19:03] (03CR) 10Dzahn: ""The Affiliations Committee will be on leave from December 21st to January 6th and will reply once we return. 
Please send an email to this" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [19:19:32] (03PS7) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [19:19:49] (03PS8) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [19:20:29] (03CR) 10CI reject: [V: 04-1] admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:21:05] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) Thanks John! I will wait for @ayounsi to confirm before repooling this host. [19:21:57] (03PS1) 10Ottomata: flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519) [19:22:45] (03PS9) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) [19:23:41] (03CR) 10JMeybohm: [C: 03+2] Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [19:24:01] (03CR) 10Dzahn: "so.. thanks Majavah. that was correct, it should be uid. that being said, Jennifer has 2 users in LDAP, both with the same email address." 
[puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:29:05] (03Merged) 10jenkins-bot: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [19:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1158 maint', diff saved to https://phabricator.wikimedia.org/P42963 and previous config saved to /var/cache/conftool/dbconfig/20230110-192929-ladsgroup.json [19:30:50] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:31:13] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:31:26] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:31:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [19:31:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [19:31:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:31:54] (03PS1) 10BCornwall: varnish: Revert export of Prometheus params [puppet] - 10https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723) [19:31:57] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[19:32:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:32:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42964 and previous config saved to /var/cache/conftool/dbconfig/20230110-193245-ladsgroup.json [19:32:46] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:32:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42965 and previous config saved to /var/cache/conftool/dbconfig/20230110-193253-ladsgroup.json [19:35:13] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:37:50] !log dancy@deploy1002 Installing scap version "4.32.0" for 1 hosts [19:37:55] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:38:01] !log dancy@deploy1002 Installation of scap version "4.32.0" completed for 1 hosts [19:38:11] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:38:51] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:39:01] (03PS1) 10Marostegui: mariadb: Adjust new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/878182 (https://phabricator.wikimedia.org/T326661) [19:39:50] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:42:11] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:42:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Adjust new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/878182 (https://phabricator.wikimedia.org/T326661) (owner: 10Marostegui) [19:43:32] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[19:44:11] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10ayounsi) https://librenms.wikimedia.org/graphs/to=1673379600/id=15307/type=port_errors/from=1673293200/ looks good [19:45:05] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) Cool, repooling then. @ayounsi do you want me to close this ticket once I am done? [19:45:12] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39050/console" [puppet] - 10https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [19:45:18] (03PS1) 10Marostegui: Revert "db1107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/878149 [19:47:02] (03CR) 10Marostegui: [C: 03+2] Revert "db1107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/878149 (owner: 10Marostegui) [19:47:07] (03PS1) 10JMeybohm: staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340) [19:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42968 and previous config saved to /var/cache/conftool/dbconfig/20230110-194750-ladsgroup.json [19:47:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42969 and previous config saved to /var/cache/conftool/dbconfig/20230110-194756-root.json [19:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42970 and previous config saved to /var/cache/conftool/dbconfig/20230110-194757-ladsgroup.json [19:49:19] (03PS1) 10Eevans: cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186 [19:49:46] 
!log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [19:49:49] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [19:51:31] (03PS10) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:51:37] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [19:52:22] 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) [19:52:43] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 01m 06s) [19:54:04] (03PS1) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) [19:54:10] (03CR) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:54:30] (03CR) 10Dzahn: "I think it's ok now." 
[puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [19:55:02] (03CR) 10CI reject: [V: 04-1] Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [19:55:31] (03PS2) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) [19:57:49] 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) @Jclark-ctr please note that mc2020 and mc2021 are probably still bootable due to a failure during running the decomm script [19:58:19] (03CR) 10JMeybohm: [C: 03+2] staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [19:58:20] 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) a:05jijiki→03Jclark-ctr [19:58:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet [20:00:23] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [20:00:58] !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 [20:01:33] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:01:55] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[20:02:40] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 01m 42s) [20:03:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42971 and previous config saved to /var/cache/conftool/dbconfig/20230110-200301-root.json [20:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42972 and previous config saved to /var/cache/conftool/dbconfig/20230110-200302-ladsgroup.json [20:03:20] (03Merged) 10jenkins-bot: staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm) [20:04:16] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:04:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:05:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet [20:06:53] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:07:04] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:07:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [20:08:04] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:08:06] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:08:24] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [20:08:31] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[20:16:10] (03PS1) 10JMeybohm: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) [20:17:51] (03PS2) 10JMeybohm: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340) [20:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42974 and previous config saved to /var/cache/conftool/dbconfig/20230110-201806-root.json [20:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42975 and previous config saved to /var/cache/conftool/dbconfig/20230110-201807-ladsgroup.json [20:18:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [20:18:36] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [20:26:18] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:26:34] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:28:20] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:28:27] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:28:36] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:29:11] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:31:21] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:31:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[20:31:38] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:32:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [20:33:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42976 and previous config saved to /var/cache/conftool/dbconfig/20230110-203311-root.json [20:33:23] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:36:53] (03PS1) 10Dzahn: Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150 [20:37:00] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:37:17] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42977 and previous config saved to /var/cache/conftool/dbconfig/20230110-204816-root.json [20:50:55] 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) 05Open→03Resolved The host is repooled. Closing. Thanks everyone! 
[20:51:36] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [20:51:48] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki) [20:51:56] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [20:52:38] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) 05Open→03Resolved a:03jijiki Bluntly closing this as we are moving mediawiki to kubernetes [20:52:49] (03CR) 10Dzahn: [C: 03+1] "we confirmed the maintenance has been declared over" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn) [20:53:19] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) 05Open→03Resolved [20:54:53] (03CR) 10BBlack: [C: 03+1] Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn) [20:55:14] (03CR) 10Dzahn: [C: 03+2] Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn) [20:55:44] !log repooling eqsin [20:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T2100).
[21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:29] hi [21:01:46] MatmaRex: wmf.18 hasn't been deployed yet but that shouldn't affect you, right? [21:02:52] it shouldn't [21:03:04] 👍 [21:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42978 and previous config saved to /var/cache/conftool/dbconfig/20230110-210321-root.json [21:05:24] is anyone available to do the deployment for me? :) [21:05:27] I can deploy if no one else is around [21:05:34] i can also [21:06:03] (03CR) 10Zabe: [C: 03+2] Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:06:16] zabe: first deploy?:) congrats! 
[21:06:19] (03PS3) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) [21:06:23] (03CR) 10Zabe: [C: 03+2] Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:49] (03Merged) 10jenkins-bot: Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:07:03] (03Merged) 10jenkins-bot: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:07:28] actually no, but still thanks :) [21:08:03] !log zabe@deploy1002 Started scap: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]] [21:08:08] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:08:08] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:09:39] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Everyone, you all are awesome. Thank you for all the help and assistance. I will close this ticket! 
[21:09:44] (03PS3) 10Dzahn: scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 [21:09:47] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) 05In progress→03Resolved [21:09:51] !log zabe@deploy1002 zabe and zabe and matmarex: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:10:04] (03CR) 10CI reject: [V: 04-1] scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn) [21:10:50] (03CR) 10Dzahn: "I am doing it this way with assert_type() because you said on another change you think the UID should not be a class parameter.. but I sti" [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn) [21:11:06] MatmaRex, can you test? [21:11:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [21:11:22] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet [21:11:42] zabe: yeah i was just looking. everything is working correctly [21:11:55] nice, syncing [21:12:27] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:12:51] (03CR) 10Dzahn: "same here, I am using assert_type to have it both ways, validate data but also not make the UID a class parameter.. 
because you said so el" [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn) [21:14:08] (03PS4) 10Dzahn: scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 [21:14:57] (03PS2) 10Dzahn: phabricator: use specific data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877275 [21:17:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:16] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/877277/39051/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/877275 (owner: 10Dzahn) [21:17:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet [21:17:31] (03CR) 10Dzahn: [C: 03+2] "cc: Hashar we now have a data type to validate those" [puppet] - 10https://gerrit.wikimedia.org/r/877275 (owner: 10Dzahn) [21:17:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [21:18:12] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]] (duration: 10m 08s) [21:18:16] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:18:16] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:18:26] MatmaRex, should be live [21:18:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42979 and previous config saved to 
/var/cache/conftool/dbconfig/20230110-211826-root.json [21:18:35] thanks zabe [21:19:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [21:20:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [21:20:07] yw [21:21:17] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:53] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:22:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [21:27:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [21:28:09] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [21:28:21] (03PS4) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) [21:28:52] If there are no more backports I would like to deploy the train now [21:29:57] (03PS5) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) [21:32:10] (03PS6) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) [21:33:20] (03CR) 10Herron: slo_dashboards: dynamic slo dashboard panels (033 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [21:34:30] jeena, I'm done with deploying, so I think you can go ahead [21:34:44] Thanks zabe [21:35:48] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581) [21:35:50] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [21:36:30] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [21:36:51] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.18 refs T325581 [21:36:55] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [21:52:28] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[21:52:52] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10nskaggs) 05In progress→03Resolved As https://wikitech.wikimed... [21:54:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [21:54:42] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet [21:54:48] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [21:56:18] (03CR) 10Ottomata: [C: 03+2] flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [21:56:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [22:00:02] (03PS1) 10Marostegui: db1206: No longer testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/878202 (https://phabricator.wikimedia.org/T326669) [22:00:37] (03CR) 10Marostegui: [C: 03+2] db1206: No longer testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/878202 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [22:01:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet [22:01:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [22:02:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH) [22:02:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH) [22:04:55] 
(LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:05:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH) [22:08:30] (03PS1) 10Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203 [22:09:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206 T325046', diff saved to https://phabricator.wikimedia.org/P42980 and previous config saved to /var/cache/conftool/dbconfig/20230110-220942-marostegui.json [22:09:45] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [22:09:46] T325046: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 [22:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:10:01] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [22:10:10] 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) 05Stalled→03Open a:05Marostegui→03Jclark-ctr @Jclark-ctr we want to test that the RAID monitoring works fine. Can you pull out a hard disk... 
[22:10:17] (CR) CI reject: [V: -1] httpbb: add tests for phabricator, git, bugzilla [puppet] - https://gerrit.wikimedia.org/r/878203 (owner: Dzahn) [22:10:52] (PS2) Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) [22:11:16] (CR) Dzahn: "let's add some tests first -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/878203" [puppet] - https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn) [22:12:41] (CR) CI reject: [V: -1] httpbb: add tests for phabricator, git, bugzilla [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn) [22:12:46] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:13:20] (PS3) Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) [22:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:16:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:17:34] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:18:02] (PS1) Dzahn: httpbb: add SPDX license headers for some test files [puppet] - https://gerrit.wikimedia.org/r/878205 [22:18:44] RECOVERY - PyBal backends health check on lvs2009 is OK:
PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:18:46] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:21:55] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.18 refs T325581 (duration: 45m 04s) [22:21:59] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [22:22:57] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (RobH) [22:23:04] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (RobH) [22:24:26] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (RobH) @bblack, The ordering task had the racking details populated by @kofori but I suspect there is a mistake in them. This order and racking is to replace dns100[12] and authdns1001...
[22:28:39] (PS1) Zabe: Start writing to rev_comment_id on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) [22:28:40] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.40.0-wmf.14, 1.40.0-wmf.13 (duration: 02m 35s) [22:29:57] (CR) Dzahn: "people in CC, I don't expect you to actually review the assertions, I have tested those, but I wanted to share this is a thing and that we" [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn) [22:30:34] (PS2) Zabe: Start writing to rev_comment_id on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) [22:34:09] deploying to group0 now [22:34:33] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581) [22:34:35] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot) [22:35:17] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot) [22:37:04] (CR) Dzahn: "how this is used:" [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn) [22:38:19] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/878203/39055/deploy1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn) [22:38:52] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:31] win 11
[22:40:00] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:41] ops-codfw, DC-Ops, Traffic: Q3:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (RobH) [22:42:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247 [22:42:51] ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (RobH) [22:42:52] T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms - https://phabricator.wikimedia.org/T324247 [22:42:54] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.18 refs T325581 [22:42:57] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [22:43:35] ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (RobH) [22:44:56] ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (RobH) a:BBlack @bblack, The racking details provided on ordering task T325230 list hostnames dns200[345] for this, but they are replacing dns200[12] and authdns2001. Should these instead b... [22:45:25] (PS1) Ottomata: Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) [22:47:29] (CR) Ottomata: "Probably have a bunch of things wrong here; I've never written a new helmfile service for the dse-k8s-cluster."
[deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [22:48:04] (CR) CI reject: [V: -1] Add flink-app-example service [deployment-charts] - https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [22:52:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:57:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:04:23] SRE, ops-eqiad, DC-Ops, Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (RobH) a:BBlack→Jclark-ctr >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >... [23:04:42] SRE, ops-codfw, DC-Ops, Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (RobH) a:BBlack→Papaul >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >> >...
[23:17:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:18:01] SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH) [23:18:34] SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH) [23:19:29] SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH) [23:19:50] SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH) [23:19:58] jouncebot, nowandnext [23:19:59] No deployments scheduled for the next 7 hour(s) and 40 minute(s) [23:19:59] In 7 hour(s) and 40 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0700) [23:20:06] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) (owner: Zabe) [23:20:48] (Merged) jenkins-bot: Start writing to rev_comment_id on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) (owner: Zabe) [23:21:16] !log zabe@deploy1002 Started scap: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]] [23:21:20] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:22:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:22:54] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [23:24:50] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:09] SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Dzahn) We can confirm we served a lot of 5xx's in a time span from about 21:00 to 21:05 UTC yesterday. The reason was an overloaded data... [23:30:56] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]] (duration: 09m 39s) [23:30:59] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:33:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:36] SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Dzahn) Open→Resolved a:Dzahn The actual incident is over, it was mitigated within minutes. Regarding the report it's still...
[23:37:46] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:39:37] I did touch the httpbb tests but not those for appservers.. making sure that is not me ^ [23:46:33] !log cumin2002 - sudo systemctl status httpbb_hourly_appserver [23:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:57] yea, that was unrelated [23:47:16] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:20] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:57:10] (CR) BCornwall: [V: +1] "PS3 → PS4 omits the thread_pools parameter based on discussion on IRC, but I'm thinking that it's valuable to keep it around just in case " [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall) [23:58:30] !log krinkle@deploy1002 Started deploy [integration/docroot@b7c82a3]: (no justification provided) [23:58:45] !log krinkle@deploy1002 Finished deploy [integration/docroot@b7c82a3]: (no justification provided) (duration: 00m 15s)