[00:06:14] <wikibugs>	 (CR) Krinkle: [C: +1] "Scheduled for tomorrow https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)
[00:46:43] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:45] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:17] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247
[00:48:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:20] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[00:49:49] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:13:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:33:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:28] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul) p:Triage→Medium
[01:37:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:07] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [integration/docroot@f59119c]: (no justification provided)
[01:50:21] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [integration/docroot@f59119c]: (no justification provided) (duration: 00m 14s)
[01:57:46] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:38] <wikibugs>	 (CR) Eevans: "I think what @joe was alluding to was that if you used a name other than `profile::swift::accounts_keys` for the Hash[String Hash] structu" [labs/private] - https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: MVernon)
[02:08:40] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[02:08:43] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[02:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:17:08] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (Papaul)
[02:17:46] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:40:43] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:41:08] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[02:41:11] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[02:42:15] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:46:49] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[02:46:52] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[03:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0300)
[03:01:59] <icinga-wm>	 PROBLEM - Check systemd state on elastic1092 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:50] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581)
[03:07:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot)
[03:12:10] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[03:12:13] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[03:24:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/876376 (https://phabricator.wikimedia.org/T325581) (owner: TrainBranchBot)
[03:25:21] <icinga-wm>	 RECOVERY - Check systemd state on elastic1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Eileenmcnaughton) Open→Resolved OK - I think this is resolved - my understanding from https://phabricator.wikimedia.org/T321494 is that 'done' looks like 'I can access  ht...
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0400)
[04:04:47] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:40] <wikibugs>	 SRE, Observability-Alerting, WMF-Legal, WikimediaMessages, observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (Slaporte) Thanks for resolving this while I was out. There are no legal concer...
[05:27:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:16] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39018/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[05:32:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:37] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync idm-test1001 - slyngshede@cumin1001"
[05:39:32] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync idm-test1001 - slyngshede@cumin1001"
[05:40:15] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39019/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[05:40:50] <wikibugs>	 (PS1) KartikMistry: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278)
[05:44:21] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:52] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39020/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[06:01:37] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:02:11] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:22:06] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39021/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[06:22:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133
[06:22:31] <stashbot>	 T326133: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T326133
[06:22:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133
[06:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1100 with weight 0 T326133', diff saved to https://phabricator.wikimedia.org/P42938 and previous config saved to /var/cache/conftool/dbconfig/20230110-062309-ladsgroup.json
[06:42:14] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/874826 (https://phabricator.wikimedia.org/T326133) (owner: Gerrit maintenance bot)
[06:42:21] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1100 to s5 master [puppet] - https://gerrit.wikimedia.org/r/874826 (https://phabricator.wikimedia.org/T326133) (owner: Gerrit maintenance bot)
[06:51:39] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (ayounsi) a:Cmjohnson→Jclark-ctr @Jclark-ctr Could you work with @Marostegui to get this SFP-T replaced? see the errors on https://librenms.wikimedia.org/device/device=160/tab=port/port=15307/
[06:52:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:57:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0700).
[07:00:14] <Amir1>	 o/
[07:01:42] <Amir1>	 !log Starting s5 eqiad failover from db1130 to db1100 - T326133
[07:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:45] <stashbot>	 T326133: Switchover s5 master (db1130 -> db1100) - https://phabricator.wikimedia.org/T326133
[07:01:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T326133', diff saved to https://phabricator.wikimedia.org/P42939 and previous config saved to /var/cache/conftool/dbconfig/20230110-070152-ladsgroup.json
[07:01:56] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[07:02:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T326133', diff saved to https://phabricator.wikimedia.org/P42940 and previous config saved to /var/cache/conftool/dbconfig/20230110-070223-ladsgroup.json
[07:03:10] <XioNoX>	 !log remove static routes for legacy dns-rec-lb IPs - T239993
[07:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:13] <stashbot>	 T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993
[07:03:56] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[07:05:03] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/874827 (https://phabricator.wikimedia.org/T326133) (owner: Gerrit maintenance bot)
[07:05:12] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/874827 (https://phabricator.wikimedia.org/T326133) (owner: Gerrit maintenance bot)
[07:06:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 T326133', diff saved to https://phabricator.wikimedia.org/P42941 and previous config saved to /var/cache/conftool/dbconfig/20230110-070628-ladsgroup.json
[07:10:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[07:10:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[07:11:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:14:59] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if dns update is needed after change of rec-dns-lb IPs status - ayounsi@cumin1001"
[07:16:02] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: check if dns update is needed after change of rec-dns-lb IPs status - ayounsi@cumin1001"
[07:16:02] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:16:30] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) a:ayounsi→BCornwall Static routes removed!  Next step is to remove the IPs from the servers: That means removing everything related to "legacy_vip" in Puppet https://github.com/wikime...
[07:19:14] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) asw2-b-eqiad:fpc1:1/1 is still showing errors...  Next step will be to replace the fiber between the two (already replaced) optics.  @Jclark-ctr let me know when woul...
[07:22:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet
[07:22:22] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2031.codfw.wmnet
[07:22:49] <wikibugs>	 (PS1) Ayounsi: Depool ulsfo for network maintenance [dns] - https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532)
[07:23:07] <wikibugs>	 (PS2) Ayounsi: Depool ulsfo for network maintenance [dns] - https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532)
[07:27:38] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[07:28:30] <wikibugs>	 (CR) Ayounsi: [C: +2] Depool ulsfo for network maintenance [dns] - https://gerrit.wikimedia.org/r/877221 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[07:28:54] <XioNoX>	 !log depool ulsfo for network maintenance - T316532
[07:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:58] <stashbot>	 T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[07:32:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:33:21] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[07:33:53] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mc2044.codfw.wmnet
[07:36:04] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[07:37:14] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2031.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[07:37:14] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:37:16] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2031.codfw.wmnet
[07:37:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:42:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:42:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:45:27] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2032.codfw.wmnet
[07:52:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:52:57] <wikibugs>	 (PS3) KartikMistry: ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721)
[07:55:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T0800).
[08:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:01:33] * kart_ is around and will go for deployment..
[08:02:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[08:02:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721) (owner: KartikMistry)
[08:02:55] <wikibugs>	 (Merged) jenkins-bot: ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721) (owner: KartikMistry)
[08:03:34] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]]
[08:03:39] <stashbot>	 T324721: Modify Machine Translation in Czech Wikipedia by 20% or more to publish a translation  - https://phabricator.wikimedia.org/T324721
[08:05:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (GET clusterinformations) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:06:34] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39022/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[08:07:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:08:24] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:09:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[08:09:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[08:10:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:11:46] <wikibugs>	 (PS7) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[08:15:00] <wikibugs>	 SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Peachey88) p:Unbreak!→High Changing from Unbreak to High because it was resolved this morning.
[08:17:17] <kart_>	 Anything particular with mw1418? Got this: "8:11:41 Check 'Logstash Error rate for mw1418.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.02, After: 2.00, Threshold: 1.00)"
[08:17:50] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[08:18:54] <zabe>	 mw1418 is a canary, but nothing else is special with it
[08:19:49] <kart_>	 zabe: yeah. 
[08:20:55] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:875192|ContentTranslation: Increase MT threshold for publishing in cswiki by 20% (T324721)]] (duration: 17m 21s)
[08:20:59] <stashbot>	 T324721: Modify Machine Translation in Czech Wikipedia by 20% or more to publish a translation  - https://phabricator.wikimedia.org/T324721
[08:22:26] <kart_>	 Moving to the next patch.
[08:22:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[08:22:58] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (SLyngshede-WMF) Open→Resolved
[08:26:35] <wikibugs>	 (PS1) Muehlenhoff: Fix up package list after ldapsupportlib removal [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063)
[08:27:02] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Jelto)
[08:36:52] <wikibugs>	 (Merged) jenkins-bot: CX: Fix usage of categories translation unit as array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/877138 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[08:37:07] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]]
[08:37:15] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:38:56] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:48:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: Jelto)
[08:49:15] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877138|CX: Fix usage of categories translation unit as array (T326278)]] (duration: 12m 08s)
[08:49:18] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[08:49:20] <wikibugs>	 (PS1) Jelto: aptrepo: update gitlab-ce & gitlab-runner to 15.5 [puppet] - https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616)
[08:50:12] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:50:27] <wikibugs>	 (CR) Jelto: [C: +2] admin: add zabe to deployment group [puppet] - https://gerrit.wikimedia.org/r/877102 (https://phabricator.wikimedia.org/T326327) (owner: Jelto)
[08:51:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[08:51:41] <kart_>	 (It seems backport deployment will stretch a bit or maybe be a byte!)
[08:52:45] <wikibugs>	 (PS1) KartikMistry: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/877223 (https://phabricator.wikimedia.org/T326278)
[08:53:16] <wikibugs>	 (CR) Slyngshede: role:IDM assign IDM role to test VM. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[08:54:03] <godog>	 !log upgrade thanos to 0.30.1 on prometheus2006 - T303154
[08:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:09] <stashbot>	 T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154
[08:56:54] <godog>	 !log upgrade thanos to 0.30.1 on thanos-fe1001 - T303154
[08:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:01] <wikibugs>	 SRE, Observability-Alerting, WMF-Legal, WikimediaMessages, observability: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (jcrespo) @Slaporte - I am thinking of modifying the script to check that the i...
[08:58:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[09:03:30] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Jelto) Open→Resolved a:Jelto @Zabe you should have access to `deployment` group now. Happy to have you on board!  I'm closing this task. Feel free to re-open i...
[09:04:29] <wikibugs>	 (PS2) Jelto: sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569)
[09:05:15] <wikibugs>	 (Merged) jenkins-bot: CX: Fix transformation of TranslationUnitDTO to custom array [extensions/ContentTranslation] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/877219 (https://phabricator.wikimedia.org/T326278) (owner: KartikMistry)
[09:05:32] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]]
[09:05:35] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[09:06:38] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:06:38] <wikibugs>	 (CR) Jelto: [C: +2] sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:07:17] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[09:08:19] <wikibugs>	 (Merged) jenkins-bot: sre.gitlab.upgrade: pass remote hosts down to alerting_hosts [cookbooks] - https://gerrit.wikimedia.org/r/877191 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[09:08:48] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:10:20] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:11:15] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616) (owner: Jelto)
[09:13:39] <wikibugs>	 (PS1) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959
[09:13:41] <wikibugs>	 (CR) Muehlenhoff: "Looks good, but you also need to update the Admin::UID Variant for the new names." [puppet] - https://gerrit.wikimedia.org/r/877274 (owner: Dzahn)
[09:14:26] <wikibugs>	 (PS2) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959
[09:14:52] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:877219|CX: Fix transformation of TranslationUnitDTO to custom array (T326278)]] (duration: 09m 20s)
[09:14:53] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) Same issue with `rcp: /var/run/./vjunos-install.sh: Read-only file system` and then `mount: /dev/ad0s1a : Resource temporarily unavailable`, which...
[09:14:55] <stashbot>	 T326278: Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array - https://phabricator.wikimedia.org/T326278
[09:15:08] <wikibugs>	 (CR) Muehlenhoff: "Better set these in hieradata/role/common/idm.yaml, then they apply to all future test hosts as well." [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede)
[09:15:38] <wikibugs>	 (PS1) Ayounsi: Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/877224 (https://phabricator.wikimedia.org/T316532)
[09:15:40] <kart_>	 !log Done: UTC morning backport window
[09:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:05] <wikibugs>	 (CR) Jelto: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 15.5 [puppet] - https://gerrit.wikimedia.org/r/877958 (https://phabricator.wikimedia.org/T326616) (owner: Jelto)
[09:17:07] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[09:17:07] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:17:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2032.codfw.wmnet
[09:18:20] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) Note that removing ` [edit system] -   internet-options { -       tcp-drop-synfin-set; -       no-tcp-reset drop-all-tcp; -   } ` Is needed otherwi...
[09:18:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet
[09:19:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2033.codfw.wmnet
[09:22:02] <wikibugs>	 (PS3) Slyngshede: idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959
[09:22:23] <taavi>	 !log added zabe to wmf-deployment gerrit group T326327
[09:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:57] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet
[09:24:26] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Depool ulsfo for network maintenance" [dns] - https://gerrit.wikimedia.org/r/877224 (https://phabricator.wikimedia.org/T316532) (owner: Ayounsi)
[09:25:46] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff)
[09:25:48] <XioNoX>	 !log repool ulsfo (maintenance cancelled) - T316532
[09:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:51] <stashbot>	 T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532
[09:33:04] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:18] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9568478]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@9568478]
[09:34:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[09:34:30] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9568478]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@9568478] (duration: 00m 11s)
[09:34:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede)
[09:34:55] <wikibugs>	 (CR) Slyngshede: [V: +1] idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede)
[09:35:01] <wikibugs>	 (CR) Slyngshede: [V: +2] idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede)
[09:35:10] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] idm-test: Stub out secrets for PCC [labs/private] - https://gerrit.wikimedia.org/r/877959 (owner: Slyngshede)
[09:42:46] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:43:00] <godog>	 !log upgrade thanos to 0.30.1 on thanos-fe100[2-3] - T303154
[09:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:04] <stashbot>	 T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154
[09:45:24] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9568478]: Fix bug fix in HDFS usage pipeline [airflow-dags@9568478]
[09:45:38] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9568478]: Fix bug fix in HDFS usage pipeline [airflow-dags@9568478] (duration: 00m 13s)
[09:46:44] <wikibugs>	 (PS8) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[09:47:41] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39027/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[09:49:43] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - https://gerrit.wikimedia.org/r/877207 (https://phabricator.wikimedia.org/T326119) (owner: Clément Goubert)
[09:52:23] <wikibugs>	 (CR) Clément Goubert: [C: +2] Revert "dsh: Remove parse1002 from parsoid dsh group" [puppet] - https://gerrit.wikimedia.org/r/877207 (https://phabricator.wikimedia.org/T326119) (owner: Clément Goubert)
[09:52:27] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (ayounsi) The pre-upgrade went fine on asw1-eqsin, so I guess the ulsfo issue is a corrupted storage.  The last step for eqsin is a reboot, so I'll maintain...
[09:52:58] <wikibugs>	 (PS1) JMeybohm: PKI: Default expiry of 3 days for wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/877961
[09:53:26] <wikibugs>	 (PS9) Slyngshede: role:IDM assign IDM role to test VM. [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795)
[09:53:53] <moritzm>	 !log installing systemd bugfix updates from Bullseye point release
[09:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:58] <claime>	 slyngs: There's an idm.yaml change pending on puppetmaster, should I merge it or do I leave it for you?
[09:54:17] <slyngs>	 If you're there them please just merge
[09:54:24] <slyngs>	 there
[09:54:27] <claime>	 slyngs: all done
[09:54:28] <slyngs>	 Aarg
[09:54:30] <slyngs>	 Thanks
[09:54:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Fix up package list after ldapsupportlib removal [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff)
[09:54:42] <wikibugs>	 (PS4) Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903
[09:54:44] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908
[09:55:19] <godog>	 !log upgrade thanos to 0.30.1 on prometheus hosts - T303154
[09:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:23] <stashbot>	 T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154
[09:56:09] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39028/console" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[09:56:21] <wikibugs>	 SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (RhinosF1) @mutante: adding as IC, can you please let people know when the incident report from last night is ready?  @multichill: I’ve ad...
[09:57:46] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:57:58] <jelto>	 ^ expected, gitlab replica
[09:57:58] <wikibugs>	 SRE, Traffic-Icebox: varnish crash upon reload after libvmod-netmapper upgrade due to liburcu6 assertion - https://phabricator.wikimedia.org/T266651 (Vgutierrez) Open→Resolved a:Vgutierrez
[09:59:34] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1002.eqiad.wmnet
[10:02:00] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[10:02:46] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:03:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix up package list after ldapsupportlib removal [puppet] - https://gerrit.wikimedia.org/r/877957 (https://phabricator.wikimedia.org/T114063) (owner: Muehlenhoff)
[10:06:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[10:06:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:06:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2033.codfw.wmnet
[10:07:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet
[10:09:42] <wikibugs>	 (PS1) JMeybohm: k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943)
[10:10:50] <icinga-wm>	 PROBLEM - Memcached on mc2034 is CRITICAL: connect to address 10.192.48.78 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[10:11:36] <wikibugs>	 (PS1) Majavah: ldap: move ssh-key-ldap-lookup directly to ssh module [puppet] - https://gerrit.wikimedia.org/r/877964
[10:12:28] <wikibugs>	 (PS1) Muehlenhoff: profile::openldap::client: Stop including ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/877965
[10:13:14] <icinga-wm>	 RECOVERY - Memcached on mc2034 is OK: TCP OK - 0.033 second response time on 10.192.48.78 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[10:13:43] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39031/console" [puppet] - https://gerrit.wikimedia.org/r/877964 (owner: Majavah)
[10:13:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet
[10:13:44] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet
[10:13:51] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39032/console" [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[10:14:35] <claime>	 !log repooled parse1002.eqiad.wmnet - T326119
[10:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:37] <stashbot>	 T326119: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119
[10:14:58] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet
[10:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:18:02] <wikibugs>	 (CR) Btullis: [C: +2] Detect the correct disks for the O/S on the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[10:18:20] <wikibugs>	 (CR) Btullis: [C: +2] Detect the correct disks for the O/S on the cephosd servers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[10:18:54] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1002.eqiad.wmnet with OS bullseye
[10:19:18] <claime>	 jouncebot: nowandnext
[10:19:18] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 40 minute(s)
[10:19:19] <jouncebot>	 In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100)
[10:21:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[10:21:45] <claime>	 !log Starting rolling reboot of eqiad jobrunners
[10:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes
[10:24:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye
[10:25:42] <wikibugs>	 (CR) Muehlenhoff: "I was thinking of rather just moving the content of ldap::client::utils to profile::base::labs (and then axing the "utils" check from ldap" [puppet] - https://gerrit.wikimedia.org/r/877964 (owner: Majavah)
[10:28:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[10:29:44] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:52] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-ask-password-console.path,systemd-ask-password-wall.path https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:56] <godog>	 !log upgrade thanos to 0.30.1 on thanos-fe2* - T303154
[10:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:59] <stashbot>	 T303154: Upgrade Thanos to latest version - https://phabricator.wikimedia.org/T303154
[10:33:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[10:42:44] <wikibugs>	 (PS1) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340)
[10:43:05] <wikibugs>	 (CR) CI reject: [V: -1] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[10:44:36] <wikibugs>	 (PS2) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340)
[10:45:26] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[10:45:45] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete
[10:46:01] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39033/console" [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[10:48:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903 (owner: Giuseppe Lavagetto)
[10:52:50] <wikibugs>	 (Merged) jenkins-bot: Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903 (owner: Giuseppe Lavagetto)
[10:54:43] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/877122 (https://phabricator.wikimedia.org/T320795) (owner: Slyngshede)
[10:56:46] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39037/console" [puppet] - https://gerrit.wikimedia.org/r/877961 (owner: JMeybohm)
[10:59:01] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365 (owner: Jbond)
[10:59:15] <wikibugs>	 (PS16) Jbond: systemd: ensure we only create the override directory once [puppet] - https://gerrit.wikimedia.org/r/875365
[10:59:43] <wikibugs>	 (PS4) Giuseppe Lavagetto: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908
[10:59:52] <icinga-wm>	 PROBLEM - Host an-worker1080 is DOWN: PING CRITICAL - Packet loss = 100%
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100)
[11:00:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2034.codfw.wmnet
[11:00:55] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet
[11:02:50] <wikibugs>	 (PS14) Jbond: httpd: add flag to wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[11:02:59] <wikibugs>	 (PS6) Jbond: gerrit: make Apache wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[11:04:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080 and an-worker1084 - https://phabricator.wikimedia.org/T326127 (jcrespo) an-worker1080 downtime alerting expired. No issue on our side, just a friendly ping in case you want to extend it.
[11:04:04] <wikibugs>	 (CR) Jbond: [C: +2] httpd: add flag to wait for network-online.target [puppet] - https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: Hashar)
[11:05:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:06:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] profile::openldap::client: Stop including ldap::client::utils [puppet] - https://gerrit.wikimedia.org/r/877965 (owner: Muehlenhoff)
[11:06:47] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet
[11:08:42] <wikibugs>	 (CR) JMeybohm: [C: +2] PKI: Default expiry of 3 days for wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/877961 (owner: JMeybohm)
[11:08:45] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s: Remove default kubelet_cluster_domain definitions [puppet] - https://gerrit.wikimedia.org/r/877963 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:10:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: JMeybohm)
[11:12:02] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:12:39] <wikibugs>	 (PS3) JMeybohm: k8s: Update staging-codfw to kubernetes 1.23 [puppet] - https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340)
[11:13:30] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops-collab: Upgrade etherpad.wikimedia.org to 1.8.18 - https://phabricator.wikimedia.org/T316421 (LSobanski)
[11:16:17] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops-collab: Upgrade etherpad.wikimedia.org to 1.8.18 - https://phabricator.wikimedia.org/T316421 (LSobanski) I updated the description to focus this task on Etherpad version upgrade as suggested previously. Please create tasks for specific plugins so that they can be e...
[11:29:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:32:11] <wikibugs>	 (PS1) Muehlenhoff: Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994
[11:33:17] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994 (owner: Muehlenhoff)
[11:33:39] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes
[11:35:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Still support Stretch for Python LDAP includes [puppet] - https://gerrit.wikimedia.org/r/877994 (owner: Muehlenhoff)
[11:35:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes
[11:39:48] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:39:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:44:27] <volans>	 effie: the uncommitted DNS changes seem related to your decom of mc2034
[11:44:41] <effie>	 yes, hang on
[11:44:52] <effie>	 sorry, my bad
[11:45:10] <volans>	 k, no prob
[11:46:14] <_joe_>	 jouncebot: nowandnext
[11:46:14] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1100)
[11:46:14] <jouncebot>	 In 2 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[11:46:14] <jouncebot>	 In 2 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[11:46:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 (owner: Giuseppe Lavagetto)
[11:46:50] <_joe_>	 I should be able to finish my changes before the backport window
[11:47:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:48:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[11:51:03] <wikibugs>	 (Merged) jenkins-bot: mediawiki: use the mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/874908 (owner: Giuseppe Lavagetto)
[11:51:44] <_joe_>	 ok here we go
[11:52:37] <wikibugs>	 (PS2) Muehlenhoff: Rename alias [puppet] - https://gerrit.wikimedia.org/r/875971
[11:52:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:53:11] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:54:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Rename alias [puppet] - https://gerrit.wikimedia.org/r/875971 (owner: Muehlenhoff)
[11:55:10] <_joe_>	 oof, sigh
[11:55:41] <wikibugs>	 (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/869777 (owner: 10Muehlenhoff)
[11:56:37] <wikibugs>	 (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff)
[11:56:51] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:56:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:57:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:57:59] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:58:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:59:29] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[12:01:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:02:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[12:04:07] <wikibugs>	 (03CR) 10Jbond: phabricator: change phd home dir to /var/lib/phd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[12:04:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add SPDX headers to various base/IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:05:07] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[12:05:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add SPDX headers for various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/860912 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:06:22] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[12:06:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[12:07:43] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:07:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:11:43] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:12:07] <claime>	 !log Finished rolling reboot of eqiad jobrunners 
[12:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:17:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Decom puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783)
[12:17:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Decom puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[12:17:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:17:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:18:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[12:18:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:18:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2034.codfw.wmnet
[12:19:02] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet
[12:22:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:22:49] <wikibugs>	 10SRE, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10LSobanski) @Reedy @MoritzMuehlenhoff is there anything else left to do here or can the task be resolved?
[12:22:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:25:32] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet
[12:25:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, optional nit to make more dry" [puppet] - 10https://gerrit.wikimedia.org/r/877120 (https://phabricator.wikimedia.org/T326325) (owner: 10Muehlenhoff)
[12:27:54] <wikibugs>	 10SRE, 10ConfirmEdit (CAPTCHA extension), 10Python3-Porting: captcha.py needs to be ported to Python 3 - https://phabricator.wikimedia.org/T268468 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is done, the ConfirmEdit extension as deployed in production uses Python 3 and then Puppe...
[12:27:59] <wikibugs>	 (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[12:28:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:30:08] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::mcrouter::yaml_defs: adapt to new values structure [puppet] - 10https://gerrit.wikimedia.org/r/878004
[12:30:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Decom puppetdb-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/878001 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[12:31:05] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[12:31:05] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[12:31:28] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[12:31:28] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[12:32:13] <wikibugs>	 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10LSobanski) The task's original intent was to cover planning "over the next 3 years" starting in 2019. @ArielGlenn is the task still relevant, can...
[12:32:41] <wikibugs>	 (03PS1) 10Btullis: Correct the units for the cephosd volumes [puppet] - 10https://gerrit.wikimedia.org/r/878005 (https://phabricator.wikimedia.org/T324670)
[12:33:14] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:33:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39038/console" [puppet] - 10https://gerrit.wikimedia.org/r/878004 (owner: 10Giuseppe Lavagetto)
[12:34:41] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd1002.eqiad.wmnet with OS bullseye
[12:34:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Correct the units for the cephosd volumes [puppet] - 10https://gerrit.wikimedia.org/r/878005 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis)
[12:35:00] <wikibugs>	 10SRE: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving based on the most recent comment. Please reopen if appropriate.
[12:36:03] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye
[12:36:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::mcrouter::yaml_defs: adapt to new values structure [puppet] - 10https://gerrit.wikimedia.org/r/878004 (owner: 10Giuseppe Lavagetto)
[12:39:50] <wikibugs>	 10SRE, 10serviceops, 10User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 (10LSobanski)
[12:40:43] <wikibugs>	 (03PS2) 10Jbond: admin: split system user data type into local and global [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn)
[12:41:13] <wikibugs>	 (03CR) 10Jbond: admin: split system user data type into local and global (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn)
[12:41:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn)
[12:44:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (10LSobanski)
[12:45:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn)
[12:46:12] <wikibugs>	 10SRE, 10User-MoritzMuehlenhoff, 10User-jbond: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving based on the recent comments, follow up work should be happening in T261196.
[12:47:07] <claime>	 jouncebot: nowandnext
[12:47:07] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[12:47:07] <jouncebot>	 In 1 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[12:47:07] <jouncebot>	 In 1 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[12:47:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes
[12:49:57] <claime>	 !log Starting rolling reboot of eqiad appservers
[12:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetdb-test2001.codfw.wmnet
[12:50:09] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[12:50:11] <wikibugs>	 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10LSobanski) @hashar As the original requester (T307349#7895775), could you help clarify what's needed here?
[12:50:14] <logmsgbot>	 !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
[12:50:28] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[12:53:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:56:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:56:45] <wikibugs>	 (03PS1) 10Btullis: Reduce the size of the partitions on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/878007 (https://phabricator.wikimedia.org/T324670)
[12:58:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Reduce the size of the partitions on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/878007 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis)
[12:59:27] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cephosd1002.eqiad.wmnet with OS bullseye
[12:59:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye
[13:00:10] <wikibugs>	 (03CR) 10Jbond: Detect the correct disks for the O/S on the cephosd servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis)
[13:05:15] <wikibugs>	 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10ArielGlenn) >>! In T226093#8512308, @LSobanski wrote: > The task's original intent was to cover planning "over the next 3 years" starting in 2019...
[13:08:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb-test2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:08:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:08:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetdb-test2001.codfw.wmnet
[13:09:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetdb-test2001.codfw.wmnet` - puppetdb-test2001.codfw.wmnet...
[13:10:31] <wikibugs>	 (03PS1) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:10:46] <wikibugs>	 (03PS2) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:11:04] <icinga-wm>	 PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100%
[13:11:14] <icinga-wm>	 PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100%
[13:11:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:11:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:13:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff)
[13:14:02] <claime>	 The two mw hosts down is me, they don't seem to be coming back and the cookbook downtime expired
[13:16:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage
[13:18:37] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[13:19:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage
[13:19:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] profile::openldap::client: Stop including ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/877965 (owner: 10Muehlenhoff)
[13:19:54] <icinga-wm>	 RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:20:18] <icinga-wm>	 PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:20] <icinga-wm>	 RECOVERY - Check systemd state on mw1351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:46] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) So the above is my proposal for the check: rather than checking exact word...
[13:22:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2001.wikimedia.org
[13:22:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:18] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:24:40] <icinga-wm>	 RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 2.48 ms
[13:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:25:30] <icinga-wm>	 PROBLEM - Check systemd state on mw1352 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:40] <icinga-wm>	 RECOVERY - Check systemd state on mw1352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2001.wikimedia.org
[13:29:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:30:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387)
[13:31:32] <wikibugs>	 (03PS3) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:31:56] <wikibugs>	 (03PS4) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:32:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:32:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:34:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:35:56] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[13:36:18] <wikibugs>	 (03PS5) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:36:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet
[13:36:50] <wikibugs>	 (03PS6) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:36:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:37:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2035.codfw.wmnet
[13:37:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[13:40:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (6) High Kubernetes API latency (POST events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:43:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet
[13:43:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover irc CNAME to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/878017
[13:44:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover irc CNAME to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/878017 (owner: 10Muehlenhoff)
[13:44:57] <godog>	 !log delete grafana dashboards from "sre dashboards for deletion" folder - T178690
[13:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:00] <stashbot>	 T178690: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690
[13:45:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:46:07] <wikibugs>	 (03PS7) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[13:46:46] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[13:46:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:46:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bullseye
[13:49:46] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2035.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[14:00:05] <jouncebot>	 eigyan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1400)
[14:00:06] <eigyan>	 Greetings All 0/
[14:00:31] <taavi>	 o/
[14:00:39] <claime>	 For deployers' information, I am rebooting appservers in eqiad, which may cause some scap failures
[14:00:48] <zabe>	 heya heya heya, I would like to try stuff out
[14:00:55] <claime>	 Just ping me with the machine failing and I'll tell you if it's me or not
[14:01:02] <claime>	 I'm not touching mwdebug
[14:01:13] <taavi>	 claime: will you ensure those will then be updated with the latest config afterwards?
[14:01:24] <claime>	 taavi: if you get a failure yes
[14:01:29] <claime>	 if not, they scap pull at boot
[14:02:03] <taavi>	 zabe: sure! I'm around too and happy to help if you have any problems
[14:02:24] <zabe>	 nice, thanks
[14:02:27] <taavi>	 meanwhile.. eigyan: why are fawiki and enwiki added with the + syntax while the others aren't?
[14:02:36] <Lucas_WMDE>	 I’m in a meeting at the moment, I could do some backports later if someone pings me to remind me (after :20 or so)
[14:03:37] <urbanecm>	 zabe: I'm around for the first 45 minutes, in case you need me.
[14:03:43] <eigyan>	 taavi +syntax ensures a merge with beta-config will not overwrite its values
[14:03:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2035.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[14:03:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:03:52] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2035.codfw.wmnet
[14:04:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: "There's a merge conflict now though LGTM overall (not tested yet)" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[14:04:24] <urbanecm>	 zabe: TLDR it's ssh to deploy1002 and run `scap backport 877268` these days, it'll practically guide you itself :)
[14:04:59] <zabe>	 ok :)
[14:06:08] <wikibugs>	 (03PS2) 10Zabe: [config]: GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan)
[14:06:55] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize staging-codfw with k8s 1.23
[14:06:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan)
[14:06:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage1001.eqiad.wmnet
[14:07:04] <icinga-wm>	 PROBLEM - Memcached on mc2036 is CRITICAL: connect to address 10.192.48.80 and port 11214: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:07:05] <_joe_>	 claime: uhm actually that has been removed in some refactor (scap pull on reboot), sigh
[14:07:12] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 6 hosts with reason: Reinitialize staging-codfw with k8s 1.23
[14:07:13] <claime>	 _joe_: *sigh*
[14:07:15] <claime>	 ok
[14:07:17] <claime>	 I'll stop it then
[14:07:28] <_joe_>	 I mean, it's not the end of the world
[14:07:28] <claime>	 Gimme 5 minutes to clean up
[14:07:41] <wikibugs>	 (03Merged) 10jenkins-bot: [config]: GDI Safety Survey Wave 4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/877268 (https://phabricator.wikimedia.org/T325136) (owner: 10Eigyan)
[14:07:49] <_joe_>	 we will just need to run a scap full sync at the end I guess
[14:07:55] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]]
[14:07:57] <stashbot>	 T325136: Deploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - week of January 9, 2023 - https://phabricator.wikimedia.org/T325136
[14:08:08] <claime>	 There's quite a few needing a racadm kick to the butt to actually reboot :/
[14:08:08] <urbanecm>	 _joe_: scap backport runs full sync every time these days, so that should not be an issue
[14:08:26] <_joe_>	 urbanecm: right, but if some machines fail on the last scap
[14:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:09:06] <urbanecm>	 ah, then yes, you're right.
[14:09:47] <taavi>	 that should be solvable just by not starting the full sync before c.laime is done with pausing the script?
[14:09:47] <logmsgbot>	 !log zabe@deploy1002 zabe and essexigyan: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:10:00] <zabe>	 yeah, I will wait
[14:10:13] <logmsgbot>	 !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97)
[14:10:24] <zabe>	 eigyan, you can test in the meantime :)
[14:10:42] <eigyan>	 will do zabe
[14:11:14] <wikibugs>	 (03PS8) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[14:11:37] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2036.codfw.wmnet
[14:11:45] <claime>	 mw1371/mw1372 need a hard reset
[14:11:52] <claime>	 That's a few in a row :/
[14:11:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[14:12:12] <eigyan>	 zabe, all surveys are displayed as expected, woo hoo!
[14:12:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:13:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:14:17] <zabe>	 claime, Can you let me know when I can sync?
[14:14:20] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apifeatureusage1001.eqiad.wmnet
[14:14:48] <claime>	 zabe: yep, just a few minutes to perform CPR on that host, I'll give you the go ahead
[14:14:51] <wikibugs>	 (03PS13) 10Vgutierrez: varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676)
[14:14:55] <zabe>	 ok, thanks
[14:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:16:30] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39040/console" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[14:17:22] <wikibugs>	 (03PS9) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[14:18:14] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw[1369-1372].eqiad.wmnet
[14:18:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1369-1372].eqiad.wmnet
[14:18:53] <claime>	 zabe: you can go ahead
[14:18:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:19:04] <zabe>	 thanks, syncing
[14:19:27] <claime>	 !log Pausing reboots of eqiad appservers for deployments
[14:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:44] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) This is an example of its execution in verbose mode, you can see it is abl...
[14:21:16] <wikibugs>	 (03CR) 10Jcrespo: "This is an example of its execution in verbose mode, you can see it is able to find and crawl the referred texts (only the last line would" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[14:21:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage2001.codfw.wmnet
[14:22:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job k8s-api in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:23:42] <wikibugs>	 (03CR) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[14:23:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:25:37] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:877268|[config]: GDI Safety Survey Wave 4 (T325136)]] (duration: 17m 42s)
[14:25:47] <stashbot>	 T325136: Deploy GDI Safety Survey Wave 4 on EN, ES, FR, FA, PT wikis - week of January 9, 2023 - https://phabricator.wikimedia.org/T325136
[14:26:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[14:26:20] <zabe>	 eigyan, should be live :)
[14:26:47] <eigyan>	 Excellent, thank you zabe
[14:27:09] <eigyan>	 thanks to all who made this deployment possible :)
[14:27:31] <wikibugs>	 (03PS2) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783)
[14:27:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:28:22] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye
[14:28:36] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye
[14:28:36] <wikibugs>	 (03CR) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[14:28:36] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2036.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[14:28:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:29:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:32:01] <wikibugs>	 (03PS1) 10Zabe: Start reading from cul_actor on remaining test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004)
[14:33:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2036.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[14:33:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:33:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2036.codfw.wmnet
[14:33:35] <wikibugs>	 (03PS10) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[14:34:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:34:29] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host apifeatureusage2001.codfw.wmnet
[14:34:57] <wikibugs>	 (03PS2) 10Zabe: Start reading from cul_actor on remaining test wikis and group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004)
[14:35:37] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc2037.codfw.wmnet
[14:35:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet
[14:35:53] <wikibugs>	 (03PS11) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169)
[14:36:37] <wikibugs>	 (03CR) 10Jcrespo: check_legal_terms: Refactor check to make it more robust against changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[14:36:53] <zabe>	 !log run populateCulActor on group0 wikis # T325484
[14:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:56] <stashbot>	 T325484: Run PopulateCulActor on all wikis - https://phabricator.wikimedia.org/T325484
[14:37:10] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "I don’t understand the phpstan error in CI…" [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE))
[14:37:34] <wikibugs>	 10SRE, 10Traffic: Package and deploy varnish 6.0.11 - https://phabricator.wikimedia.org/T326634 (10ssingh)
[14:37:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[14:38:24] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from cul_actor on remaining test wikis and group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878021 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[14:38:37] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]]
[14:38:43] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[14:39:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:40:24] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:44:46] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/878014 (https://phabricator.wikimedia.org/T325387) (owner: 10Muehlenhoff)
[14:46:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.dns.netbox
[14:46:59] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @ayounsi @cmooney I have 2 questions  1- I have a total of 17 switches received so 1 is going to be used as the cloudsw in r...
[14:47:36] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878021|Start reading from cul_actor on remaining test wikis and group0 wikis (T233004)]] (duration: 08m 59s)
[14:47:39] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[14:47:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:21] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[14:48:45] <wikibugs>	 (03PS1) 10JMeybohm: install_server: Update kubestagetcd2* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878047 (https://phabricator.wikimedia.org/T326340)
[14:49:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:49:14] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader1001.eqiad.wmnet
[14:49:38] <zabe>	 !log UTC afternoon deploys done
[14:49:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:43] <wikibugs>	 (03PS1) 10Ssingh: Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634)
[14:50:16] <zabe>	 claime, you can continue with your reboots if you like
[14:50:35] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) We have now the logs in kafka, and thus should also be ingested in logstash, and create a dashboard.  Once that's done, we should reduce also the retention time of...
[14:51:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, though see inline. I got an AttributeError when testing" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[14:51:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] install_server: Update kubestagetcd2* to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/878047 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm)
[14:52:46] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:53:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1001.eqiad.wmnet
[14:53:35] <wikibugs>	 10SRE, 10Thumbor, 10serviceops: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (10LSobanski)
[14:54:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:55:29] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye
[14:55:38] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye
[14:55:58] <icinga-wm>	 PROBLEM - Host mc2050 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:02] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Ottomata) If we did {T291645} and {T276972}, these logs could be mirrored to Kafka jumbo and available in Hive and Turnilo too.
[14:56:48] <XioNoX>	 !log start VC link maintenance in eqiad - T325803
[14:56:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:55] <stashbot>	 T325803: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803
[15:01:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mc2037.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1001"
[15:01:23] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:01:23] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc2037.codfw.wmnet
[15:02:21] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2001.codfw.wmnet with OS bullseye
[15:02:55] <wikibugs>	 (03PS1) 10Jbond: docker::baseimages: inject no_proxy config to rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316)
[15:04:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39041/console" [puppet] - 10https://gerrit.wikimedia.org/r/878063 (https://phabricator.wikimedia.org/T326316) (owner: 10Jbond)
[15:04:44] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) >>! In T265876#8512693, @Ottomata wrote: > If we did {T291645} and {T276972}, these logs could be mirrored to Kafka jumbo and available in Hive and Turnilo too.  Wh...
[15:04:52] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:04:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10Jclark-ctr) asw2-b-eqiad: fpc1:1/1   Cleaned fiber and replaced optic
[15:05:36] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi) 1/ 1 ToR per rack = 8x2 + 1 spare = 17, so indeed 1 dedicated to WMCS  2/ A1 and B1 would make sense, and would match eqiad...
[15:06:03] <icinga-wm>	 RECOVERY - Host mc2050 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms
[15:09:28] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet
[15:11:21] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:11:24] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2001.codfw.wmnet with reason: host reimage
[15:13:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 6.0.11-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh)
[15:14:31] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2001.codfw.wmnet with reason: host reimage
[15:16:17] <wikibugs>	 (03Merged) 10jenkins-bot: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:17:05] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr let me know if you need me to depool this host (db1107). It can easily be done, it is just a replica.
[15:17:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host search-loader2001.codfw.wmnet
[15:18:56] <wikibugs>	 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Marostegui) I think even if they grow a lot, with the new set of servers, we still have 6.6TB free (76% free disk space)...I'd be surprised if we...
[15:21:16] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2001.codfw.wmnet
[15:22:27] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/877973
[15:22:52] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1143: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/877973 (owner: 10Marostegui)
[15:23:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: After the incident', diff saved to https://phabricator.wikimedia.org/P42944 and previous config saved to /var/cache/conftool/dbconfig/20230110-152336-root.json
[15:25:32] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:25:36] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:27:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:28:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet
[15:29:54] <claime>	 !log Restarting rolling reboots of eqiad appservers 
[15:29:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:02] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[15:30:59] <wikibugs>	 (03CR) 10Ssingh: "We have a successful build on build2001, this is just the regular test failing." [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/878049 (https://phabricator.wikimedia.org/T326634) (owner: 10Ssingh)
[15:32:08] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar)
[15:35:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet
[15:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: After the incident', diff saved to https://phabricator.wikimedia.org/P42945 and previous config saved to /var/cache/conftool/dbconfig/20230110-153841-root.json
[15:43:03] <wikibugs>	 (03PS3) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783)
[15:44:01] <wikibugs>	 (03PS4) 10Jbond: monitoring: convert prometheus-puppet-agent-stats to pathlib [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783)
[15:44:06] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[15:45:40] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[15:48:50] <wikibugs>	 (03PS1) 10Ottomata: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576)
[15:49:47] <wikibugs>	 (03PS2) 10Ottomata: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576)
[15:49:49] <wikibugs>	 (03PS1) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/878117
[15:50:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:50:49] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:51:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:52:00] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) >>! In T317169#8508073, @jcrespo wrote: > Thank you, while I understand why...
[15:52:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2002.codfw.wmnet with OS bullseye
[15:53:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: After the incident', diff saved to https://phabricator.wikimedia.org/P42946 and previous config saved to /var/cache/conftool/dbconfig/20230110-155346-root.json
[15:54:40] <wikibugs>	 (03PS2) 10BCornwall: prometheus: Add Varnish thread percent usage rule [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723)
[15:55:20] <wikibugs>	 (03CR) 10BCornwall: prometheus: Add Varnish thread percent usage rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall)
[15:55:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for mw[1373,1384-1385,1387].eqiad.wmnet
[15:55:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw[1373,1384-1385,1387].eqiad.wmnet
[15:55:31] <wikibugs>	 (03Merged) 10jenkins-bot: flink-kubernetes-operator - use chart version as wmf version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878116 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:56:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul)
[15:56:45] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) >>! In T317169#8512445, @jcrespo wrote: > I would like first a technical rev...
[15:57:07] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10Dzahn) a:05Dzahn→03None
[15:57:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:57:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:58:29] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:58:52] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:58:56] <wikibugs>	 (03PS1) 10Volans: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878124
[15:59:03] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:59:17] <SandraEbele>	 !log reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10.
[15:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:42] <wikibugs>	 (03Abandoned) 10Volans: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878124 (owner: 10Volans)
[16:00:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall)
[16:00:55] <wikibugs>	 (03Abandoned) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/878117 (owner: 10Ayounsi)
[16:01:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2002.codfw.wmnet with reason: host reimage
[16:01:34] <wikibugs>	 (03PS1) 10Ottomata: admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576)
[16:02:24] <wikibugs>	 (03PS1) 10Ayounsi: Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878127
[16:03:40] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:03:50] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[16:03:55] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] admin_ng/flink-operator - crds release depends on kube-system/namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/878126 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:04:14] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[16:04:23] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2002.codfw.wmnet with reason: host reimage
[16:04:48] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul)
[16:04:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878127 (owner: 10Ayounsi)
[16:05:08] <wikibugs>	 (03PS1) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 and connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128
[16:05:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_text conf compatibility with airflow2.3.4 and connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (owner: 10Stevemunene)
[16:06:38] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Upstream release v3.2.9 with WMF modifications [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/878127 (owner: 10Ayounsi)
[16:08:06] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagetcd2003.codfw.wmnet with OS bullseye
[16:08:25] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:08:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: After the incident', diff saved to https://phabricator.wikimedia.org/P42947 and previous config saved to /var/cache/conftool/dbconfig/20230110-160851-root.json
[16:08:57] <wikibugs>	 (03PS2) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[16:09:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:09:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:10:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Idea generally LGTM, pending Legal's requirement re: words/phrases we should be looking for" [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo)
[16:10:30] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:11:16] <wikibugs>	 (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[16:12:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:13:52] <wikibugs>	 (03CR) 10Herron: [C: 03+2] update role_contacts for thanos (front|back)end (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron)
[16:14:45] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2002.codfw.wmnet with OS bullseye
[16:15:27] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) Changed kafka topic retention time to 2 days instead of the default 7. ` cgoubert@kafka-logging1001:~$ kafka topic...
[16:18:46] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE))
[16:19:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: add kafka-logging200[45] to codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[16:20:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul)
[16:20:26] <wikibugs>	 (03PS3) 10Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[16:21:26] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagetcd2003.codfw.wmnet with reason: host reimage
[16:23:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagetcd2003.codfw.wmnet with reason: host reimage
[16:23:54] <wikibugs>	 (03PS4) 10MVernon: hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123)
[16:23:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: After the incident', diff saved to https://phabricator.wikimedia.org/P42948 and previous config saved to /var/cache/conftool/dbconfig/20230110-162356-root.json
[16:24:03] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2001.codfw.wmnet with OS bullseye
[16:25:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul)
[16:26:11] <wikibugs>	 (03CR) 10MVernon: hiera: move swift credentials into common (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[16:26:41] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39045/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:27:55] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972
[16:28:02] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[16:29:28] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[16:29:38] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:29:45] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/877257 (https://phabricator.wikimedia.org/T326419) (owner: 10Herron)
[16:32:58] <wikibugs>	 (03PS7) 10MVernon: swift: move accounts_keys to common hiera global_account_keys [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123)
[16:33:43] <wikibugs>	 (03CR) 10MVernon: "Changed the name of the hiera entry, to make the transition easier." [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[16:36:05] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagetcd2003.codfw.wmnet with OS bullseye
[16:36:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me. I agree that the updated name in hiera will make the transition easier." [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[16:36:21] <wikibugs>	 10SRE, 10Dumps-Generation, 10SDC General, 10Wikidata, 10wdwb-tech: Capacity planning for Commons Structured Data - https://phabricator.wikimedia.org/T226093 (10Ladsgroup) Commons is now the biggest section and by far. It used to be so much worse that wikidata dwarfed in comparison. The thing is: It has a...
[16:37:50] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:38:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to management network for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) @wiki_willy Jennifer needs access to the management network to be about to ssh into servers to access BIOS/IDRAC to troubleshoot and pull TSR report if needed.  Can...
[16:39:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: After the incident', diff saved to https://phabricator.wikimedia.org/P42949 and previous config saved to /var/cache/conftool/dbconfig/20230110-163901-root.json
[16:40:33] <wikibugs>	 (03CR) 10Btullis: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[16:41:58] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) p:05Triage→03Medium
[16:43:10] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Fix test constructing HTMLFormField without parent [extensions/WikibaseLexeme] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877983 (https://phabricator.wikimedia.org/T326621)
[16:43:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Hey @Jelto, I've been working with Scott Bassett on trying to gain access. Unfortunately, I am not able to login...
[16:43:30] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972
[16:44:08] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:44:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42950 and previous config saved to /var/cache/conftool/dbconfig/20230110-164447-ladsgroup.json
[16:45:26] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) >>! In T326649#8513203, @Papaul wrote: > @wiki_willy Jennifer needs access to the management network to be about to ssh into servers to access BIOS/IDRAC to troubleshoot...
[16:45:44] <icinga-wm>	 PROBLEM - Host bast5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:56] <icinga-wm>	 PROBLEM - Host doh5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:58] <icinga-wm>	 PROBLEM - Host prometheus5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:02] <icinga-wm>	 PROBLEM - Host ncredir5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:04] <icinga-wm>	 PROBLEM - Host netflow5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:07] <icinga-wm>	 PROBLEM - Host cr2-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:07] <icinga-wm>	 PROBLEM - Host durum5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:21] <icinga-wm>	 PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:24] <icinga-wm>	 PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:24] <icinga-wm>	 PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:26] <icinga-wm>	 PROBLEM - Host install5001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:26] <icinga-wm>	 PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:29] <Emperor>	 is this expected?
[16:46:32] <akosiaris>	 what on earth
[16:46:35] <chlod>	 surely not
[16:46:36] <vgutierrez>	 hmmm let's depool eqsin
[16:46:52] <chlod>	 getting "Error: 502, Broken pipe" on my end (Philippines)
[16:47:01] <akosiaris>	 yes, let's depool eqsin
[16:47:09] <RhinosF1>	 chlod: that’s the alerts
[16:47:21] <XioNoX>	 wow yeah
[16:47:26] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:47:32] <wikibugs>	 (03PS1) 10BBlack: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133
[16:47:39] <denisse>	 ^ Looking at it.
[16:47:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack)
[16:47:52] <icinga-wm>	 PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:52] <Emperor>	 if you need help, do shout
[16:47:54] <icinga-wm>	 PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:54] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack)
[16:47:57] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack)
[16:48:02] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/878133 (owner: 10BBlack)
[16:48:25] <bblack>	 !log depooling eqsin from DNS
[16:48:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:27] <XioNoX>	 looks like both transport links to eqsin are down
[16:49:11] <jynus>	 maintenance?
[16:49:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10Jelto) >>! In T323943#8513231, @KHurd-WMF wrote: > Hey @Jelto, I've been working with Scott Bassett on trying to gain acces...
[16:49:26] <denisse>	 XioNoX jynus Yes, I think it has to do with the depooling of eqsin
[16:49:41] <akosiaris>	 I'll update the status page
[16:49:56] <icinga-wm>	 PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:49:56] <jynus>	 oh, I thought it was depooled
[16:49:57] <XioNoX>	 planned maintenance on the only working link
[16:50:11] <XioNoX>	 see PWIC225900
[16:50:16] <icinga-wm>	 PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:50:22] <icinga-wm>	 PROBLEM - Host cr3-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:50:25] <Emperor>	 jynus: no, b.black has just depooled it as a response to all the p.ages
[16:50:28] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:50:33] <XioNoX>	 the other link has been on maintenance for a bit https://phabricator.wikimedia.org/T322529
[16:50:35] <Emperor>	 XioNoX: /o\
[16:50:40] <wikibugs>	 (03PS5) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972
[16:50:45] <jynus>	 well, that should do it
[16:51:01] <taavi>	 can someone put up a statuspage update?
[16:51:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10StephaneRebai)
[16:51:08] <akosiaris>	 taavi: already done
[16:51:14] <Emperor>	 taavi: a.kosiaris is on it
[16:51:19] <Emperor>	 (and types quicker than me, damnit)
[16:51:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) @Jelto thanks for the reply i have already her SSH-key and I will personally be adding her to the group once I have the approval from Willy.  Thanks
[16:51:34] <wikibugs>	 (03CR) 10JMeybohm: sre.ganeti.reimage: add new cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[16:52:00] <jynus>	 missing eqsin from this graph? https://grafana.wikimedia.org/d/000000093/cdn-frontend-network?orgId=1
[16:52:24] <wikibugs>	 (03PS10) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[16:52:30] <XioNoX>	 jynus: looks like it
[16:52:45] <XioNoX>	 but also you won't have data if the links are down
[16:52:56] <jynus>	 maybe that's why
[16:53:00] <vgutierrez>	 that's a bug of the dashboard... no data is being collected from eqsin
[16:53:06] <XioNoX>	 no you should have historical
[16:53:30] <vgutierrez>	 and the site variable gets filled dynamically 
[16:53:48] <jynus>	 vgutierrez: as in, a potential actionable, or something expected when no metrics are arriving?
[16:53:55] <wikibugs>	 (03PS11) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[16:54:05] <vgutierrez>	 jynus: actionable IMHO
[16:54:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: After the incident', diff saved to https://phabricator.wikimedia.org/P42951 and previous config saved to /var/cache/conftool/dbconfig/20230110-165406-root.json
[16:54:11] <jynus>	 vgutierrez: good
[16:54:30] <jynus>	 nel looks good after a spike
[16:54:49] <jynus>	 but we may be missing metrics
[16:54:54] <XioNoX>	 not for NEL
[16:55:03] <XioNoX>	 NEL sends to the next best DC
[16:55:24] <jynus>	 basically I am trying to see impact
[16:55:28] <icinga-wm>	 RECOVERY - Host netflow5002 is UP: PING OK - Packet loss = 0%, RTA = 247.25 ms
[16:55:28] <icinga-wm>	 RECOVERY - Host durum5002 is UP: PING OK - Packet loss = 0%, RTA = 238.90 ms
[16:55:29] <jynus>	 I know dns lags a bit
[16:55:30] <icinga-wm>	 RECOVERY - Host doh5001 is UP: PING OK - Packet loss = 0%, RTA = 244.81 ms
[16:55:30] <icinga-wm>	 RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 242.79 ms
[16:55:30] <icinga-wm>	 RECOVERY - Host install5001 is UP: PING OK - Packet loss = 0%, RTA = 232.47 ms
[16:55:30] <icinga-wm>	 RECOVERY - Host ncredir5001 is UP: PING OK - Packet loss = 0%, RTA = 233.62 ms
[16:55:30] <icinga-wm>	 RECOVERY - Host prometheus5001 is UP: PING OK - Packet loss = 0%, RTA = 250.70 ms
[16:55:32] <icinga-wm>	 RECOVERY - Host ncredir5002 is UP: PING OK - Packet loss = 0%, RTA = 231.49 ms
[16:55:33] <icinga-wm>	 RECOVERY - Host cr2-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 225.39 ms
[16:55:34] <icinga-wm>	 RECOVERY - Host cr3-eqsin #page is UP: PING OK - Packet loss = 0%, RTA = 245.89 ms
[16:55:34] <icinga-wm>	 RECOVERY - Host doh5002 is UP: PING OK - Packet loss = 0%, RTA = 253.59 ms
[16:55:34] <icinga-wm>	 RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.01 ms
[16:55:36] <icinga-wm>	 RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 231.29 ms
[16:55:46] <XioNoX>	 let's keep it depooled until the end of the maintenance at least
[16:55:50] <bblack>	 +1
[16:55:52] <akosiaris>	 +!
[16:55:54] <akosiaris>	 +1
[16:55:54] <icinga-wm>	 RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 245.15 ms
[16:55:55] <akosiaris>	 sigh...
[16:56:00] <icinga-wm>	 RECOVERY - Host cr3-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 243.03 ms
[16:56:02] <icinga-wm>	 RECOVERY - Host bast5002 is UP: PING OK - Packet loss = 0%, RTA = 254.35 ms
[16:56:08] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 251.84 ms
[16:56:14] <XioNoX>	 the other link should come back in 2 days, but its ETA has been pushed multiple times
[16:56:16] <icinga-wm>	 RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 237.02 ms
[16:57:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:57:08] <wikibugs>	 (03PS12) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[16:57:30] <akosiaris>	 ok, interesting question, should I just switch the incident in the status page to minor and degraded performance for reading?
[16:57:45] <akosiaris>	 I guess that reflects the current reality better?
[16:57:48] <XioNoX>	 akosiaris: if it's depooled yeah
[16:58:02] <XioNoX>	 degraded perf in the apac region
[16:58:08] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_eqsin_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:58:20] <bblack>	 or we could go all glass-half-full and say we have an incidental editing performance improvement for users in australia :)
[16:58:38] <wikibugs>	 (03CR) 10MVernon: "Updated to take review comments on board (thanks!) and changed name of the hiera item." [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[16:58:44] <bblack>	 (because they don't bounce backwards latency-wise through the cache site to reach the core)
[16:58:54] <wikibugs>	 (03PS1) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576)
[16:59:51] <mutante>	 XioNoX: akosiaris: I am here, anything left to help with? Sorry, I failed to see it in meeting
[16:59:53] <cdanis>	 akosiaris: I wouldn't even call eqsin being depooled an incident tbh
[16:59:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42952 and previous config saved to /var/cache/conftool/dbconfig/20230110-165952-ladsgroup.json
[17:00:03] <cdanis>	 (from a status page perspective)
[17:00:05] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700).
[17:00:05] <jouncebot>	 Krinkle: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:12] <akosiaris>	 cdanis: it's depooled because of the incident though
[17:00:15] <bblack>	 it was briefly user-impacting though, ~5-10 minutes or so
[17:00:23] <akosiaris>	 yeah, that ^
[17:00:26] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:00:34] <cdanis>	 eqsin going down unexpectedly is an incident
[17:00:49] <XioNoX>	 bblack: maybe with this new cable we could have australia and NZ sent to ulsfo - https://www.submarinecablemap.com/submarine-cable/southern-cross-next
[17:00:54] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[17:01:00] <jynus>	 I noted it on the list of incidents as "2023-01-10 eqsin network outage"
[17:01:04] <cdanis>	 but once we depool eqsin, that's resolved imo
[17:01:10] <wikibugs>	 (03PS2) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576)
[17:01:13] <jynus>	 but feel free to update the title if it is not great
[17:01:43] <XioNoX>	 +1 with cdanis, otherwise it would mean we had to create an incident each time we depool a site for maintenance
[17:01:48] <XioNoX>	 (to be consistent)
[17:01:56] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for StephaneRebai - https://phabricator.wikimedia.org/T326655 (10MarkTraceur) Approve as manager!
[17:02:02] <bblack>	 XioNoX: maybe, can re-measure and see
[17:02:10] <wikibugs>	 (03PS3) 10Ottomata: flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576)
[17:02:15] <jynus>	 what I think is nice is at least tracking incidents that page and are not false positives
[17:02:32] <jynus>	 on the sheet (or otherwise, when there is a better place)
[17:02:38] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 531 days) https://wikitech.wikimedia.org/wiki/Logs
[17:02:44] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:02:45] <cdanis>	 jynus: i'm mostly talking from a public status page perspective
[17:03:04] <jynus>	 sure
[17:03:22] <jinxer-wm>	 (ProbeDown) firing: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:03:28] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[17:03:35] <logmsgbot>	 !log ayounsi@deploy1002 deploy aborted: netbox-next to 3.2.9 (duration: 00m 07s)
[17:03:41] <jynus>	 that also works for tracking, which is what I am interested in
[17:03:49] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[17:04:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - networkpolicies for k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/878134 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[17:05:21] <denisse>	 When I access the 'jinxer-wm' link it only shows a coffee mug, does that mean that the incident is resolved?
[17:07:27] <herron>	 fwiw sites like cloudflare will log a status of "re-routed" which we could consider borrowing
[17:08:08] <cdanis>	 cloudflare's role and user base is quite different from ours IMO :)
[17:09:06] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:41] <jynus>	 denisse: please take no offense, but I think IRC messages and alertmanager/paging has degraded but not being useful in some contexts
[17:09:49] <jynus>	 by*
[17:09:54] <cdanis>	 denisse: did jinxer-wm (aka alertmanager) page for the eqsin outage?  I don't think it did, I think it was only icinga
[17:09:56] <jynus>	 that is normal, it is "new"
[17:10:18] <jynus>	 but I hope we can do better having more meaningful alert text and links
[17:10:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s: Update staging-codfw to kubernetes 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/877990 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm)
[17:10:32] <jinxer-wm>	 (ProbeDown) resolved: (6) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:10:46] <mutante>	 denisse: I sent the ACK via SMS just now to get it out of "active" state
[17:10:47] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:11:00] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:11:02] <cdanis>	 mutante: oh, did it not auto-resolve?
[17:11:40] <mutante>	 cdanis: the "packet loss 100% to cr2-eqsin" did not
[17:11:46] <jynus>	 mutante: yeah, but the links should be to something that doesn't disappear when clicked after (e.g. "state is now green/acked")
[17:11:47] <cdanis>	 🙃
[17:12:05] <jynus>	 again, this is nitpicking, please don't take my complaints too seriously
[17:12:27] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Add missing parentheses to vector search match text (031 comment) [extensions/Wikibase] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/877972 (owner: 10Lucas Werkmeister (WMDE))
[17:12:55] <jynus>	 I just generated a doc with ~200 complaints from several people so I am lately very nitpicky
[17:13:01] <denisse>	 jynus: on the contrary, it helps to see where we can improve. :)
[17:13:02] <mutante>	 I agree with jynus that the IRC alerting isn't working as it used to anymore
[17:13:39] <jynus>	 I mean, let's be real- it never was great, but new tooling is forcing us to work harder :-D
[17:13:48] <jynus>	 it just needs time
[17:14:21] <jynus>	 I think in some cases it is the aggregation- which worked nicely to reduce spam
[17:14:40] <jynus>	 but in some aspects lost specificity
[17:14:45] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:14:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42953 and previous config saved to /var/cache/conftool/dbconfig/20230110-171457-ladsgroup.json
[17:15:07] <jynus>	 it's a lose-lose situation, nothing will be perfect :-)
[17:26:57] <urandom>	 jynus: https://en.wikipedia.org/wiki/Kobayashi_Maru
[17:27:39] <jynus>	 nah, we can actually do better, it is just finding the time to improve stuff
[17:28:16] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: help
[17:28:17] <logmsgbot>	 !log ayounsi@deploy1002 deploy aborted: help (duration: 00m 01s)
[17:28:18] <jynus>	 Although now that I have you here, let me ask you something you may be able to help with making alerting better (I pm you)
[17:28:53] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[17:29:04] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 00m 11s)
[17:29:05] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Jclark-ctr) @marostegui  sorry yes I will need it depooled I did have to run out of data center  how long can it be depooled?  If you depooled it I can swap it today and bring it up tomorrow?
[17:29:51] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr yeah, it can be depooled for 24h without any problem. I will get it ready now for you.
[17:30:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42954 and previous config saved to /var/cache/conftool/dbconfig/20230110-173002-ladsgroup.json
[17:30:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T325652', diff saved to https://phabricator.wikimedia.org/P42955 and previous config saved to /var/cache/conftool/dbconfig/20230110-173027-marostegui.json
[17:30:30] <stashbot>	 T325652: Inbound interface errors - https://phabricator.wikimedia.org/T325652
[17:31:35] <wikibugs>	 (03PS1) 10Marostegui: db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878144
[17:32:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878144 (owner: 10Marostegui)
[17:32:53] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) @Jclark-ctr the host is ready for you to work on it anytime. I have left it ON, but the service is stopped, so if you need to power it off, you can do it anytime. Thanks!
[17:36:16] <icinga-wm>	 RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:37:32] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1130 maint', diff saved to https://phabricator.wikimedia.org/P42956 and previous config saved to /var/cache/conftool/dbconfig/20230110-173807-ladsgroup.json
[17:39:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:39:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10wiki_willy) Approved from my end.  Thanks!
[17:39:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:42:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH)
[17:42:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH)
[17:44:27] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Ah, thank you @Jelto, that's what I needed to know. That username allowed me to login.   @Ottomata can I have yo...
[17:48:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1229 - https://phabricator.wikimedia.org/T326661 (10Marostegui)
[17:48:21] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[17:48:36] <claime>	 !log Finished rolling reboots of eqiad appservers 
[17:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:45] <zabe>	 !log run populateCulActor on all wikis # T325484
[17:51:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:49] <stashbot>	 T325484: Run PopulateCulActor on all wikis - https://phabricator.wikimedia.org/T325484
[17:55:21] <wikibugs>	 (03PS2) 10Jbond: puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773
[17:55:26] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.ganeti.reimage for host kubestagemaster2001.codfw.wmnet with OS bullseye
[17:57:32] <wikibugs>	 (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[17:59:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) Thank you.
[17:59:37] <wikibugs>	 (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:59:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: allow to specify the exact disabled message [software/spicerack] - 10https://gerrit.wikimedia.org/r/869773 (owner: 10Jbond)
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1800)
[18:00:53] <wikibugs>	 (03PS1) 10BCornwall: varnish: Alert on high thread count [alerts] - 10https://gerrit.wikimedia.org/r/878166 (https://phabricator.wikimedia.org/T323723)
[18:01:38] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2001.codfw.wmnet with OS bullseye
[18:01:41] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye
[18:01:49] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10Ottomata) Done, you should have an email at khurd@wikimedia.org with instructions.
[18:06:28] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2001.codfw.wmnet with reason: host reimage
[18:06:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:07:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:07:02] <jayme>	 that's me
[18:07:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:07:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:08:24] <wikibugs>	 (03PS1) 10Dzahn: netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167
[18:09:21] <mutante>	 jayme: thanks, ACK
[18:09:36] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2001.codfw.wmnet with reason: host reimage
[18:10:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn)
[18:11:11] <wikibugs>	 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) @Slaporte I got the technical ok to deploy the new version check. Here is...
[18:11:26] <wikibugs>	 (03CR) 10Dzahn: "Duplicate declaration: File[/var/lib/scap] is already declared  ... duuuh :/" [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn)
[18:12:16] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] netbox: add scap::target to allowing scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn)
[18:12:34] <wikibugs>	 (03PS2) 10Dzahn: netbox: add scap::target to allow scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167
[18:12:46] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:13:47] <wikibugs>	 (03Abandoned) 10Dzahn: netbox: add scap::target to allow scap self-deployment [puppet] - 10https://gerrit.wikimedia.org/r/878167 (owner: 10Dzahn)
[18:15:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10RobH)
[18:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:15:27] <Krinkle>	 mutante: I missed the puppet window I think? Or maybe it's not every week?
[18:16:13] <mutante>	 Krinkle: hmm. there is one on the deployment calendar but I rarely ever see those being used. what do you have?
[18:16:30] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[18:16:32] <mutante>	 I see.. eh.. looking
[18:16:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[18:17:07] <zabe>	 jouncebot: refresh
[18:17:07] <jouncebot>	 I refreshed my knowledge about deployments.
[18:17:34] <mutante>	 I am not sure if the bot failed to ping..but I am looking at the first patch
[18:18:40] <zabe>	 <jouncebot> jbo.nd and r.zl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700).
[18:18:47] <mutante>	 hmm yea.. so.. it does not actually have reviews
[18:19:32] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2001.codfw.wmnet with reason: host reimage
[18:19:43] <mutante>	 while I am willing to do that.. puppet window would be for stuff that is already +1
[18:19:57] <mutante>	 or just the regular gerrit process without having to be in any window
[18:20:56] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc
[18:21:08] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714)
[18:21:10] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878169 (https://phabricator.wikimedia.org/T314714)
[18:21:35] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[18:22:44] <wikibugs>	 (03PS4) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[18:23:21] <Krinkle>	 mutante: ah, I see, I didn't realize. Okay, thanks!
[18:23:34] <mutante>	 Krinkle: I looked at the 3 patches but I am not comfortable merging those. they don't have +1 and they touch mediawiki core lib, site-wide apache config and security. sorry
[18:23:39] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc
[18:23:44] <mutante>	 though the doc one I might be talked into ..
[18:23:56] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host kubestagemaster2001.codfw.wmnet with OS bullseye
[18:24:09] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal
[18:24:20] <jayme>	 me again, sorry
[18:24:30] <mutante>	 Krinkle: I would do the "relax CSP rules for taint demo" if you need it though?
[18:25:32] <jayme>	 I'm reimaging the staging-codfw k8s cluster - and I should probably have said that before, sorry d.enisse|m.utante
[18:27:53] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.69:30443]) https://wikitech.wikimedia.org/wiki/PyBal
[18:28:05] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:28:43] <Krinkle>	 mutante: yeah, that'd be nice to close out the Phan work
[18:29:09] <mutante>	 well, you left a pretty detailed explanation, so yea
[18:29:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[18:29:25] <mutante>	 also since that is doc hosts and not global
[18:29:27] <mutante>	 doing it
[18:29:36] <wikibugs>	 (03PS1) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[18:29:38] <wikibugs>	 (03CR) 10MVernon: "CI run with this and the associated puppet change -" [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[18:29:47] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes-staging,service=kubesvc
[18:29:58] <wikibugs>	 (03CR) 10MVernon: "CI run of this and the labs/private change - https://puppet-compiler.wmflabs.org/output/868721/39048/" [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[18:30:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[18:32:13] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Jclark-ctr) @marostegui  sfp-t has been replaced  Let me know if you still see errors
[18:33:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:34:14] <mutante>	 Krinkle: it has been deployed on doc1002 and doc2001. puppet did refresh (but not hard restart) apache
[18:34:33] <mutante>	 go ahead and test if you want 
[18:34:34] <wikibugs>	 (03PS1) 10Jbond: dhcp: disable no-member check [software/spicerack] - 10https://gerrit.wikimedia.org/r/878172
[18:34:37] <Krinkle>	 thx
[18:34:38] <Krinkle>	 checking
[18:35:49] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2001.codfw.wmnet with OS bullseye
[18:35:53] <Krinkle>	 mutante: confirmed, the new header is coming through
[18:36:12] <mutante>	 Krinkle: cool, good
[18:36:22] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) Unfortunately, merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/863406/ has caused logspam every ten minutes in /var/log/messages.   ` 03:27 <vgutierrez> brett: BTW.....
[18:38:05] <jinxer-wm>	 (ConfdResourceFailed) resolved: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-staging.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:38:10] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye
[18:38:55] <wikibugs>	 (03PS2) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[18:39:03] <wikibugs>	 (03PS3) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[18:39:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[18:43:56] <wikibugs>	 (03PS4) 10Papaul: Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[18:44:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Jennier Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[18:44:38] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2052.codfw.wmnet
[18:49:51] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2052.codfw.wmnet
[18:56:19] <wikibugs>	 (03PS5) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[18:58:05] <wikibugs>	 (03PS2) 10JMeybohm: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943)
[18:58:07] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176
[18:59:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Dzahn) @Papaul please still add the key here on the ticket
[18:59:37] <wikibugs>	 (03CR) 10Dzahn: "fixed date format which made CI downvote, confirmed UID in LDAP, has approval, can't check SSH key though, but otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:00:04] <jouncebot>	 jeena and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1900).
[19:00:11] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] admin: Add Jennifer Hancock to the datacenter-ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:01:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[19:02:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[19:02:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42958 and previous config saved to /var/cache/conftool/dbconfig/20230110-190235-ladsgroup.json
[19:02:42] <jeena>	 train is delayed on some patches and will resume once they have applied successfully
[19:02:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[19:02:52] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39049/console" [puppet] - 10https://gerrit.wikimedia.org/r/877263 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall)
[19:03:12] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176 (owner: 10JMeybohm)
[19:07:16] <wikibugs>	 (03CR) 10JMeybohm: sre.ganeti.reimage: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[19:08:19] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2053.codfw.wmnet
[19:08:41] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/878176 (owner: 10JMeybohm)
[19:09:23] <wikibugs>	 (03PS3) 10JMeybohm: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943)
[19:10:54] <wikibugs>	 (03PS1) 10Effie Mouzeli: site: Remove retired mc* hosts [puppet] - 10https://gerrit.wikimedia.org/r/878177 (https://phabricator.wikimedia.org/T313733)
[19:12:53] <wikibugs>	 (03PS6) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[19:15:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2053.codfw.wmnet
[19:16:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks for fixing my oversight and reviews" [puppet] - 10https://gerrit.wikimedia.org/r/877274 (owner: 10Dzahn)
[19:17:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42962 and previous config saved to /var/cache/conftool/dbconfig/20230110-191740-ladsgroup.json
[19:19:03] <wikibugs>	 (03CR) 10Dzahn: ""The Affiliations Committee will be on leave from December 21st to January 6th and will reply once we return. Please send an email to this" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn)
[19:19:32] <wikibugs>	 (03PS7) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[19:19:49] <wikibugs>	 (03PS8) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[19:20:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:21:05] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) Thanks John! I will wait for @ayounsi to confirm before repooling this host.
[19:21:57] <wikibugs>	 (03PS1) 10Ottomata: flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519)
[19:22:45] <wikibugs>	 (03PS9) 10Papaul: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649)
[19:23:41] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[19:24:01] <wikibugs>	 (03CR) 10Dzahn: "so.. thanks Majavah. that was correct, it should be uid.  that being said, Jennifer has 2 users in LDAP, both with the same email address." [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:29:05] <wikibugs>	 (03Merged) 10jenkins-bot: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[19:29:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1158 maint', diff saved to https://phabricator.wikimedia.org/P42963 and previous config saved to /var/cache/conftool/dbconfig/20230110-192929-ladsgroup.json
[19:30:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:31:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:31:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:31:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[19:31:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[19:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:31:54] <wikibugs>	 (03PS1) 10BCornwall: varnish: Revert export of Prometheus params [puppet] - 10https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723)
[19:31:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:32:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:32:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42964 and previous config saved to /var/cache/conftool/dbconfig/20230110-193245-ladsgroup.json
[19:32:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:32:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42965 and previous config saved to /var/cache/conftool/dbconfig/20230110-193253-ladsgroup.json
[19:35:13] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:37:50] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.32.0" for 1 hosts
[19:37:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:38:01] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.32.0" completed for 1 hosts
[19:38:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:38:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:39:01] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Adjust new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/878182 (https://phabricator.wikimedia.org/T326661)
[19:39:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:42:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:42:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Adjust new eqiad hosts [puppet] - 10https://gerrit.wikimedia.org/r/878182 (https://phabricator.wikimedia.org/T326661) (owner: 10Marostegui)
[19:43:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:44:11] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10ayounsi) https://librenms.wikimedia.org/graphs/to=1673379600/id=15307/type=port_errors/from=1673293200/ looks good
[19:45:05] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) Cool, repooling then. @ayounsi do you want me to close this ticket once I am done?
[19:45:12] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39050/console" [puppet] - 10https://gerrit.wikimedia.org/r/878180 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall)
[19:45:18] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/878149
[19:47:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/878149 (owner: 10Marostegui)
[19:47:07] <wikibugs>	 (03PS1) 10JMeybohm: staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340)
[19:47:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42968 and previous config saved to /var/cache/conftool/dbconfig/20230110-194750-ladsgroup.json
[19:47:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42969 and previous config saved to /var/cache/conftool/dbconfig/20230110-194756-root.json
[19:47:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42970 and previous config saved to /var/cache/conftool/dbconfig/20230110-194757-ladsgroup.json
[19:49:19] <wikibugs>	 (03PS1) 10Eevans: cassandra_dev: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/878186
[19:49:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[19:49:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[19:51:31] <wikibugs>	 (03PS10) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:51:37] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[19:52:22] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki)
[19:52:43] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 01m 06s)
[19:54:04] <wikibugs>	 (03PS1) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004)
[19:54:10] <wikibugs>	 (03CR) 10Dzahn: admin: Add Jennifer Hancock to the datacenter-ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:54:30] <wikibugs>	 (03CR) 10Dzahn: "I think it's ok now." [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul)
[19:55:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[19:55:31] <wikibugs>	 (03PS2) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004)
[19:57:49] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) @Jclark-ctr please note that mc2020 and mc2021 are probably still bootable due to a failure during running the decomm script
[19:58:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm)
[19:58:20] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) a:05jijiki→03Jclark-ctr
[19:58:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2054.codfw.wmnet
[20:00:23] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet
[20:00:58] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9
[20:01:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:01:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:02:40] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@ef7451d]: netbox-next to 3.2.9 (duration: 01m 42s)
[20:03:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42971 and previous config saved to /var/cache/conftool/dbconfig/20230110-200301-root.json
[20:03:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42972 and previous config saved to /var/cache/conftool/dbconfig/20230110-200302-ladsgroup.json
[20:03:20] <wikibugs>	 (03Merged) 10jenkins-bot: staging-codfw: Update coredns to 1.8.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878184 (https://phabricator.wikimedia.org/T326340) (owner: 10JMeybohm)
[20:04:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:04:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:05:41] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2054.codfw.wmnet
[20:06:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:07:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:07:16] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet
[20:08:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:08:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:08:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[20:08:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[20:16:10] <wikibugs>	 (03PS1) 10JMeybohm: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340)
[20:17:51] <wikibugs>	 (03PS2) 10JMeybohm: Add istio config for main/wikikube clusters on k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/878190 (https://phabricator.wikimedia.org/T326340)
[20:18:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42974 and previous config saved to /var/cache/conftool/dbconfig/20230110-201806-root.json
[20:18:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42975 and previous config saved to /var/cache/conftool/dbconfig/20230110-201807-ladsgroup.json
[20:18:33] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[20:18:36] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[20:26:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:26:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:28:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:28:27] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:28:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:29:11] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:31:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:31:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:31:38] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:32:33] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:33:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42976 and previous config saved to /var/cache/conftool/dbconfig/20230110-203311-root.json
[20:33:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:36:53] <wikibugs>	 (03PS1) 10Dzahn: Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150
[20:37:00] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[20:37:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[20:48:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42977 and previous config saved to /var/cache/conftool/dbconfig/20230110-204816-root.json
[20:50:55] <wikibugs>	 10ops-eqiad, 10DBA: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10Marostegui) 05Open→03Resolved The host is repooled. Closing. Thanks everyone!
[20:51:36] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki)
[20:51:48] <wikibugs>	 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki)
[20:51:56] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki)
[20:52:38] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) 05Open→03Resolved a:03jijiki Bluntly closing this as we are moving mediawiki to kubernetes
[20:52:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "we confirmed the maintenance has been declared over" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn)
[20:53:19] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) 05Open→03Resolved
[20:54:53] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn)
[20:55:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/878150 (owner: 10Dzahn)
[20:55:44] <mutante>	 !log repooling eqsin
[20:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T2100).
[21:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:29] <MatmaRex>	 hi
[21:01:46] <jeena>	 MatmaRex: wmf.18 hasn't been deployed yet but that shouldn't affect you, right?
[21:02:52] <MatmaRex>	 it shouldn't
[21:03:04] <jeena>	 👍
[21:03:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42978 and previous config saved to /var/cache/conftool/dbconfig/20230110-210321-root.json
[21:05:24] <MatmaRex>	 is anyone available to do the deployment for me? :)
[21:05:27] <zabe>	 I can deploy if no one else is around
[21:05:34] <jeena>	 i can also
[21:06:03] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński)
[21:06:16] <mutante>	 zabe: first deploy?:) congrats!
[21:06:19] <wikibugs>	 (03PS3) 10Zabe: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004)
[21:06:23] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[21:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: Use new DiscussionTools heading markup on group2 wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878168 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński)
[21:07:03] <wikibugs>	 (03Merged) 10jenkins-bot: Start reading from cul_actor on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878187 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[21:07:28] <zabe>	 actually no, but still thanks :)
[21:08:03] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]]
[21:08:08] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:08:08] <stashbot>	 T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714
[21:09:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) Everyone, you all are awesome. Thank you for all the help and assistance. I will close this ticket!
[21:09:44] <wikibugs>	 (03PS3) 10Dzahn: scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277
[21:09:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10KHurd-WMF) 05In progress→03Resolved
[21:09:51] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe and matmarex: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:10:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn)
[21:10:50] <wikibugs>	 (03CR) 10Dzahn: "I am doing it this way with assert_type() because you said on another change you think the UID should not be a class parameter.. but I sti" [puppet] - 10https://gerrit.wikimedia.org/r/877277 (owner: 10Dzahn)
[21:11:06] <zabe>	 MatmaRex, can you test?
[21:11:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet
[21:11:22] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2055.codfw.wmnet
[21:11:42] <MatmaRex>	 zabe: yeah i was just looking. everything is working correctly
[21:11:55] <zabe>	 nice, syncing
[21:12:27] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:12:51] <wikibugs>	 (03CR) 10Dzahn: "same here, I am using assert_type to have it both ways, validate data but also not make the UID a class parameter.. because you said so el" [puppet] - 10https://gerrit.wikimedia.org/r/877276 (owner: 10Dzahn)
[21:14:08] <wikibugs>	 (03PS4) 10Dzahn: scap: assert data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877277
[21:14:57] <wikibugs>	 (03PS2) 10Dzahn: phabricator: use specific data type for UIDs [puppet] - 10https://gerrit.wikimedia.org/r/877275
[21:17:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:17:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/877277/39051/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/877275 (owner: 10Dzahn)
[21:17:25] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2055.codfw.wmnet
[21:17:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "cc: Hashar we now have a data type to validate those" [puppet] - 10https://gerrit.wikimedia.org/r/877275 (owner: 10Dzahn)
[21:17:59] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet
[21:18:12] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878168|Use new DiscussionTools heading markup on group2 wikis except enwiki (T314714)]], [[gerrit:878187|Start reading from cul_actor on group1 wikis (T233004)]] (duration: 10m 08s)
[21:18:16] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:18:16] <stashbot>	 T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714
[21:18:26] <zabe>	 MatmaRex, should be live
[21:18:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P42979 and previous config saved to /var/cache/conftool/dbconfig/20230110-211826-root.json
[21:18:35] <MatmaRex>	 thanks zabe
[21:19:51] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet
[21:20:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet
[21:20:07] <zabe>	 yw
[21:21:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:53] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:22:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:27:15] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet
[21:27:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet
[21:28:09] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:28:21] <wikibugs>	 (03PS4) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749)
[21:28:52] <jeena>	 If there are no more backports I would like to deploy the train now
[21:29:57] <wikibugs>	 (03PS5) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749)
[21:32:10] <wikibugs>	 (03PS6) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749)
[21:33:20] <wikibugs>	 (03CR) 10Herron: slo_dashboards: dynamic slo dashboard panels (033 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron)
[21:34:30] <zabe>	 jeena, I'm done with deploying, so I think you can go ahead
[21:34:44] <jeena>	 Thanks zabe 
[21:35:48] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581)
[21:35:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot)
[21:36:30] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878199 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot)
[21:36:51] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.18  refs T325581
[21:36:55] <stashbot>	 T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581
[21:52:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[21:52:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10nskaggs) 05In progress→03Resolved As https://wikitech.wikimed...
[21:54:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[21:54:42] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[21:54:48] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet
[21:56:18] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[21:56:24] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] flink - include examples in image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/878178 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[22:00:02] <wikibugs>	 (03PS1) 10Marostegui: db1206: No longer testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/878202 (https://phabricator.wikimedia.org/T326669)
[22:00:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1206: No longer testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/878202 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui)
[22:01:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[22:01:34] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet
[22:02:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH)
[22:02:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH)
[22:04:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:05:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10RobH)
[22:08:30] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203
[22:09:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206 T325046', diff saved to https://phabricator.wikimedia.org/P42980 and previous config saved to /var/cache/conftool/dbconfig/20230110-220942-marostegui.json
[22:09:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[22:09:46] <stashbot>	 T325046: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046
[22:09:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:10:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[22:10:10] <wikibugs>	 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) 05Stalled→03Open a:05Marostegui→03Jclark-ctr @Jclark-ctr we want to test that the RAID monitoring works fine. Can you pull out a hard disk...
[22:10:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203 (owner: 10Dzahn)
[22:10:52] <wikibugs>	 (03PS2) 10Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311)
[22:11:16] <wikibugs>	 (03CR) 10Dzahn: "let's add some tests first -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/878203" [puppet] - 10https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) (owner: 10Dzahn)
[22:12:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: 10Dzahn)
[22:12:46] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:13:20] <wikibugs>	 (03PS3) 10Dzahn: httpbb: add tests for phabricator, git, bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311)
[22:15:15] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:16:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:17:34] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[22:18:02] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add SPDX license headers for some test files [puppet] - 10https://gerrit.wikimedia.org/r/878205
[22:18:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:18:46] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[22:21:55] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.18  refs T325581 (duration: 45m 04s)
[22:21:59] <stashbot>	 T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581
[22:22:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:23:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH)
[22:24:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) @bblack,  The ordering task had the racking details populated by @kofori but I suspect there is a mistake in them.  This order and racking is to replace dns100[12] and authdns1001...
[22:28:39] <wikibugs>	 (03PS1) 10Zabe: Start writing to rev_comment_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954)
[22:28:40] <logmsgbot>	 !log jhuneidi@deploy1002 Pruned MediaWiki: 1.40.0-wmf.14, 1.40.0-wmf.13 (duration: 02m 35s)
[22:29:57] <wikibugs>	 (03CR) 10Dzahn: "people in CC, I don't expect you to actually review the assertions, I have tested those, but I wanted to share this is a thing and that we" [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: 10Dzahn)
[22:30:34] <wikibugs>	 (03PS2) 10Zabe: Start writing to rev_comment_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954)
[22:34:09] <jeena>	 deploying to group0 now
[22:34:33] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581)
[22:34:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot)
[22:35:17] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/878208 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot)
[22:37:04] <wikibugs>	 (03CR) 10Dzahn: "how this is used:" [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: 10Dzahn)
[22:38:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/878203/39055/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/878203 (https://phabricator.wikimedia.org/T324311) (owner: 10Dzahn)
[22:38:52] <icinga-wm>	 PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:31] <mutante>	 win 11
[22:40:00] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:42:41] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Traffic: Q3:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:42:49] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247
[22:42:51] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:42:52] <stashbot>	 T324247: [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms  - https://phabricator.wikimedia.org/T324247
[22:42:54] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.18  refs T325581
[22:42:57] <stashbot>	 T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581
[22:43:35] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH)
[22:44:56] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:03BBlack @bblack,  The racking details provided on ordering task T325230 list hostnames dns200[345] for this, but they are replacing dns200[12] and authdns2001.  Should these instead b...
[22:45:25] <wikibugs>	 (03PS1) 10Ottomata: Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576)
[22:47:29] <wikibugs>	 (03CR) 10Ottomata: "Probably have a bunch of things wrong here; I've never written a new helmfile service for the dse-k8s-cluster." [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[22:48:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add flink-app-example service [deployment-charts] - 10https://gerrit.wikimedia.org/r/878210 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[22:52:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:57:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:04:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns100[345] - https://phabricator.wikimedia.org/T326685 (10RobH) a:05BBlack→03Jclark-ctr >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >...
[23:04:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic: Q4:rack/setup/install dns200[345] - https://phabricator.wikimedia.org/T326688 (10RobH) a:05BBlack→03Papaul >>! In T325231#8514793, @KOfori wrote: >>>! In T325231#8514647, @RobH wrote: >>>>! In T325231#8514232, @KOfori wrote: >>> @RobH looks good. Approved. >>  >...
[23:17:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:18:01] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH)
[23:18:34] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH)
[23:19:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH)
[23:19:50] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search: Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (RobH)
[23:19:58] <zabe>	 jouncebot, nowandnext
[23:19:59] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 40 minute(s)
[23:19:59] <jouncebot>	 In 7 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230111T0700)
[23:20:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[23:20:48] <wikibugs>	 (Merged) jenkins-bot: Start writing to rev_comment_id on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/878207 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[23:21:16] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]]
[23:21:20] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[23:22:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:22:54] <logmsgbot>	 !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[23:24:50] <icinga-wm>	 RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:27:09] <wikibugs>	 SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Dzahn) We can confirm we served a lot of 5xx's in a time span from about 21:00 to 21:05 UTC yesterday.  The reason was an overloaded data...
[23:30:56] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:878207|Start writing to rev_comment_id on test wikis (T299954)]] (duration: 09m 39s)
[23:30:59] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[23:33:02] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:36] <wikibugs>	 SRE, Discovery-Search, Elasticsearch, Wikidata, and 2 others: wbsgetsuggestions not returning anything on Wikidata - https://phabricator.wikimedia.org/T326590 (Dzahn) Open→Resolved a:Dzahn The actual incident is over, it was mitigated within minutes.   Regarding the report it's still...
[23:37:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:39:37] <mutante>	 I did touch the httpbb tests but not those for appservers.. making sure that is not me ^
[23:46:33] <mutante>	 !log cumin2002 - sudo systemctl status httpbb_hourly_appserver
[23:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:57] <mutante>	 yea, that was unrelated 
[23:47:16] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:57:10] <wikibugs>	 (CR) BCornwall: [V: +1] "PS3 → PS4 omits the thread_pools parameter based on discussion on IRC, but I'm thinking that it's valuable to keep it around just in case " [puppet] - https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:58:30] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [integration/docroot@b7c82a3]: (no justification provided)
[23:58:45] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [integration/docroot@b7c82a3]: (no justification provided) (duration: 00m 15s)