[00:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:28:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [01:19:27] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (03PS2) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [01:47:17] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:04] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T0200) [02:00:59] (03PS3) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [02:05:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:06:55] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:07:03] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.4 [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/837765 (https://phabricator.wikimedia.org/T314193) [02:07:43] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.4 [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/837765 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:09:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:09:32] (03PS4) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [02:12:47] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:13:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:13:28] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:13:35] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:20:53] (03PS5) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [02:22:31] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.4 [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/837765 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [02:24:14] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:28:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:30:33] (03PS6) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [02:30:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:49] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [02:31:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:33:59] (03CR) 10CI reject: [V: 04-1] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:38:23] (03PS7) 10Andrew Bogott: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 [02:57:13] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T0300) [03:07:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:07:09] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:09:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:12:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:14:47] RECOVERY - SSH on mw1334.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:14:58] (KubernetesAPILatency) resolved: High Kubernetes API 
latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:58:29] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:05:11] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:28:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:28:14] (03CR) 10Marostegui: [C: 03+2] admin: Revoke my ssh key temporarily [puppet] - 10https://gerrit.wikimedia.org/r/837079 (owner: 10Ladsgroup) [05:34:46] (03PS1) 10Marostegui: Revert "Revert "Revert "db1189: Disable notifications""" [puppet] - 10https://gerrit.wikimedia.org/r/837722 [05:35:11] (03CR) 10Marostegui: "I am repooling this host after the data check and the DIMM replacement" [puppet] - 10https://gerrit.wikimedia.org/r/837722 (owner: 10Marostegui) [05:35:55] (03CR) 10Marostegui: [C: 03+2] Revert "Revert "Revert "db1189: Disable notifications""" [puppet] - 10https://gerrit.wikimedia.org/r/837722 (owner: 10Marostegui) [05:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: After HW maintenance', diff saved to 
https://phabricator.wikimedia.org/P35322 and previous config saved to /var/cache/conftool/dbconfig/20221004-053623-root.json [05:38:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:49:23] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:51:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 3%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35323 and previous config saved to /var/cache/conftool/dbconfig/20221004-055128-root.json [06:00:04] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T0600). [06:06:29] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:06:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35324 and previous config saved to /var/cache/conftool/dbconfig/20221004-060633-root.json [06:21:30] (03PS3) 10Giuseppe Lavagetto: mediawiki::canary: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/835506 (https://phabricator.wikimedia.org/T318894) [06:21:32] (03PS2) 10Giuseppe Lavagetto: mediawiki::canary: cleanup php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/837681 (https://phabricator.wikimedia.org/T318894) [06:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35325 and previous config saved to /var/cache/conftool/dbconfig/20221004-062138-root.json [06:23:14] (03CR) 10Giuseppe Lavagetto: [C: 
03+2] mediawiki::canary: remove php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/835506 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [06:31:40] (03PS2) 10Matthias Mullie: [beta] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837731 (https://phabricator.wikimedia.org/T306883) [06:32:31] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 25885 [06:33:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 25885 [06:33:45] (03CR) 10Matthias Mullie: [C: 03+2] [beta] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837731 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [06:34:36] (03Merged) 10jenkins-bot: [beta] Show thumbnails on Special:Search for NS_FILE + PageImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837731 (https://phabricator.wikimedia.org/T306883) (owner: 10Matthias Mullie) [06:34:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::canary: cleanup php 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/837681 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [06:36:19] (03PS1) 10Giuseppe Lavagetto: mediawiki::canary: fix hiera label [puppet] - 10https://gerrit.wikimedia.org/r/838059 [06:36:42] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::canary: fix hiera label [puppet] - 10https://gerrit.wikimedia.org/r/838059 (owner: 10Giuseppe Lavagetto) [06:36:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35326 and previous config saved to /var/cache/conftool/dbconfig/20221004-063643-root.json [06:39:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:42:18] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:42:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:43:05] PROBLEM - php7.2-fpm service on mw2374 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:43:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:51:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35327 and previous config saved to /var/cache/conftool/dbconfig/20221004-065148-root.json [07:00:04] Amir1 and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:01:49] <_joe_> jouncebot: next [07:01:49] In 5 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300) [07:01:49] In 5 hour(s) and 58 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300) [07:06:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35328 and previous config saved to /var/cache/conftool/dbconfig/20221004-070653-root.json [07:10:49] (03CR) 10Elukey: [C: 03+2] Move kafka-logging1001 to PKI settings for TLS [puppet] - 10https://gerrit.wikimedia.org/r/837621 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [07:11:09] (03PS1) 10Muehlenhoff: Remove access for jmads [puppet] - 10https://gerrit.wikimedia.org/r/838061 [07:11:11] (03PS1) 10Muehlenhoff: Remove LDAP access for maryyang and ozhang [puppet] - 10https://gerrit.wikimedia.org/r/838062 [07:11:29] !log elukey@cumin1001 START - 
Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging1001.eqiad.wmnet with reason: Kafka PKI upgrade [07:11:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging1001.eqiad.wmnet with reason: Kafka PKI upgrade [07:14:50] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jmads [puppet] - 10https://gerrit.wikimedia.org/r/838061 (owner: 10Muehlenhoff) [07:15:51] (03PS2) 10Muehlenhoff: Remove LDAP access for maryyang and ozhang [puppet] - 10https://gerrit.wikimedia.org/r/838062 [07:16:27] (03PS3) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [07:16:29] (03PS2) 10Giuseppe Lavagetto: termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 [07:16:33] !log restart kafka on kafka-logging1001 to pick up its new PKI TLS cert [07:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:59] (03CR) 10CI reject: [V: 04-1] termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 (owner: 10Giuseppe Lavagetto) [07:18:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for maryyang and ozhang [puppet] - 10https://gerrit.wikimedia.org/r/838062 (owner: 10Muehlenhoff) [07:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P35329 and previous config saved to /var/cache/conftool/dbconfig/20221004-072158-root.json [07:22:16] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:10] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:27:58] (03PS1) 10Muehlenhoff: Remove access for nokafor [puppet] - 10https://gerrit.wikimedia.org/r/838065 [07:28:37] (03CR) 10CI reject: [V: 04-1] Remove access for nokafor [puppet] - 10https://gerrit.wikimedia.org/r/838065 (owner: 10Muehlenhoff) [07:31:45] (03PS2) 10Muehlenhoff: Remove access for nokafor [puppet] - 10https://gerrit.wikimedia.org/r/838065 [07:35:08] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for nokafor [puppet] - 10https://gerrit.wikimedia.org/r/838065 (owner: 10Muehlenhoff) [07:36:17] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: sync [07:36:47] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync [07:37:22] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:40] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:40:38] (03PS1) 10Marostegui: Revert "db2178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/838089 [07:49:33] (03CR) 10Marostegui: [C: 03+2] Revert "db2178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/838089 (owner: 10Marostegui) [07:49:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35330 and previous config saved to /var/cache/conftool/dbconfig/20221004-074955-root.json [07:52:55] !log installing libdatetime-timezone-perl updates (catching up with latest timezone changes) [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:18] (03PS2) 10Muehlenhoff: Also apply labweb->cloudweb rename for the Cumin alias [puppet] - 
10https://gerrit.wikimedia.org/r/836795 [07:59:49] (03PS1) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) [08:00:26] (03PS2) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) [08:00:52] (03CR) 10Muehlenhoff: [C: 03+2] Also apply labweb->cloudweb rename for the Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/836795 (owner: 10Muehlenhoff) [08:01:37] (03PS3) 10Muehlenhoff: mirrors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) [08:02:20] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37418/console" [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [08:02:48] (03CR) 10CI reject: [V: 04-1] p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [08:03:31] (03PS3) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) [08:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2181', diff saved to https://phabricator.wikimedia.org/P35331 and previous config saved to /var/cache/conftool/dbconfig/20221004-080338-root.json [08:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:04:19] (03PS1) 10Marostegui: db2181: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/838070 [08:05:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35332 and previous config saved to /var/cache/conftool/dbconfig/20221004-080500-root.json [08:05:50] (03CR) 10Muehlenhoff: [C: 03+2] mirrors: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837093 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:07:51] (03PS2) 10Muehlenhoff: grub: Update includes [puppet] - 10https://gerrit.wikimedia.org/r/836855 [08:10:00] (03CR) 10Marostegui: [C: 03+2] db2181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/838070 (owner: 10Marostegui) [08:12:00] (03CR) 10Muehlenhoff: [C: 03+2] grub: Update includes [puppet] - 10https://gerrit.wikimedia.org/r/836855 (owner: 10Muehlenhoff) [08:14:31] (03PS3) 10Hashar: Release 3.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (https://phabricator.wikimedia.org/T310458) (owner: 10Clément Goubert) [08:14:44] (03CR) 10Hashar: "I have amended the commit message to point to T310458" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (https://phabricator.wikimedia.org/T310458) (owner: 10Clément Goubert) [08:16:44] PROBLEM - Disk space on moscovium is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=80%): /tmp 179 MB (2% inode=80%): /var/tmp 179 MB (2% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=moscovium&var-datasource=eqiad+prometheus/ops [08:16:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: Upgrading [08:17:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: Upgrading [08:17:50] (03Abandoned) 10JMeybohm: Disable zipkin and tracing for wikikube clusters 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/837117 (https://phabricator.wikimedia.org/T318814) (owner: 10JMeybohm) [08:18:00] (03PS1) 10Marostegui: db2181: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/838074 (https://phabricator.wikimedia.org/T301879) [08:18:04] (03CR) 10JMeybohm: [C: 03+2] Enable additional envoy native metrics in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/837118 (owner: 10JMeybohm) [08:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35333 and previous config saved to /var/cache/conftool/dbconfig/20221004-082005-root.json [08:20:25] (03CR) 10Marostegui: [C: 03+2] db2181: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/838074 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [08:22:03] (03CR) 10JMeybohm: [C: 03+2] Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:22:14] (03Abandoned) 10Hashar: gerrit: remove unused mysql-connector-java lib [puppet] - 10https://gerrit.wikimedia.org/r/832344 (owner: 10Hashar) [08:23:48] (03PS11) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. 
[debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) [08:24:23] (03CR) 10Slyngshede: "This should resolve all comments 🙏" [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [08:24:59] 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Export confd template status as Prometheus metrics - https://phabricator.wikimedia.org/T319272 (10fgiunchedi) [08:26:03] (03Merged) 10jenkins-bot: Update calico-crds to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826270 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:28:16] PROBLEM - Check systemd state on cp1090 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ats-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [08:30:18] 10SRE, 10Traffic: CDN doesn't validate request-target - https://phabricator.wikimedia.org/T318676 (10Vgutierrez) 05Open→03Resolved [08:31:24] (03PS1) 10Marostegui: mariadb: Remove innodb-stats-sample-pages [puppet] - 10https://gerrit.wikimedia.org/r/838076 (https://phabricator.wikimedia.org/T318914) [08:31:47] RECOVERY - Check systemd state on cp1090 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:04] (03PS2) 10Marostegui: mariadb: Remove innodb-stats-sample-pages [puppet] - 10https://gerrit.wikimedia.org/r/838076 (https://phabricator.wikimedia.org/T318914) [08:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db2178 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35334 and previous config saved to /var/cache/conftool/dbconfig/20221004-083511-root.json [08:35:40] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:35:44] (03CR) 10JMeybohm: [C: 03+2] Update calico to v3.23.3 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:36:18] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove innodb-stats-sample-pages [puppet] - 10https://gerrit.wikimedia.org/r/838076 (https://phabricator.wikimedia.org/T318914) (owner: 10Marostegui) [08:37:08] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:15] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update calico to v3.23.3 [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:39:25] (03Merged) 10jenkins-bot: Update calico to v3.23.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:40:37] (03PS5) 10Clément Goubert: parsoid: Cleanup post php7.4 migration [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) [08:42:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:42:22] (03CR) 10FNegri: [C: 03+1] "LGTM. One question to better understand how this works: where are the "hostname" and "instance" values set? I see the phab mentions "alert" [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [08:44:55] 10SRE, 10SRE-Access-Requests: Remove old production ssh key for RelEng user - https://phabricator.wikimedia.org/T319274 (10jnuche) [08:45:00] (03PS4) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [08:45:02] (03PS3) 10Giuseppe Lavagetto: termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 [08:45:57] FYI, merging removal of php7.2 on codfw parse servers. This will cause a brief unavailability on puppet run, but we're not sending any traffic to them, and I don't expect them to alert. [08:46:01] !oncall [08:46:09] <_joe_> !oncall-now [08:46:09] Oncall now for team SRE, rotation business_hours: [08:46:09] v.gutierrez, v.olans [08:46:18] <_joe_> but also, see topic :P [08:46:36] I know :') [08:47:09] (03CR) 10Clément Goubert: [C: 03+2] parsoid: Cleanup post php7.4 migration [puppet] - 10https://gerrit.wikimedia.org/r/837116 (https://phabricator.wikimedia.org/T318946) (owner: 10Clément Goubert) [08:48:41] (03CR) 10CI reject: [V: 04-1] termbox: use the new mesh functions [deployment-charts] - 10https://gerrit.wikimedia.org/r/837672 (owner: 10Giuseppe Lavagetto) [08:49:53] (03PS1) 10David Caro: Revert "cloudbackups: run nfs backups from labstore1004 rather than 1005" [puppet] - 10https://gerrit.wikimedia.org/r/838090 [08:50:04] marostegui: did you read cumin cumin in some line above by any chance? 
[08:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35335 and previous config saved to /var/cache/conftool/dbconfig/20221004-085015-root.json [08:50:30] elukey: I did while I was using spicerack [08:51:02] marostegui: right you are always following cookbooks to the letter [08:51:27] hahaha [08:51:30] <_joe_> just one more thing [08:51:57] (03PS13) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 [08:52:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Upgrading [08:52:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Upgrading [08:53:06] PROBLEM - php7.2-fpm service on parse2012 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:54:02] PROBLEM - php7.2-fpm service on parse2019 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:54:14] Heh, welp [08:54:21] I should have puppet run on icinga [08:54:23] Sorry about that [08:55:14] PROBLEM - php7.2-fpm service on parse2011 is CRITICAL: CRITICAL - Expecting active but unit php7.2-fpm is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:55:47] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 20 hosts with reason: php7.2 removal [08:56:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 20 hosts with reason: php7.2 removal [08:56:13] Downtiming for now, since they have no traffic, just to cut on spam [08:56:30] I'll go work out how to remove that alert from icinga. [08:56:41] _joe_: Was that your "one more thing" ? 
[08:57:02] <_joe_> claime: nothing, I was shitposting [09:05:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35336 and previous config saved to /var/cache/conftool/dbconfig/20221004-090520-root.json [09:07:15] (03PS1) 10Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) [09:07:17] (03PS1) 10Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) [09:07:49] (03CR) 10CI reject: [V: 04-1] confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) (owner: 10Filippo Giunchedi) [09:08:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Upgrading [09:08:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Upgrading [09:09:34] (03PS1) 10JMeybohm: Add debian/README [debs/calico] - 10https://gerrit.wikimedia.org/r/838080 [09:12:09] Hey all! Any idea when the branch cut will take place today? I broke a thing, and I want to know if I have to revert. Would probably take three or four hours to fix...
[09:12:37] (03PS2) 10Filippo Giunchedi: confd: export template status as Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/838078 (https://phabricator.wikimedia.org/T319272) [09:12:39] (03PS2) 10Filippo Giunchedi: confd: install and run confd_prometheus_metrics [puppet] - 10https://gerrit.wikimedia.org/r/838079 (https://phabricator.wikimedia.org/T319272) [09:13:07] (03CR) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:14:12] (03PS2) 10JMeybohm: Enable additional envoy native metrics in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/837118 [09:15:55] duesen: already did.. it's in the deployment calendar [09:18:19] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add debian/README [debs/calico] - 10https://gerrit.wikimedia.org/r/838080 (owner: 10JMeybohm) [09:19:47] (03PS1) 10JMeybohm: Add debian/README [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/838082 [09:20:04] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add debian/README [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/838082 (owner: 10JMeybohm) [09:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35337 and previous config saved to /var/cache/conftool/dbconfig/20221004-092025-root.json [09:24:35] (03CR) 10Muehlenhoff: [C: 03+2] Add cookbook to perform rolling restart of maps [cookbooks] - 10https://gerrit.wikimedia.org/r/836790 (owner: 10Muehlenhoff) [09:24:59] (03PS1) 10Giuseppe Lavagetto: logging: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/838083 (https://phabricator.wikimedia.org/T271736) [09:25:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) [09:25:03] (03PS1) 10Giuseppe 
Lavagetto: mediawiki::php: remove php 7.2 from the servers [puppet] - 10https://gerrit.wikimedia.org/r/838085 (https://phabricator.wikimedia.org/T318894) [09:26:17] (03PS1) 10JMeybohm: aptrepo: Add bullseye components calico323 and kubernetes123 [puppet] - 10https://gerrit.wikimedia.org/r/838106 (https://phabricator.wikimedia.org/T307943) [09:26:44] (03PS2) 10JMeybohm: aptrepo: Add bullseye components calico323 and kubernetes123 [puppet] - 10https://gerrit.wikimedia.org/r/838106 (https://phabricator.wikimedia.org/T307943) [09:26:51] (03PS1) 10Btullis: Bump version of eventgate image that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/838107 (https://phabricator.wikimedia.org/T319261) [09:27:15] (03CR) 10CI reject: [V: 04-1] logging: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/838083 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [09:27:36] (03CR) 10CI reject: [V: 04-1] mediawiki::php: use only php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/838084 (https://phabricator.wikimedia.org/T318894) (owner: 10Giuseppe Lavagetto) [09:30:03] (03PS2) 10Giuseppe Lavagetto: logging: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/838083 (https://phabricator.wikimedia.org/T271736) [09:31:28] (03PS1) 10Marostegui: mariadb: Remove innodb_locks_unsafe_for_binlog [puppet] - 10https://gerrit.wikimedia.org/r/838108 (https://phabricator.wikimedia.org/T318914) [09:32:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Bump version of eventgate image that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/838107 (https://phabricator.wikimedia.org/T319261) (owner: 10Btullis) [09:32:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] logging: switch to php 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/838083 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [09:33:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove innodb_locks_unsafe_for_binlog [puppet] - 
10https://gerrit.wikimedia.org/r/838108 (https://phabricator.wikimedia.org/T318914) (owner: 10Marostegui) [09:33:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] C:httpd::mpm: Remove mod_php* for php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/828507 (owner: 10Clément Goubert) [09:34:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall!" [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:34:18] <_joe_> marostegui: you work too much [09:34:27] (03CR) 10Btullis: [C: 03+2] Bump version of eventgate image that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/838107 (https://phabricator.wikimedia.org/T319261) (owner: 10Btullis) [09:34:43] _joe_: do I? XD [09:34:53] <_joe_> I found puppet-merge locked by you [09:35:10] it should be fine now :p [09:35:29] 10SRE, 10Data-Engineering, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JMeybohm) This needs SRE support to depool eventstreams from one DC. helmfile destroy/helmfile appy... [09:35:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35338 and previous config saved to /var/cache/conftool/dbconfig/20221004-093530-root.json [09:35:50] (03PS3) 10Marostegui: mariadb: Remove innodb_large_prefix flag. 
[puppet] - 10https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879) [09:35:58] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for 20 hosts [09:36:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 20 hosts [09:36:31] And I'm done with 7.2 removal on parse, sorry for the spam [09:36:33] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore2001.codfw.wmnet with reason: Prep for reimage [09:36:48] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore2001.codfw.wmnet with reason: Prep for reimage [09:37:08] (03CR) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:37:47] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-codfw [09:38:15] (03Merged) 10jenkins-bot: Bump version of eventgate image that is in use [deployment-charts] - 10https://gerrit.wikimedia.org/r/838107 (https://phabricator.wikimedia.org/T319261) (owner: 10Btullis) [09:40:23] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS buster [09:40:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:42:03] !log deployed istio-ingressgateway with additional envoy native metrics to wikikube codfw and eqiad [09:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:51] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply 
[09:44:26] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [09:44:46] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [09:45:46] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [09:45:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:46:00] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [09:46:03] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [09:46:27] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [09:47:20] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [09:47:29] (03PS4) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) [09:47:31] (03CR) 10David Caro: p:wmcs::prometheus: overwrite instance with hostname if there (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:48:38] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37420/console" [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:50:01] (03PS3) 10Jelto: docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) [09:50:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:51:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [09:51:49] (03PS1) 10Vgutierrez: trafficserver: Log total time spent on a client request [puppet] - 10https://gerrit.wikimedia.org/r/838111 (https://phabricator.wikimedia.org/T317748) [09:52:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [09:52:29] (03PS2) 10Jelto: docker_registry_ha: add codfw Trusted Runners to jwt_allowed_ips [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) [09:52:31] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10MoritzMuehlenhoff) [09:52:42] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:52:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_registry_ha: add codfw Trusted Runners to jwt_allowed_ips [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [09:53:50] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37421/console" [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [09:56:21] (03PS1) 10Clément Goubert: httpbb: Remove PHP version routing tests [puppet] - 10https://gerrit.wikimedia.org/r/838112 (https://phabricator.wikimedia.org/T318894) [09:56:46] (03CR) 10Jelto: [V: 
03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37422/console" [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [09:57:01] (03CR) 10Clément Goubert: "As discussed, in preparation of php 7.2 complete removal." [puppet] - 10https://gerrit.wikimedia.org/r/838112 (https://phabricator.wikimedia.org/T318894) (owner: 10Clément Goubert) [09:59:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.maps.roll-restart (exit_code=1) rolling restart_daemons on A:maps-codfw [10:01:47] (03CR) 10Jelto: [V: 03+1 C: 03+2] docker_registry_ha: pass jwt_allowed_ips to docker_registry_ha::web [puppet] - 10https://gerrit.wikimedia.org/r/836135 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:02:01] (03CR) 10Jelto: [V: 03+1 C: 03+2] docker_registry_ha: add codfw Trusted Runners to jwt_allowed_ips [puppet] - 10https://gerrit.wikimedia.org/r/836139 (https://phabricator.wikimedia.org/T308501) (owner: 10Jelto) [10:02:26] 10SRE, 10Traffic, 10Patch-For-Review: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (10Vgutierrez) p:05Triage→03Medium This seems to happen every nine minutes both for upload and text nodes: ` vgutierrez@cp6016:~$... [10:04:21] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Log total time spent on a client request [puppet] - 10https://gerrit.wikimedia.org/r/838111 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [10:06:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) Moving to 'Troubleshoot' column on the #ops-eqiad board. @cmjohnson have you been able to look into this at all? Thanks. 
[10:15:06] (03CR) 10Jbond: cloudnet1005/1006: prepare for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [10:19:09] (03PS1) 10Muehlenhoff: Create dedicated aliases for maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/838113 [10:25:48] 10SRE, 10Traffic, 10Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (10Vgutierrez) Reported to upstream in https://github.com/apache/trafficserver/issues/9118 [10:37:05] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:37:52] (03PS1) 10Muehlenhoff: Enable Ganeti 3 on ganeti/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/838116 (https://phabricator.wikimedia.org/T311687) [10:41:42] !log installing expat security updates [10:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 9119 [10:43:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 9119 [10:43:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 135158 [10:44:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 135158 [10:46:04] (03CR) 10FNegri: [C: 03+1] p:wmcs::prometheus: overwrite instance with hostname if there (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [10:48:41] (03PS1) 10Arturo Borrero Gonzalez: cloudnet1005/1006: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/838117 (https://phabricator.wikimedia.org/T316284) [10:48:56] (03PS2) 10Arturo 
Borrero Gonzalez: cloudnet1005/1006: give them proper role [puppet] - 10https://gerrit.wikimedia.org/r/838117 (https://phabricator.wikimedia.org/T316284) [10:49:59] (03CR) 10Vgutierrez: "This was being tracked on https://gerrit.wikimedia.org/r/c/operations/puppet/+/769827," [puppet] - 10https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932) (owner: 10Zabe) [10:51:23] (03CR) 10Vgutierrez: [C: 03+1] "As mentioned by Zabe on I489e7cd8861e23feeb666bd082b110a12de4a8e0, this has been addressed by 22dca3a:" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [10:52:04] (03CR) 10Marostegui: "Btullis, I am merging this now. I don't think you'll have any issues. But if you do, this is perfectly safe to revert." [puppet] - 10https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [10:52:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove innodb_large_prefix flag. [puppet] - 10https://gerrit.wikimedia.org/r/837490 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [10:52:34] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/37423/" [puppet] - 10https://gerrit.wikimedia.org/r/838117 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [10:53:54] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye [10:54:44] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2001.codfw.wmnet with OS buster [10:54:49] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [10:54:52] (03PS2) 10Muehlenhoff: alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013) [10:55:12] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with 
OS bullseye [10:56:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2001.codfw.wmnet with OS buster [10:57:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/838106 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:57:58] (03CR) 10JMeybohm: [C: 03+2] aptrepo: Add bullseye components calico323 and kubernetes123 [puppet] - 10https://gerrit.wikimedia.org/r/838106 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:58:41] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [10:58:49] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [11:00:37] (03CR) 10Hnowlan: [C: 03+1] Create dedicated aliases for maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/838113 (owner: 10Muehlenhoff) [11:00:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [11:04:57] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [11:05:05] (03PS1) 10Muehlenhoff: Add ssh-agent-proxy processes to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/838118 (https://phabricator.wikimedia.org/T135991) [11:05:41] !log published calico 3.23.3 debian packages in bullseye component/calico323 as well as corresponding docker images - T307943 [11:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:44] T307943: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 [11:10:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2181.codfw.wmnet with reason: Upgrading [11:11:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 
on db2181.codfw.wmnet with reason: Upgrading [11:11:06] 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan) [11:12:27] 10SRE: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10hnowlan) [11:18:19] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx on archiva/proxy [puppet] - 10https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) [11:22:19] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [11:24:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [11:33:43] (03PS1) 10Muehlenhoff: Finetune various DE aliases [puppet] - 10https://gerrit.wikimedia.org/r/838122 [11:35:54] (03PS1) 10Elukey: Move kafka-logging1002's Kafka TLS config to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838123 (https://phabricator.wikimedia.org/T300130) [11:37:37] (03CR) 10Btullis: [C: 03+1] "Thanks. Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/838122 (owner: 10Muehlenhoff) [11:37:40] (03CR) 10Muehlenhoff: [C: 03+2] alerts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/837097 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:37:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37424/console" [puppet] - 10https://gerrit.wikimedia.org/r/838123 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [11:38:39] (03CR) 10Filippo Giunchedi: [C: 03+1] Move kafka-logging1002's Kafka TLS config to PKI [puppet] - 10https://gerrit.wikimedia.org/r/838123 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [11:38:48] (03CR) 10Muehlenhoff: [C: 03+2] Finetune various DE aliases [puppet] - 10https://gerrit.wikimedia.org/r/838122 (owner: 10Muehlenhoff) [11:40:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage [11:43:34] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:43:58] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2001.codfw.wmnet with reason: host reimage [11:50:31] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:50:39] (03PS1) 10Daniel Kinzler: Revert "Introduce LanguageVariantConverter" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 [11:50:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10QChris) While I fully support the goal to have the repo under Apache 2.0 (yay!), I cannot add my name to the list, due to commit 18144f1 :-( The above commit adds an 
MIT... [11:51:42] (03PS2) 10Daniel Kinzler: Revert "Introduce LanguageVariantConverter" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) [11:52:08] (03CR) 10David Caro: [V: 03+1] p:wmcs::prometheus: overwrite instance with hostname if there (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [11:55:34] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [11:57:09] (03CR) 10Hokwelum: [C: 03+1] "We used PCC to double-check check and it looks good! Thank you, Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/837101 (owner: 10Muehlenhoff) [11:59:37] (03CR) 10Muehlenhoff: [C: 03+2] snapshot: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/837101 (owner: 10Muehlenhoff) [12:00:51] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - 10https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [12:01:04] (03PS2) 10Muehlenhoff: Create dedicated aliases for maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/838113 [12:01:06] (03PS1) 10Marostegui: Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/838096 [12:01:34] (03CR) 10CI reject: [V: 04-1] Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/838096 (owner: 10Marostegui) [12:02:15] (03Abandoned) 10Marostegui: Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/838096 (owner: 10Marostegui) [12:02:17] (03PS1) 10Marostegui: db2181: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/838126 [12:03:41] (03CR) 10Marostegui: [C: 03+2] db2181: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/838126 (owner: 10Marostegui) [12:03:46] (03PS1) 10Slyngshede: Initial checkin. 
[software/charon] - 10https://gerrit.wikimedia.org/r/838127 [12:03:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:04:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35339 and previous config saved to /var/cache/conftool/dbconfig/20221004-120413-root.json [12:05:52] (03CR) 10Muehlenhoff: [C: 03+2] Create dedicated aliases for maps replicas [puppet] - 10https://gerrit.wikimedia.org/r/838113 (owner: 10Muehlenhoff) [12:08:21] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Merge cert-manager/sample-external-issuer@55b043b [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/828545 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [12:08:25] (03PS1) 10Muehlenhoff: sre.maps.roll-restart: Update aliases to only use replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/838130 [12:08:27] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host sessionstore2001.codfw.wmnet with OS buster [12:09:33] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:10:25] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1005.eqiad.wmnet with OS bullseye [12:11:05] (03CR) 10Awight: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/838129 (https://phabricator.wikimedia.org/T318678) (owner: 10Awight) [12:12:57] (03PS1) 10JMeybohm: Update cfssl-issuer image to v0.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838131 (https://phabricator.wikimedia.org/T310486) [12:13:08] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Daimona) Now that the php72 images were dropped, I think we o... [12:13:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:14:06] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] doc: add README.md [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 (owner: 10Clément Goubert) [12:14:13] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] doc: add README.md [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836816 (owner: 10Clément Goubert) [12:14:19] !log uploaded python3-gjson_0.1.0 to apt.wikimedia.org bullseye-wikimedia [12:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:43] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] Release 3.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (https://phabricator.wikimedia.org/T310458) (owner: 10Clément Goubert) [12:16:51] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Release 3.0.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/836730 (https://phabricator.wikimedia.org/T310458) (owner: 10Clément Goubert) [12:19:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35340 and previous config saved to /var/cache/conftool/dbconfig/20221004-121917-root.json [12:21:01] !log 
aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [12:26:01] !log cgoubert@deploy1002 Started deploy [docker-pkg/deploy@24fbee1]: Release 3.0.3 # T310458 [12:26:06] T310458: docker-pkg / docker downloads all versions of parent image upon building - https://phabricator.wikimedia.org/T310458 [12:26:16] !log cgoubert@deploy1002 Finished deploy [docker-pkg/deploy@24fbee1]: Release 3.0.3 # T310458 (duration: 00m 14s) [12:28:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:29:00] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [12:30:13] (03CR) 10Filippo Giunchedi: [C: 03+2] customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:30:26] !log cgoubert@deploy1002 Started deploy [docker-pkg/deploy@24fbee1]: Release 3.0.3 # T310458 [12:30:31] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] customscripts: export 'mgmt' entries from hiera_export [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/817739 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:31:24] !log cgoubert@deploy1002 Finished deploy [docker-pkg/deploy@24fbee1]: Release 3.0.3 # T310458 (duration: 00m 58s) [12:31:28] T310458: docker-pkg / docker downloads all versions of parent image upon building - https://phabricator.wikimedia.org/T310458 [12:34:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 5%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P35341 and previous config saved to /var/cache/conftool/dbconfig/20221004-123422-root.json [12:35:09] (03PS1) 10Clément Goubert: scap: remove deneb.codfw.wmnet [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/838132 (https://phabricator.wikimedia.org/T298463) [12:35:37] (03CR) 10Hashar: [C: 03+1] scap: remove deneb.codfw.wmnet [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/838132 (https://phabricator.wikimedia.org/T298463) (owner: 10Clément Goubert) [12:35:45] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] scap: remove deneb.codfw.wmnet [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/838132 (https://phabricator.wikimedia.org/T298463) (owner: 10Clément Goubert) [12:35:50] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] scap: remove deneb.codfw.wmnet [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/838132 (https://phabricator.wikimedia.org/T298463) (owner: 10Clément Goubert) [12:37:36] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bullseye [12:37:42] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye [12:39:09] !next [12:39:16] jouncebot next [12:39:16] In 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300) [12:39:16] In 0 hour(s) and 20 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300) [12:41:52] (03PS1) 10JMeybohm: Pin cert-manager and cfssl-issuer chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/838134 (https://phabricator.wikimedia.org/T310486) [12:41:56] (03PS1) 10JMeybohm: cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) [12:41:58] (03PS1) 10JMeybohm: cfssl-issuer: 
Bump CRD chart version for cfssl-issuer update [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 [12:42:00] (03PS1) 10JMeybohm: cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) [12:42:13] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 (owner: 10Hashar) [12:43:07] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 (owner: 10Hashar) [12:44:55] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8280847, @BBlack wrote: > I've also found some other breadcrumbs. Runtime buster + 5.10 support is puppetized in `modul... [12:48:15] (03PS1) 10Filippo Giunchedi: sre: write netbox-hiera common.yaml with mgmt data [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) [12:49:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35342 and previous config saved to /var/cache/conftool/dbconfig/20221004-124927-root.json [12:50:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838118 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:51:43] (03PS1) 10DCausse: eliastic: make gc log files rotate at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 [12:53:05] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [12:53:11] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [12:53:18] (03PS2) 
10DCausse: elastic: rotate gc log files at 20m [puppet] - 10https://gerrit.wikimedia.org/r/838141 [12:55:44] (03CR) 10Volans: [C: 04-1] "Few nits inline, I don't think it would work as-is, just minor details" [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [12:56:21] (03PS3) 10D3r1ck01: Revert "Introduce LanguageVariantConverter" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) (owner: 10Daniel Kinzler) [12:56:37] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [12:58:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10Traffic: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10MoritzMuehlenhoff) >>! In T319067#8276581, @BBlack wrote: > The question is why the Debian installer didn't load this automagically, and how we fix that s... [12:58:23] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300). [13:00:05] koi, awight, and xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1300) [13:00:12] o/ [13:00:16] o/ [13:00:36] o/ [13:00:41] I can deploy my patches, and would be happy to also deploy anyone else's if you wish? 
[13:01:02] I’m fine either way ^^ [13:01:16] awight please you can go ahead and deploy mine, thank you :) [13:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:05] duesen: maybe we can also deploy the LanguageVariantConverter revert if there’s time left in the window? [13:02:52] ah, nevermind, that’s the backport that xSavitar added :D [13:03:04] (I think it wasn’t there when I looked at the calendar earlier ^^) [13:03:05] Lucas_WMDE yes, that's it. [13:03:12] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10Volans) Interesting use case, I can see both pros and cons of doing that. I need also to check how easy is to detect the ack from the data exported. Would "ask... [13:03:13] xSavitar: thank you! [13:03:19] Lucas_WMDE I just added it not long ago [13:03:26] 👍 [13:03:41] awight: do you want to start or should I? [13:04:06] Lucas_WMDE: I'll do it--was just waiting for koi but I'll go ahead. [13:04:08] (03PS2) 10Filippo Giunchedi: sre.puppet.sync-netbox-hiera: write netbox-hiera common.yaml with mgmt data [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) [13:04:11] (03CR) 10Filippo Giunchedi: "Thank you for the quick review, appreciate it!" [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:04:15] ok thanks! 
[13:04:23] awight: that's great, please go ahead [13:04:28] ack [13:04:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35343 and previous config saved to /var/cache/conftool/dbconfig/20221004-130432-root.json [13:05:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837756 (https://phabricator.wikimedia.org/T319244) (owner: 10Stang) [13:05:29] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10jbond) @QChris thanks for the contribution and reaching out. >>! In T308013#8282636, @QChris wrote: > While I fully support the goal to have the repo under Apache 2.0 (ya... [13:06:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/838130 (owner: 10Muehlenhoff) [13:06:45] koi: Looks like the throttling exception would be hard to test, so I'll just check mwdebug for basic functionality. 
[13:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:12] make sense, thanks [13:07:52] (03CR) 10Awight: [C: 03+2] ukwiki: Create flood group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837757 (https://phabricator.wikimedia.org/T319243) (owner: 10Stang) [13:07:56] (03Merged) 10jenkins-bot: throttle: Add throttle rule for 2022-10-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837756 (https://phabricator.wikimedia.org/T319244) (owner: 10Stang) [13:08:12] (03CR) 10CI reject: [V: 04-1] Revert "Introduce LanguageVariantConverter" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) (owner: 10Daniel Kinzler) [13:08:37] !log awight@deploy1002 Started scap: Backport for [[gerrit:837756|throttle: Add throttle rule for 2022-10-13 (T319244)]] [13:08:41] T319244: Throttle rule for 2022-10-13 - Senior Citizens Write Wikipedia course - https://phabricator.wikimedia.org/T319244 [13:08:54] (03Merged) 10jenkins-bot: ukwiki: Create flood group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837757 (https://phabricator.wikimedia.org/T319243) (owner: 10Stang) [13:09:01] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:09:25] (03CR) 10D3r1ck01: "recheck, failure seems unrelated." 
[core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) (owner: 10Daniel Kinzler) [13:09:51] (03CR) 10Volans: [C: 03+1] "LGTM modulo the decision about the hieradata path" [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:11:32] !log awight@deploy1002 awight and stang: Backport for [[gerrit:837756|throttle: Add throttle rule for 2022-10-13 (T319244)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:12:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:14:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:14:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:16:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:51] Is it possible that "scap backport" is slower than "sync-file", maybe it's rsyncing the entire tree? [13:17:28] (03PS1) 10Jbond: C:puppetmaster: drop hieradata from the netbox common path [puppet] - 10https://gerrit.wikimedia.org/r/838144 (https://phabricator.wikimedia.org/T310266) [13:17:42] (03CR) 10Jbond: [C: 03+1] sre.puppet.sync-netbox-hiera: write netbox-hiera common.yaml with mgmt data (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:18:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/838144 (https://phabricator.wikimedia.org/T310266) (owner: 10Jbond) [13:18:25] (03CR) 10Filippo Giunchedi: [C: 03+2] sre.puppet.sync-netbox-hiera: write netbox-hiera common.yaml with mgmt data [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:18:40] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] sre.puppet.sync-netbox-hiera: write netbox-hiera common.yaml with mgmt data [cookbooks] - 10https://gerrit.wikimedia.org/r/838139 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35345 and previous config saved to /var/cache/conftool/dbconfig/20221004-131937-root.json [13:21:25] !log awight@deploy1002 Finished scap: Backport for [[gerrit:837756|throttle: Add throttle rule for 2022-10-13 (T319244)]] (duration: 12m 48s) [13:21:27] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff) >>! In T308013#8282942, @jbond wrote: > It is worth noting that currently we have taken the stance that if a file already has some licence on it we have... 
[13:21:29] T319244: Throttle rule for 2022-10-13 - Senior Citizens Write Wikipedia course - https://phabricator.wikimedia.org/T319244 [13:21:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837757 (https://phabricator.wikimedia.org/T319243) (owner: 10Stang) [13:21:54] !log awight@deploy1002 Started scap: Backport for [[gerrit:837757|ukwiki: Create flood group (T319243)]] [13:21:58] T319243: Create Flooder group on Ukrainian Wikipedia - https://phabricator.wikimedia.org/T319243 [13:21:59] (03PS1) 10Muehlenhoff: Fix typo in header [puppet] - 10https://gerrit.wikimedia.org/r/838147 [13:22:23] !log awight@deploy1002 awight and stang: Backport for [[gerrit:837757|ukwiki: Create flood group (T319243)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:23:12] (03PS2) 10Jbond: C:puppetmaster: drop hieradata from the netbox common path [puppet] - 10https://gerrit.wikimedia.org/r/838144 (https://phabricator.wikimedia.org/T310266) [13:23:34] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:23:37] (03CR) 10Jbond: [C: 03+2] C:puppetmaster: drop hieradata from the netbox common path [puppet] - 10https://gerrit.wikimedia.org/r/838144 (https://phabricator.wikimedia.org/T310266) (owner: 10Jbond) [13:24:09] !log disable puppet to deploy a puppetmaster change 838144 [13:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:09] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in header [puppet] - 10https://gerrit.wikimedia.org/r/838147 (owner: 10Muehlenhoff) [13:25:32] jbond: you can merge along my patch [13:25:42] moritzm: ack will do [13:25:51] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Inital FHRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218) (owner: 10Ayounsi) [13:25:53] (03PS2) 10Muehlenhoff: sre.maps.roll-restart: Update aliases to only use replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/838130 [13:27:11] !log awight@deploy1002 Finished scap: Backport for [[gerrit:837757|ukwiki: Create flood group (T319243)]] (duration: 05m 16s) [13:27:15] T319243: Create Flooder group on Ukrainian Wikipedia - https://phabricator.wikimedia.org/T319243 [13:27:28] koi: Patches deployed, thank you! [13:27:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:27:53] (03PS2) 10Awight: Wire new event stream for maps interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) [13:27:56] awight: thanks! 
[13:27:59] (03CR) 10TrainBranchBot: "Approved by awight@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:28:38] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update to wmf-netbx CR826559 - ayounsi@cumin1001 [13:29:29] (03Merged) 10jenkins-bot: Wire new event stream for maps interactions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/836804 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:29:42] !log awight@deploy1002 Started scap: Backport for [[gerrit:836804|Wire new event stream for maps interactions (T315972 T318678)]] [13:29:46] T318678: Add show nearby metrics: count clicks on button to open and on links to different articles - https://phabricator.wikimedia.org/T318678 [13:29:47] T315972: Metrics for new Kartographer feature usage - https://phabricator.wikimedia.org/T315972 [13:30:05] !log awight@deploy1002 awight and awight: Backport for [[gerrit:836804|Wire new event stream for maps interactions (T315972 T318678)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:30:12] (03CR) 10DCausse: [C: 04-1] Update beta eventgate hostname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838129 (https://phabricator.wikimedia.org/T318678) (owner: 10Awight) [13:30:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update to wmf-netbx CR826559 - ayounsi@cumin1001 [13:30:44] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10MoritzMuehlenhoff) >>! In T319277#8282939, @Volans wrote: > Interesting use case, I can see both pros and cons of doing that. 
I need also to check how easy is t... [13:30:51] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update to wmf-netbox - try 2 - CR826559 - ayounsi@cumin1001 [13:31:48] !log re-enable puppet post deploy a puppetmaster change 838144 [13:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:15] jbond: sweet, thank you! I'll try running the cookbook now [13:32:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update to wmf-netbox - try 2 - CR826559 - ayounsi@cumin1001 [13:33:18] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:21] (03PS1) 10Ssingh: hiera: upgrade cp hosts in eqsin to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838149 (https://phabricator.wikimedia.org/T309651) [13:34:30] (03CR) 10Muehlenhoff: [C: 03+2] sre.maps.roll-restart: Update aliases to only use replicas [cookbooks] - 10https://gerrit.wikimedia.org/r/838130 (owner: 10Muehlenhoff) [13:34:36] godog: no problem [13:34:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35346 and previous config saved to /var/cache/conftool/dbconfig/20221004-133442-root.json [13:34:45] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "filippo test - filippo@cumin1001" [13:35:04] !log filippo@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "filippo test - filippo@cumin1001" [13:35:21] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37427/console" [puppet] - 10https://gerrit.wikimedia.org/r/838149 
(https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:35:35] (03PS1) 10Awight: Log basic nearby and fullscreen events [extensions/Kartographer] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838097 (https://phabricator.wikimedia.org/T315972) [13:35:51] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [13:35:51] (03CR) 10Awight: [C: 03+2] "Deploying" [extensions/Kartographer] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838097 (https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:36:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:36:31] !log awight@deploy1002 Finished scap: Backport for [[gerrit:836804|Wire new event stream for maps interactions (T315972 T318678)]] (duration: 06m 49s) [13:36:36] T318678: Add show nearby metrics: count clicks on button to open and on links to different articles - https://phabricator.wikimedia.org/T318678 [13:36:36] T315972: Metrics for new Kartographer feature usage - https://phabricator.wikimedia.org/T315972 [13:36:42] (03CR) 10Muehlenhoff: [C: 03+2] Add ssh-agent-proxy processes to filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/838118 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:36:46] .7 [13:37:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:44] duesen: xSavitar: Deploying your revert now. [13:37:54] okay [13:38:18] ah--wmf.4 isn't deployed yet, so we'll just merge the cherry-pick. 
[13:38:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet1005/1006: prepare for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837631 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [13:38:24] Did you also want to change wmf.3? [13:38:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [13:38:40] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [13:38:56] awight, yes it seems that wmf.4 hasn't been deployed to any group yet. So a merge would be fine. [13:39:07] (03PS1) 10Btullis: Add a new production image for spark version 3.3.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [13:39:23] (03CR) 10Awight: [C: 03+2] "Merging backport ahead of train." [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) (owner: 10Daniel Kinzler) [13:39:28] as for wmf.3, not sure that is needed, or am I missing anything @duesen? [13:40:46] !log filippo@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [13:41:02] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: update prefix for common data [cookbooks] - 10https://gerrit.wikimedia.org/r/838153 [13:41:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ayounsi) From diffscan: ` STATUS HOST PORT PROTO OPREV CPREV DNS OPEN 198.35.26.7 22 tcp 0 6 dns4003.wikimedia.org ` That host is exposed to the world without properly config...
[13:41:41] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) @ayounsi @cmooney i am having space issue on msw1-codfw which is preventing me to copy the Junos image to /var/tmp. request system storage cleanup didn't... [13:42:01] (03PS1) 10Ssingh: hiera: upgrade cp hosts in esams to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838154 (https://phabricator.wikimedia.org/T309651) [13:42:07] godog: see https://gerrit.wikimedia.org/r/838153 [13:42:20] dancy: jeena: Brilliant work on "scap backport", quite a pleasure to use it today! [13:42:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/838153 (owner: 10Jbond) [13:42:29] jbond: ack, makes sense [13:42:34] !log EU backport window finished. [13:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:44] (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in eqsin to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838149 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:42:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37428/console" [puppet] - 10https://gerrit.wikimedia.org/r/838154 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:43:01] (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in esams to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838154 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:43:46] awight, verifying if we need a wmf.3 backport too [13:43:49] PROBLEM - cassandra-a service on sessionstore2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:43:55] 10SRE, 10ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (10ayounsi) There is an outstanding diff for 
cr3-ulsfo: `lang=diff [edit interfaces fxp0] - description "Core: msw1-ulsfo:12 {#1021}"; ` This seems to be because it lost its cable to the mgmt switch. msw1-ulsfo only have a couple... [13:44:01] PROBLEM - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:44:21] xSavitar: Thanks--I'll have to leave soon but maybe Lucas_WMDE is monitoring and can backport for wmf.3 if needed. [13:44:22] (03PS1) 10Ssingh: hiera: upgrade cp hosts in eqiad to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838155 (https://phabricator.wikimedia.org/T309651) [13:44:33] PROBLEM - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is CRITICAL: connect to address 10.192.16.95 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:44:34] mmhh I must have done sth wrong in the netbox script because it is taking forever (running interactively from netbox UI, the cookbook times out) [13:44:41] (03CR) 10Btullis: Add a new production image for spark version 3.3.0 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:44:51] ah no it actually works, "just" slow now [13:45:08] just as a heads-up: vgutierrez and I will be upgrading to ATS9 on all cp hosts in eqsin, esams, eqiad today. no impact expected and the caches should be preserved. 
see T309651 [13:45:08] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [13:45:12] godog: yes the netbox api is not the fastest [13:45:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37429/console" [puppet] - 10https://gerrit.wikimedia.org/r/838155 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:45:47] xSavitar: The task makes it look quite specific to wmf.4, fwiw [13:46:24] (03CR) 10Jbond: [V: 03+2 C: 03+2] sre.puppet.sync-netbox-hiera: update prefix for common data [cookbooks] - 10https://gerrit.wikimedia.org/r/838153 (owner: 10Jbond) [13:46:48] awight, wmf.4 should be enough. Thanks! [13:47:15] awight, Subbu also just confirmed that Parsoid CI is green now. [13:47:29] (03PS1) 10Jbond: P:netbox::data: add profile to load common netbox data [puppet] - 10https://gerrit.wikimedia.org/r/838159 (https://phabricator.wikimedia.org/T310266) [13:47:42] ok, thanks all! 
[13:47:47] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [13:48:20] (03PS1) 10Filippo Giunchedi: sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/838161 (https://phabricator.wikimedia.org/T310266) [13:48:41] jbond: yeah that's going to time out, see also https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/838161 [13:48:45] !log disable Puppet on A:cp and A:eqsin for T309651 [13:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:53] godog: I think that the script is potentially taking longer than 30 seconds to run and so is hitting a timeout [13:49:08] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in eqsin to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838149 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:49:09] my gut feeling is that the times it works are because enough stuff is cached for it to run quicker [13:49:32] indeed: Initiated: 2022-10-04 13:40 Duration: 3 minutes, 34.49 seconds [13:49:33] wow 3.5m is a lot :/ [13:49:37] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=codfw [13:49:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P35347 and previous config saved to /var/cache/conftool/dbconfig/20221004-134947-root.json [13:49:53] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [13:49:55] godog: can you raise a task and I'll look at that later; that seems like a big regression [13:50:15] jbond: for sure, I will [13:51:12] (03Merged) 10jenkins-bot: Log basic nearby and fullscreen events [extensions/Kartographer] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838097
(https://phabricator.wikimedia.org/T315972) (owner: 10Awight) [13:51:15] thanks [13:52:27] (03PS1) 10Jbond: script_proxy: increase timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838162 [13:52:34] jbond: T319299 [13:52:35] T319299: Investigate longer run time for hiera_export netbox script - https://phabricator.wikimedia.org/T319299 [13:52:50] opps sorry :/ [13:53:06] (03PS2) 10Jbond: script_proxy: increase timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838162 [13:53:11] RECOVERY - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is OK: TCP OK - 0.032 second response time on 10.192.16.95 port 9042 https://phabricator.wikimedia.org/T93886 [13:53:14] (03PS2) 10Jbond: P:netbox::data: add profile to load common netbox data [puppet] - 10https://gerrit.wikimedia.org/r/838159 (https://phabricator.wikimedia.org/T319299) [13:53:37] RECOVERY - cassandra-a service on sessionstore2001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:53:40] hrm, wmf.4 may not be deployed yet, but it exists on deploy1002, so those backports should still be pulled at least [13:53:51] RECOVERY - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is OK: SSL OK - Certificate sessionstore2001-a valid until 2023-02-22 11:12:13 +0000 (expires in 140 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:54:04] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: sync [13:54:08] (03CR) 10Jbond: "lgtm will also need https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/838162" [cookbooks] - 10https://gerrit.wikimedia.org/r/838161 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:54:11] (03CR) 10Jbond: [C: 03+1] sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/838161 (https://phabricator.wikimedia.org/T310266) (owner: 
10Filippo Giunchedi) [13:54:19] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [13:54:26] (03CR) 10Jbond: [C: 03+2] script_proxy: increase timeout [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838162 (owner: 10Jbond) [13:55:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw [13:56:01] and wmf.4 exists on mwdebug1002 too [13:56:05] I think those backports need to be synced [13:56:13] even if the train hasn’t rolled out yet [13:56:28] (03CR) 10Jbond: [C: 03+1] sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/838161 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [13:56:33] is it okay if I do that now? cc jbond, godog [13:56:47] (second backport should be merged in a few seconds) [13:57:04] Lucas_WMDE: nothing me or god.og are doing should block the train, please go ahead [13:57:05] (03Merged) 10jenkins-bot: Revert "Introduce LanguageVariantConverter" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838094 (https://phabricator.wikimedia.org/T319282) (owner: 10Daniel Kinzler) [13:57:08] ack, thanks [13:57:30] let’s see if scap backport will let me sync them [13:58:02] nope [13:58:08] manual operation then [13:58:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:59:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:59:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:59:59] syncing Kartographer [14:00:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:01:33] Lucas_WMDE: yes you're all good on my end, thanks for checking [14:02:10] ok! 
[14:02:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [14:03:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.4/extensions/Kartographer/modules/dialog: Backport: [[gerrit:838097|Log basic nearby and fullscreen events (T315972, T318678)]] (no wikis use wmf.4 yet, but the code exists, so the change needs to be synced) (duration: 03m 42s) [14:03:47] T318678: Add show nearby metrics: count clicks on button to open and on links to different articles - https://phabricator.wikimedia.org/T318678 [14:03:47] T315972: Metrics for new Kartographer feature usage - https://phabricator.wikimedia.org/T315972 [14:05:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:06:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:06:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:06:56] (03CR) 10Filippo Giunchedi: [C: 03+1] P:netbox::data: add profile to load common netbox data [puppet] - 10https://gerrit.wikimedia.org/r/838159 (https://phabricator.wikimedia.org/T319299) (owner: 10Jbond) [14:07:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:07:11] (03CR) 10Filippo Giunchedi: [C: 03+2] sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/838161 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [14:08:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.4/includes/: Backport: [[gerrit:838094|Revert "Introduce LanguageVariantConverter" (T319282)]] (1/2; no wikis use wmf.4 yet, but the code exists, so the change needs to be synced) (duration: 03m 43s) [14:08:07] T319282: Language variant conversion broken for page/html endpoints on RESTBase - https://phabricator.wikimedia.org/T319282 [14:09:40] 10SRE, 
10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) This issue was the file was copied first to /tmp before /var/tmp according to @ayounsi so copy the file first to local laptop and use scp to copy the file... [14:11:23] (03CR) 10FNegri: [C: 03+1] p:wmcs::prometheus: overwrite instance with hostname if there (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [14:12:12] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:12:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.4/tests/phpunit/: Backport: [[gerrit:838094|Revert "Introduce LanguageVariantConverter" (T319282)]] (2/2; no wikis use wmf.4 yet, but the code exists, so the change needs to be synced) (duration: 03m 52s) [14:12:35] ok, I should be done [14:12:53] !log filippo@cumin1001 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:13:08] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:14:40] !log netbox - Move VRRP IPs to FHRP group feature - T311218 [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:46] T311218: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 [14:18:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:20:30] (03PS1) 10Snwachukwu: role::common::aqs: update mw history snapshop [puppet] - 10https://gerrit.wikimedia.org/r/838167 [14:21:29] RECOVERY - Backup freshness on 
backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:22:27] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [14:22:59] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) Ran the following, then confirmed that there is no diff after a Homer run. `lang=python,lines=20 import uuid request_id = uuid.uuid4() user = User.objects.get(us... [14:23:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:24:25] jbond volans we're on! the sync-netbox-hiera cookbook ran successfully now, thanks for your help [14:25:01] godog: great! 
I'm a bit worried about the generation time, I'll try to have a look [14:25:18] (03CR) 10David Caro: [C: 03+1] prometheus: Add new scrape target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/836310 (owner: 10Raymond Ndibe) [14:26:34] (03CR) 10David Caro: [V: 03+1 C: 03+2] p:wmcs::prometheus: overwrite instance with hostname if there [puppet] - 10https://gerrit.wikimedia.org/r/838069 (https://phabricator.wikimedia.org/T318650) (owner: 10David Caro) [14:27:00] yeah that's fair, the specific task is T319299 in case you missed it [14:27:01] T319299: Investigate longer run time for hiera_export netbox script - https://phabricator.wikimedia.org/T319299 [14:27:15] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1005.eqiad.wmnet with OS bullseye [14:28:27] (03PS1) 10JMeybohm: k8s: Align formatting along k8s module and profiles [puppet] - 10https://gerrit.wikimedia.org/r/838168 (https://phabricator.wikimedia.org/T307943) [14:28:33] (03PS1) 10JMeybohm: Remove unused mwautopull class [puppet] - 10https://gerrit.wikimedia.org/r/838169 (https://phabricator.wikimedia.org/T284628) [14:29:42] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1006.eqiad.wmnet with OS bullseye [14:29:47] (03CR) 10Jbond: [C: 03+2] P:netbox::data: add profile to load common netbox data [puppet] - 10https://gerrit.wikimedia.org/r/838159 (https://phabricator.wikimedia.org/T319299) (owner: 10Jbond) [14:30:12] !log on going maintenance on msw1-codfw [14:30:14] godog: cool, I have merged the other change so you should be able to use this with: [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:20] include profile::netbox::data [14:30:27] $profile::netbox::data::mgmt [14:31:15] jbond: amazing! 
TYVM, I'll give it a try [14:31:49] np let us know how it goes [14:32:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore2002.codfw.wmnet with reason: Prep for reimage [14:32:15] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore2002.codfw.wmnet with reason: Prep for reimage [14:34:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2002.codfw.wmnet with OS buster [14:38:37] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:40:08] !log installing maven-shared-utils security updates [14:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:51] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) [14:43:34] 10SRE, 10Infrastructure-Foundations, 10netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (10ayounsi) [14:43:41] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) a:03ayounsi [14:43:46] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:43:46] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:43:46] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:43:46] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:43:47] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) 05Open→03Resolved [14:45:16] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:47:48] (03CR) 
10Muehlenhoff: [C: 03+2] Enable Ganeti 3 on ganeti/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/838116 (https://phabricator.wikimedia.org/T311687) (owner: 10Muehlenhoff) [14:47:52] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:48:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage [14:48:45] (JobUnavailable) firing: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:14] (03PS1) 10Ayounsi: Remove "old" VRRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) [14:50:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2002.codfw.wmnet with reason: host reimage [14:50:57] (03PS2) 10Ayounsi: Remove "old" VRRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) [14:51:16] (03PS1) 10Muehlenhoff: Remove profile::ganeti::ganeti3 setting [puppet] - 10https://gerrit.wikimedia.org/r/838172 [14:51:35] !log disable Puppet on A:cp and A:esams for T309651 [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [14:52:23] (03PS3) 10Ayounsi: Remove "old" VRRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) [14:52:55] (03CR) 10Jbond: [C: 03+1] k8s: Align formatting along k8s module and profiles [puppet] - 
10https://gerrit.wikimedia.org/r/838168 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:53:59] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.51 ms [14:53:59] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [14:54:15] (03CR) 10David Caro: Add cookbook to restart openstack services (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [14:54:17] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.57 ms [14:54:23] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.57 ms [14:54:27] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [14:54:49] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:51] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in esams to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838154 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:55:23] !log maintenance complete on msw1-codfw [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] PROBLEM - Check nf_conntrack usage in neutron netns on cloudnet1005 is CRITICAL: CRITICAL: no netns defined? 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:58:29] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye [14:59:06] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) [14:59:13] (JobUnavailable) resolved: (2) Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:20] !log installing snakeyaml security updates [15:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:02:53] 10SRE, 10observability: Overlap between "check systemd state" alert and "check unit status of " - https://phabricator.wikimedia.org/T319304 (10fgiunchedi) [15:03:05] (03CR) 10Vgutierrez: [C: 03+1] hiera: upgrade cp hosts in eqiad to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838155 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:03:22] (03PS3) 10AOkoth: vrts: enable vrts-daemon on WMCS instance [puppet] - 10https://gerrit.wikimedia.org/r/834510 (https://phabricator.wikimedia.org/T317059) [15:03:24] (03PS1) 10AOkoth: admin: remove old ssh key for jnuche [puppet] - 10https://gerrit.wikimedia.org/r/838175 (https://phabricator.wikimedia.org/T319274) [15:03:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:03:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:03:51] (03PS2) 10AOkoth: admin: remove old ssh key for jnuche [puppet] - 10https://gerrit.wikimedia.org/r/838175 (https://phabricator.wikimedia.org/T319274) [15:05:57] !log mwdebug-deploy@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:06:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=codfw [15:08:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) >>! In T317247#8283057, @ayounsi wrote: > From diffscan: > ` > STATUS HOST PORT PROTO OPREV CPREV DNS > OPEN 198.35.26.7 22 tcp 0 6 dns4003.wikimedia.org > ` > That hos... [15:08:48] (03CR) 10Milimetric: [C: 03+1] role::common::aqs: update mw history snapshop [puppet] - 10https://gerrit.wikimedia.org/r/838167 (owner: 10Snwachukwu) [15:08:57] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: sync [15:09:06] 10SRE, 10Domains, 10Traffic-Icebox: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10BCornwall) Untagging the Traffic team: While we're happy to help out when this is needed, this currently appears to be more of a discussion with other teams since we are unable by poli... 
[15:09:15] 10SRE, 10Domains: wikibase.org should redirect to wikiba.se - https://phabricator.wikimedia.org/T254957 (10BCornwall) [15:09:17] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [15:10:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2002.codfw.wmnet with OS buster [15:10:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw [15:10:24] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on sessionstore2003.codfw.wmnet with reason: Prep for reimage [15:10:37] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on sessionstore2003.codfw.wmnet with reason: Prep for reimage [15:11:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:11:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:11:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:12:13] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Horsey - https://phabricator.wikimedia.org/T318729 (10vyuen) Noting manager approval in case this is necessary [15:12:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:13:24] 10SRE, 10ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (10RobH) >>! In T319235#8283064, @ayounsi wrote: > There is an outstanding diff for cr3-ulsfo: > `lang=diff > [edit interfaces fxp0] > - description "Core: msw1-ulsfo:12 {#1021}"; > ` > This seems to be because it lost its cable... 
[15:16:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host sessionstore2003.codfw.wmnet with OS buster [15:16:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Remove old production ssh key for RelEng user - https://phabricator.wikimedia.org/T319274 (10Arnoldokoth) 05Open→03In progress a:03Arnoldokoth [15:17:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Remove old production ssh key for RelEng user - https://phabricator.wikimedia.org/T319274 (10Arnoldokoth) p:05Triage→03Medium [15:21:50] 10SRE, 10ops-ulsfo: swap msw1-ulsfo - https://phabricator.wikimedia.org/T319235 (10RobH) [15:25:25] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:25:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:29:25] 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Jgreen) [15:29:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage [15:30:25] 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Jgreen) p:05Triage→03Unbreak! Marking "Unbreak Now!" because this is blocking us from investigating why frdb2001 failed to recover from a reboot. [15:33:15] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2003.codfw.wmnet with reason: host reimage [15:33:19] papaul: ^ is known maintenance right based on your log? 
[15:33:23] The frack task [15:34:45] Oh you logged that ended, maybe not or leftover then [15:35:04] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37431/console" [puppet] - 10https://gerrit.wikimedia.org/r/838169 (https://phabricator.wikimedia.org/T284628) (owner: 10JMeybohm) [15:37:13] RhinosF1: that is another issue I talked with Jeff thanks [15:38:08] papaul: ok [15:42:01] (03PS1) 10Ebernhardson: envoy: Add service proxys for cirrussearch read traffic [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) [15:42:56] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS bullseye [15:43:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye [15:47:55] !log disable Puppet on A:cp and A:eqiad for T309651 [15:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:59] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [15:48:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore,name=codfw [15:50:54] (03CR) 10Zabe: vcl: stop overriding cache-control header for bad title errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/837742 (https://phabricator.wikimedia.org/T316932) (owner: 10Zabe) [15:51:54] !log restarting `/usr/bin/scap stage-train --yes auto` after failed staging (T314193), cc: ^demon [15:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:59] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [15:53:03] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: sync 
[15:53:06] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838185 (https://phabricator.wikimedia.org/T314193) [15:53:08] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838185 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [15:53:22] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [15:53:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw [15:53:56] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838185 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [15:54:17] !log brennen@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.4 refs T314193 [15:54:34] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2003.codfw.wmnet with OS buster [15:56:07] <^demon> brennen: ack'd [15:56:13] <^demon> What went wrong? [15:57:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:58:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37432/console" [puppet] - 10https://gerrit.wikimedia.org/r/838168 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:58:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/838175 (https://phabricator.wikimedia.org/T319274) (owner: 10AOkoth) [15:59:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:59:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:00:05] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:00:31] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [16:03:12] (03CR) 10AOkoth: [C: 03+2] admin: remove old ssh key for jnuche [puppet] - 10https://gerrit.wikimedia.org/r/838175 (https://phabricator.wikimedia.org/T319274) (owner: 10AOkoth) [16:03:54] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [16:03:55] (03PS1) 10Ayounsi: LibreNMS report: ignore licenses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838187 [16:05:23] (03CR) 10AOkoth: [C: 03+2] vrts: enable vrts-daemon on WMCS instance [puppet] - 10https://gerrit.wikimedia.org/r/834510 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:06:37] (03PS1) 10Ayounsi: Use license keys stored in Netbox instead of homer-private [homer/public] - 10https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) [16:09:33] (03PS7) 10Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [16:10:41] 10SRE, 10ops-codfw, 10DC-Ops: Frack codfw management network issue, many DRACs inaccessible - https://phabricator.wikimedia.org/T319311 (10Papaul) @Jgreen removing the power on the fmsw fixed the issue. please check if you can login to the servers and let me know. 
Thanks [16:12:36] (03CR) 10Ayounsi: [V: 03+1] "Tested in netbox-next" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838187 (owner: 10Ayounsi) [16:14:18] (03PS8) 10Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [16:15:02] (03CR) 10Ottomata: "Just curious, what is this for? 'eventschemas cluster' is nothing but a static nginx http fileserver." [cookbooks] - 10https://gerrit.wikimedia.org/r/836181 (owner: 10Muehlenhoff) [16:15:49] (03CR) 10CI reject: [V: 04-1] victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [16:17:01] (03PS9) 10Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [16:20:19] (03CR) 10Herron: victorps.py: add print_weekly_schedule command (033 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [16:21:30] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye completed: - dns4003 (**FAIL**) - Removed... [16:21:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4003.wikimedia.org with OS bullseye executed with errors: - dns4003 (**FAIL**)... 
[16:23:13] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.4 refs T314193 (duration: 28m 55s) [16:23:58] (03PS1) 10SBassett: Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) [16:24:12] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [16:24:45] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) So the failure is just for the script results, and its refusing proxy connection to that url, which has since started to work. All items were processed, dns4003 is rea... [16:25:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [16:25:18] !log brennen@deploy1002 Pruned MediaWiki: 1.40.0-wmf.2 (duration: 02m 02s) [16:25:52] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) Sorry, you also asked about servers; for probably-non-invasive tests (reboots &c), y... 
[16:26:54] (03CR) 10Ayounsi: [V: 03+1] "tested locally" [homer/public] - 10https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: 10Ayounsi) [16:30:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:33:48] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:35:03] (03PS11) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [16:36:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:58] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37433/console" [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [16:37:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:37:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:39:07] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be20[28-39].codfw.wmnet - https://phabricator.wikimedia.org/T318689 (10Papaul) [16:39:48] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be20[28-39].codfw.wmnet - https://phabricator.wikimedia.org/T318689 (10Papaul) 05Open→03Resolved complete [16:43:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:45:55] (03PS11) 10Dzahn: gerrit: decouple scap and daemon users [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [16:48:17] (03CR) 10Hnowlan: [V: 03+1] "pcc package list is a little busy to read so this change removes no packages but adds the following as part of our standard webserver conf" [puppet] - 10https://gerrit.wikimedia.org/r/576913 
(https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [16:49:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:55:05] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:56:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:56:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:57:50] (03PS1) 10D3r1ck01: ParsoidHandler: use metrics from SiteConfig [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838104 [17:01:35] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "lgtm and compiler output as well: https://puppet-compiler.wmflabs.org/pcc-worker1002/37435/" [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [17:02:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:03:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "the templates under /var/lib/gerrit2/review_site/etc/its/templates/ became root-owned but can still be read by any user and that was the " [puppet] - 10https://gerrit.wikimedia.org/r/832345 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [17:04:49] !log gerrit - deployed 832345 - scap and daemon users became decoupled (T317412) [17:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:53] T317412: Automate Gerrit deployment steps - https://phabricator.wikimedia.org/T317412 [17:05:16] (03PS7) 10Dzahn: gerrit: change deployment user on devtools [puppet] - 10https://gerrit.wikimedia.org/r/832507 (owner: 10Hashar) [17:06:32] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: upgrade cp hosts in eqiad to ATS9 [puppet] - 10https://gerrit.wikimedia.org/r/838155 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:07:41] 10SRE, 10SRE-Access-Requests: 
Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) Hey, @Ottomata You are listed as one of the approvers for members joining this list. Kindly approve. [17:10:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "for some reason the devtools instance is unknown in compiler again so I will just show it's noop in prod and go with it: https://puppet-co" [puppet] - 10https://gerrit.wikimedia.org/r/832507 (owner: 10Hashar) [17:13:10] (03CR) 10Dzahn: gerrit: make homedir variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:13:52] (03PS1) 10Urbanecm: Mentee table: fix wrong less import [extensions/GrowthExperiments] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838105 (https://phabricator.wikimedia.org/T319321) [17:13:56] (03CR) 10Dzahn: gerrit: make homedir variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:14:10] (03PS1) 10Urbanecm: Mentee table: fix wrong less import [extensions/GrowthExperiments] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/838206 (https://phabricator.wikimedia.org/T319321) [17:16:07] (03PS5) 10Dzahn: gerrit: make homedir variable [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:18:29] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:37] (03Abandoned) 10Urbanecm: Mentee table: fix wrong less import [extensions/GrowthExperiments] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/838206 (https://phabricator.wikimedia.org/T319321) (owner: 10Urbanecm) [17:23:08] (03PS1) 10Ssingh: dns4003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/838203 (https://phabricator.wikimedia.org/T317247) [17:24:20] (03CR) 10Brennen Bearnes: [C: 03+1] Add registry.gitlab.com/security-products/**/* as allowed images [puppet] - 
10https://gerrit.wikimedia.org/r/838194 (https://phabricator.wikimedia.org/T312961) (owner: 10SBassett) [17:25:11] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:40] rzl: This is the force_php72=1 that is failing because it gets 7.4 instead [17:27:43] ^^^ [17:27:44] ^ ah, that's the `?force_php72=1` test getting `X-Powered-By: PHP/7.4.30` now [17:27:46] ahahaha [17:27:49] rotfl [17:27:54] yeah I'll send a patch [17:27:58] thx <3 [17:28:48] !log removing 4 files for legal compliance [17:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:29:47] i added a new train blocker (T319321), which has a patch now. i'll backport the fix unless anyone has any objections. [17:29:48] T319321: Mentee table does not load - https://phabricator.wikimedia.org/T319321 [17:30:02] (fyi ^demon as the train conductor) [17:30:27] <^demon> Ack [17:31:18] volans: huh, it's still there in the apache config so I'm not sure what the state of the transition is, but I'll mail this patch anyway and maybe someone will correct me :P [17:31:43] no objections => hitting the deployment buttons. [17:31:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838105 (https://phabricator.wikimedia.org/T319321) (owner: 10Urbanecm) [17:31:55] rzl: lol, it's there but doesn't work? [17:32:38] yeah the most recent change to enable 7.4 everywhere must have been at a different config layer(?)
and we haven't ripped the rest out yet [17:32:49] I haven't been following super closely [17:34:18] rzl: AFAIK that's done at the MW config layer (see wgWMENewPHPVersion and wgWMENewPHPSamplingRate in operations/mediawiki-config:wmf-config/InitialiseSettings.php [17:34:19] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37439/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:35:07] MW converts that into the PHP_ENGINE cookie [17:37:18] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/838187 (owner: 10Ayounsi) [17:38:49] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/838188 (https://phabricator.wikimedia.org/T311008) (owner: 10Ayounsi) [17:39:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:40:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:42:01] urbanecm: ah of course, thanks [17:42:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:42:41] (03CR) 10BCornwall: [C: 03+2] dns4003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/838203 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [17:42:54] (03CR) 10BCornwall: [C: 03+1] dns4003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/838203 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [17:43:33] no problem [17:43:33] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:45:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) [17:45:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:46:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) [17:46:35] PROBLEM - Confd vcl based reload on cp1086 is CRITICAL: reload-vcl failed to run since 0h, 4 minutes. https://wikitech.wikimedia.org/wiki/Varnish [17:47:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) ssh key for Haroon Shaikh: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDAV4F2rDDJC9NsjkZ7Vz9zoe6d+wd0/IUVhKFhlbbja07vuuufg5qa5Im+gHpDy9exPWpeuwg5fWDY9CvHvMBAn... 
[17:49:57] (03Merged) 10jenkins-bot: Mentee table: fix wrong less import [extensions/GrowthExperiments] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838105 (https://phabricator.wikimedia.org/T319321) (owner: 10Urbanecm) [17:50:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) I approve this request for Prabhat @prabhat [17:50:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:838105|Mentee table: fix wrong less import (T319321)]] [17:50:24] T319321: Mentee table does not load - https://phabricator.wikimedia.org/T319321 [17:50:50] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:838105|Mentee table: fix wrong less import (T319321)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [17:51:46] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10ssingh) [17:51:57] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) [17:52:28] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) 05Open→03Resolved ` sukhe@cumin2002:~$ sudo cumin 'A:cp' '/usr/bin/traffic_server --version' 92 hosts will be targeted: cp[2027-2042].codfw.wmnet,cp[6001... 
[17:54:15] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "compiler shows File[/etc/default/gerrit] is being modified,, verifying on gerrit2002 first" [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:55:13] !log installing libsndfile security updates [17:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "ok, it's just the "no newline at end of file" that changed it" [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [17:56:13] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:57:19] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:838105|Mentee table: fix wrong less import (T319321)]] (duration: 06m 58s) [17:57:23] T319321: Mentee table does not load - https://phabricator.wikimedia.org/T319321 [17:58:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:58:51] (03PS1) 10Muehlenhoff: Add dedicated Phab Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/838237 [18:00:04] ^demon and brennen: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T1800). [18:00:19] (03CR) 10Dzahn: gerrit: use daemon_user variable everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [18:01:48] (03PS5) 10Dzahn: gerrit: use daemon_user variable everywhere [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [18:02:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:02:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:02:28] (03CR) 10Dzahn: [C: 03+1] "agreed! 
by service name makes more sense" [puppet] - 10https://gerrit.wikimedia.org/r/838237 (owner: 10Muehlenhoff) [18:03:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:04:05] (03CR) 10Dzahn: [C: 03+1] "gerrit/gitlab/contint etc already have their own aliases. combining them makes little sense" [puppet] - 10https://gerrit.wikimedia.org/r/838237 (owner: 10Muehlenhoff) [18:04:39] looks like both outstanding train blockers have been fixed? [18:06:07] urbanecm: ^ [18:06:25] T319321 should be fixed now! [18:06:25] T319321: Mentee table does not load - https://phabricator.wikimedia.org/T319321 [18:07:56] (03PS1) 10Ssingh: sites.yaml: add dns4003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) [18:08:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10prabhat) public key for Prabhat: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGGlHTnlWPgIVAiGrqPSJTUN6+5WoaWjNta+RoZ42Qmv prabhat@wmf2921 [18:09:14] (03CR) 10Muehlenhoff: [C: 03+2] Add dedicated Phab Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/838237 (owner: 10Muehlenhoff) [18:09:16] (03CR) 10Dzahn: [V: 03+1] "compiled and looks alright https://puppet-compiler.wmflabs.org/pcc-worker1003/37440/" [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [18:09:30] cool, thx. [18:09:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10Ottomata) Approved for both! [18:09:32] ^demon: about? [18:12:28] <^demon> I am, yeah [18:12:30] (03CR) 10Ssingh: [C: 03+2] dns4003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/838203 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [18:12:56] cool cool. train looks good to go. 
[18:14:49] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838240 (https://phabricator.wikimedia.org/T314193) [18:14:51] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838240 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [18:15:38] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838240 (https://phabricator.wikimedia.org/T314193) (owner: 10TrainBranchBot) [18:19:43] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.4 refs T314193 [18:19:48] T314193: 1.40.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T314193 [18:21:00] !log installing gdk-pixbuf security updates [18:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:24:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:24:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:24:15] !log removing 1 file for legal compliance [18:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:28:33] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:28:50] (03PS1) 10Ssingh: hiera: rename dns4001.yaml (decomm) to dns4003.yaml (active) [puppet] - 10https://gerrit.wikimedia.org/r/838242 (https://phabricator.wikimedia.org/T317247) [18:29:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4003.wikimedia.org with OS buster [18:30:02] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster [18:30:17] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:33:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gerrit: use daemon_user variable everywhere [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [18:34:15] !log removing 1 file for legal compliance [18:34:16] !log gerrit - deploying puppet refactoring change [18:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:09] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:37:05] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "Bug: T317412" [puppet] - 10https://gerrit.wikimedia.org/r/832507 (owner: 10Hashar) [18:37:20] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "Bug: T317412" [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [18:37:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "Bug: T317412" [puppet] - 10https://gerrit.wikimedia.org/r/833379 (owner: 10Hashar) [18:38:58] (03CR) 10Ssingh: [C: 03+2] hiera: rename dns4001.yaml (decomm) to dns4003.yaml (active) [puppet] - 10https://gerrit.wikimedia.org/r/838242 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [18:40:15] (03CR) 10CI 
reject: [V: 04-1] ats: Alert on high connection/request count [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:47:20] 10SRE, 10Performance-Team (Radar): Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Krinkle) Moving to Radar. This remains of interest for awareness, but we're not actively pushing for it as the vast majority of loaded images are Commons-hosted. [18:48:06] (03CR) 10BCornwall: "It looks like testing is looking for specific labels/annotations for every list entry under groups['rules'] rather than specifically alert" [alerts] - 10https://gerrit.wikimedia.org/r/830950 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [18:48:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [18:49:41] (03CR) 10CDanis: C:varnish: Rate limit hotlinking dry-run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (https://phabricator.wikimedia.org/T317799) (owner: 10Jbond) [18:51:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4003.wikimedia.org with reason: host reimage [18:59:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed carefully, complete noop confirmed on both servers" [puppet] - 10https://gerrit.wikimedia.org/r/833385 (owner: 10Hashar) [19:01:36] 10SRE, 10Editing-team, 10Fundraising-Backlog, 10Platform Engineering, and 5 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [19:02:36] 10SRE, 10Traffic-Icebox: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (10BCornwall) 05Open→03Stalled Hi, @Bogdan! Thanks for the report. There have been numerous changes in the stack since you've reported this; Would you... 
[19:03:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:00] PROBLEM - Recursive DNS on 198.35.26.7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:10:26] ^ yeah this should be fixed shortly [19:10:34] hopefully™ [19:11:16] (03PS1) 10Bking: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) [19:12:03] (03CR) 10CI reject: [V: 04-1] elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:13:17] thanks! [19:14:50] (03PS2) 10Ryan Kemper: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:15:38] (03CR) 10CI reject: [V: 04-1] elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:15:58] 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 (10CDanis) Here's my jupyter notebook with a rough analysis of a very impactful hotlink incident (on 2022-09-13) and our biggest organic traffic surge to date (Queen... 
[19:17:02] a/win 14 [19:17:06] (03PS3) 10Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [19:18:43] (03PS3) 10Bking: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) [19:19:31] (03CR) 10CI reject: [V: 04-1] elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:21:34] (03PS3) 10Raymond Ndibe: prometheus: Add new scrape target [puppet] - 10https://gerrit.wikimedia.org/r/836310 [19:22:12] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:7 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [19:23:01] ^ known [19:24:38] (03PS4) 10Bking: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) [19:24:40] (03CR) 10Raymond Ndibe: prometheus: Add new scrape target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/836310 (owner: 10Raymond Ndibe) [19:24:50] (03PS5) 10Bking: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) [19:25:08] (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:28:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [19:33:41] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37442/gitlab-runner1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [19:34:59] (KubernetesAPILatency) 
firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:35:47] 10SRE-swift-storage, 10Arc-Lamp, 10Performance-Team, 10Patch-For-Review: Swift container for performance flame graphs (ArcLamp) - https://phabricator.wikimedia.org/T244776 (10Krinkle) 05Open→03Resolved a:03dpifke Mostly done. As part of the parent task, some of this will end up removed as plans have... [19:39:20] (03CR) 10Dzahn: P:gitlab::runner: Provide proxy variables to runner jobs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/833125 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [19:41:22] (03PS1) 10Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - 10https://gerrit.wikimedia.org/r/838253 [19:42:52] (03CR) 10Jbond: reqconfig: add ip validation for ipblocks (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [19:44:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:50:52] (03CR) 10Dzahn: "Is this intended to be merged at any time or did you want some level of sync for when it happens?" 
[puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [19:53:26] (03PS1) 10Ryan Kemper: [wip] logstash: remove old files [puppet] - 10https://gerrit.wikimedia.org/r/838255 [19:54:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10lanebecker) Approved for @HShaikh [19:54:50] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4003.wikimedia.org with OS buster [19:54:56] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4003.wikimedia.org with OS buster executed with errors:... [19:56:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [19:57:11] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:59:11] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.133 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221004T2000). [20:00:05] xSavitar, Aishik, and MdsShakil: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:45] (03PS1) 10Ssingh: P:dns::auth:dotls: ensure ordering for sysd_glue [puppet] - 10https://gerrit.wikimedia.org/r/838257 [20:01:28] (03PS2) 10Ssingh: P:dns::auth:dotls: ensure ordering for sysd_glue [puppet] - 10https://gerrit.wikimedia.org/r/838257 [20:01:30] o/ [20:01:42] I can deploy o/ [20:01:58] xSavitar: starting with yours [20:02:34] (03PS7) 10Clare Ming: Add wordmark and tagline for Bengali Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:02:49] (03CR) 10Ebernhardson: "these should be removed from all jvm.options files" [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [20:03:21] (03CR) 10CI reject: [V: 04-1] Add wordmark and tagline for Bengali Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:03:25] (03CR) 10BCornwall: [C: 03+1] P:dns::auth:dotls: ensure ordering for sysd_glue [puppet] - 10https://gerrit.wikimedia.org/r/838257 (owner: 10Ssingh) [20:03:57] cjming, okay, thanks [20:04:14] (03CR) 10Ssingh: [C: 03+2] P:dns::auth:dotls: ensure ordering for sysd_glue [puppet] - 10https://gerrit.wikimedia.org/r/838257 (owner: 10Ssingh) [20:04:34] xSavitar: to make CI happy, are you able to push up a quick fix or would you like me to handle it? [20:05:55] cjming, this is the patch, right: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838104? Or did I schedule the wrong patch? [20:06:37] Is CI unhappy with the patch? Or am I missing something? [20:06:44] cc @duesen ^^ [20:06:56] oh - the patch on the deployment calendar is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/838207 [20:07:04] is this not the correct patch? ^^ [20:07:46] No, let me update it quickly, sorry :( [20:07:56] xSavitar: no worries! would you like to self-deploy?
[20:08:14] this is the revision I made: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2016241&oldid=2016140 [20:08:24] Something happened and the ID changed :( [20:08:26] Let me update [20:09:06] A user changed it here: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=next&oldid=2016271 :( [20:09:30] cjming I reverted the edit, should be good now. [20:09:43] sorry about the inconvenience [20:10:09] np! [20:10:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838104 (owner: 10D3r1ck01) [20:10:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [20:11:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [20:14:58] 10SRE, 10Data Engineering Planning: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10CDanis) [20:16:29] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:16:34] cjming I missed your question about the "self-deploy part". No you can go ahead please. 
Brain is too foggy to deploy now :D [20:17:27] xSavitar: no worries - just waiting for merge - few more mins [20:17:51] \o/ [20:19:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [20:25:05] (03Merged) 10jenkins-bot: ParsoidHandler: use metrics from SiteConfig [core] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/838104 (owner: 10D3r1ck01) [20:25:26] !log cjming@deploy1002 Started scap: Backport for [[gerrit:838104|ParsoidHandler: use metrics from SiteConfig]] [20:25:52] !log cjming@deploy1002 cjming and d3r1ck01: Backport for [[gerrit:838104|ParsoidHandler: use metrics from SiteConfig]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:25:55] xSavitar: should be up on any of the test servers if you want to verify [20:27:04] xSavitar, I am here as well. [20:27:18] thanks subbu [20:29:03] Aishik: if you are around, your patch is up next and it needs a quick fix - are you able to do that or would you like me to take care of it? [20:29:27] xSavitar: shall i sync? [20:29:40] do it for me...... 
[20:30:05] cjming, subbu is testing the patch on officewiki, but you can continue with the next patch on the line to deploy [20:30:33] (03PS8) 10Clare Ming: Add wordmark and tagline for Bengali Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:31:28] xSavitar: sounds good - standing by [20:34:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:35:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:35:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:35:45] cjming, you can go ahead and sync [20:35:51] great - going live [20:36:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:27] thanks! [20:39:56] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:838104|ParsoidHandler: use metrics from SiteConfig]] (duration: 14m 29s) [20:40:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:40:24] xSavitar: np! 
should be live now [20:40:56] cjming thank you :) [20:41:11] (03Abandoned) 10Jdlrobson: Fix page toolbar border [skins/Vector] (wmf/1.40.0-wmf.3) - 10https://gerrit.wikimedia.org/r/836993 (https://phabricator.wikimedia.org/T318952) (owner: 10Jdlrobson) [20:41:16] (03Merged) 10jenkins-bot: Add wordmark and tagline for Bengali Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:41:41] !log cjming@deploy1002 Started scap: Backport for [[gerrit:838207|Add wordmark and tagline for Bengali Wikibooks (T319320)]] [20:41:46] T319320: Add wordmark and tagline for Bengali Wikibooks - https://phabricator.wikimedia.org/T319320 [20:41:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-private-users for Slavina Stefanova - https://phabricator.wikimedia.org/T318807 (10Arnoldokoth) [20:41:57] (03PS1) 10BBlack: p::dns::recursor: fix anycast->pdns_rec dep [puppet] - 10https://gerrit.wikimedia.org/r/838263 [20:42:04] !log cjming@deploy1002 cjming and aishik: Backport for [[gerrit:838207|Add wordmark and tagline for Bengali Wikibooks (T319320)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:42:07] Aishik: your patch is up on debug servers if you'd like to test [20:45:14] (03CR) 10BBlack: [C: 03+2] p::dns::recursor: fix anycast->pdns_rec dep [puppet] - 10https://gerrit.wikimedia.org/r/838263 (owner: 10BBlack) [20:45:21] Whoah! One thing has gone wrong! 
[20:45:46] Aishik: i'm guessing i should not sync [20:45:55] /Wikibooks [20:45:55]   'bnwikibooks' => [ // T319320 [20:45:56]   'src' => '/static/images/mobile/copyright/wikibooks-tagline-bn.svg', [20:45:56]   'width' => 110, [20:45:57]   'height' => 15, [20:45:57]  ], [20:46:14] here height should be 25 [20:46:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:03] (03PS1) 10BBlack: Revert "p::dns::recursor: fix anycast->pdns_rec dep" [puppet] - 10https://gerrit.wikimedia.org/r/838208 [20:47:08] Aishik: i'm going to revert [20:47:16] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "p::dns::recursor: fix anycast->pdns_rec dep" [puppet] - 10https://gerrit.wikimedia.org/r/838208 (owner: 10BBlack) [20:47:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:01] !log cjming@deploy1002 Sync cancelled. [20:49:37] (03PS1) 10Aishik Rehman: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838209 [20:50:00] (03PS1) 10TrainBranchBot: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838264 [20:50:02] (03CR) 10TrainBranchBot: "cjming@deploy1002 created a revert of this change as Iff7a9f388619c6dd454a562edb9e34b054ea09ab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838207 (https://phabricator.wikimedia.org/T319320) (owner: 10Aishik Rehman) [20:50:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838264 (owner: 10TrainBranchBot) [20:51:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:52:24] (03PS2) 10Aishik Rehman: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/838209 [20:52:29] (03Merged) 10jenkins-bot: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838264 (owner: 10TrainBranchBot) [20:52:55] !log cjming@deploy1002 Started scap: Backport for [[gerrit:838264|Revert "Add wordmark and tagline for Bengali Wikibooks"]] [20:53:18] !log cjming@deploy1002 cjming and trainbranchbot: Backport for [[gerrit:838264|Revert "Add wordmark and tagline for Bengali Wikibooks"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:53:49] Aishik: can you check a debug server now for the revert? [20:53:54] (03PS3) 10Aishik Rehman: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838209 [20:55:02] (03CR) 10Clare Ming: "hi Aishik: i reverted the original change via scap backport so this can be abandoned" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838209 (owner: 10Aishik Rehman) [20:55:16] cjming: I have checked it, it looks fine. [20:56:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:56:07] Aishik: if you want to revert the revert with the proper height, feel free and we can try again after i deploy the next patch [20:56:22] thanks MdsShakil - shall we move onto your patch now? 
[20:56:30] cjming: please [20:56:49] (03PS4) 10Clare Ming: Enable wgMinervaEnableSiteNotice for bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838101 (https://phabricator.wikimedia.org/T319317) (owner: 10MdsShakil) [20:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:57:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:58:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:58:47] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:59:30] (03PS4) 10Aishik Rehman: Revert "Add wordmark and tagline for Bengali Wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838209 [20:59:30] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:838264|Revert "Add wordmark and tagline for Bengali Wikibooks"]] (duration: 06m 35s) [20:59:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838101 (https://phabricator.wikimedia.org/T319317) (owner: 10MdsShakil) [21:00:35] (03CR) 10Andrew Bogott: [C: 03+2] Dumps: remove ensure->absent clause [puppet] - 10https://gerrit.wikimedia.org/r/837677 (owner: 10Andrew Bogott) [21:00:54] (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838101 (https://phabricator.wikimedia.org/T319317) (owner: 10MdsShakil) [21:01:19] !log cjming@deploy1002 Started scap: Backport for [[gerrit:838101|Enable wgMinervaEnableSiteNotice for bnwikibooks (T319317)]] [21:01:23] T319317: Enable wgMinervaEnableSiteNotice for bnwikibooks - https://phabricator.wikimedia.org/T319317 [21:01:42] !log cjming@deploy1002 cjming and mdsshakil: Backport for [[gerrit:838101|Enable wgMinervaEnableSiteNotice for bnwikibooks 
(T319317)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:01:44] MdsShakil: can you test on a debug server? [21:01:59] (03PS1) 10Andrew Bogott: wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) [21:02:20] cjming: looks fine [21:02:32] (03CR) 10CI reject: [V: 04-1] wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott) [21:03:02] MdsShakil: great - syncing [21:03:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:03:21] Not a p.age, but I have opened an incident status doc: https://docs.google.com/document/d/1KUfVX9-tymmhbWGj0f7_rUTX3ym6RJ7FQ61SOezQcNk/edit and https://phabricator.wikimedia.org/T319346 [21:03:59] All users are currently unable to spawn new jupyterhub servers, which is adversely affecting the product analytics team among others. 
[21:04:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:04:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:05:17] (03PS1) 10Clare Ming: Revert "Revert "Add wordmark and tagline for Bengali Wikibooks"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838210 [21:05:30] (03PS2) 10Clare Ming: Revert "Revert "Add wordmark and tagline for Bengali Wikibooks"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838210 [21:05:36] (03PS2) 10Andrew Bogott: wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) [21:06:59] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:838101|Enable wgMinervaEnableSiteNotice for bnwikibooks (T319317)]] (duration: 05m 40s) [21:07:04] T319317: Enable wgMinervaEnableSiteNotice for bnwikibooks - https://phabricator.wikimedia.org/T319317 [21:07:20] MdsShakil: your patch should be live [21:07:32] cjming: Thank you [21:07:49] Aishik: sorry for confusion -- let's go with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/838210 because i can't seem to rebase your follow up patch [21:08:05] (03PS1) 10BBlack: P::dns::recursor: fix anycast svc dep issue [puppet] - 10https://gerrit.wikimedia.org/r/838266 [21:08:22] Aishik: can you add the proper height in a follow up commit to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/838210 and i will deploy that? [21:08:39] wait a minute...... 
[21:10:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:10:47] (03CR) 10BBlack: [C: 03+2] P::dns::recursor: fix anycast svc dep issue [puppet] - 10https://gerrit.wikimedia.org/r/838266 (owner: 10BBlack) [21:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:11:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:11:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:43] (03PS1) 10BBlack: Revert "P::dns::recursor: fix anycast svc dep issue" [puppet] - 10https://gerrit.wikimedia.org/r/838211 [21:12:53] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "P::dns::recursor: fix anycast svc dep issue" [puppet] - 10https://gerrit.wikimedia.org/r/838211 (owner: 10BBlack) [21:13:11] Aishik: are you still with me? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/838210 just needs the proper height for the tagline -- can you push up a follow up commit? [21:13:33] Sorry! the patch doesn't loading [21:13:39] or I can - is it 25 for the tagline svg? [21:14:05] sorry! again! my bad! the wordmark height should be 25 [21:14:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:04] I am totally confused today! just sure the height of wordmark is 25 [21:16:00] ok - i'm pushing up the commit - please +1 if it lgtu [21:16:20] (03PS3) 10Clare Ming: Revert "Revert "Add wordmark and tagline for Bengali Wikibooks"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838210 [21:17:09] Aishik: oh - are you not able to see the patch? 
[21:17:41] i updated the height of the wordmark to 25 [21:17:55] Thanks. [21:17:58] shall we try deploying again? [21:18:08] But my browser says Error code: Out of Memory [21:18:42] Sure, deploy it..... [21:18:57] huh - will you be able to test this? [21:19:15] Yeap! [21:19:25] ok - trying again - 1 sec [21:19:29] (03PS11) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [21:19:35] WikimediaDebug is enabled [21:19:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838210 (owner: 10Clare Ming) [21:20:26] (03Merged) 10jenkins-bot: Revert "Revert "Add wordmark and tagline for Bengali Wikibooks"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838210 (owner: 10Clare Ming) [21:20:49] !log cjming@deploy1002 Started scap: Backport for [[gerrit:838210|Revert "Revert "Add wordmark and tagline for Bengali Wikibooks""]] [21:21:13] !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:838210|Revert "Revert "Add wordmark and tagline for Bengali Wikibooks""]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:21:16] Aishik: can you verify new wordmark/tagline for bnwikibooks on a debug server? [21:21:50] Its ok now! [21:21:59] yay! great - then syncing [21:22:38] Sorry for the trouble I caused [21:22:42] 😣 [21:23:06] no worries! 
glad it got sorted out [21:24:44] (03PS1) 10Ebernhardson: cirrus: remove cross-dc poolcounter increases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838269 [21:24:46] (03PS1) 10Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) [21:24:48] (03PS1) 10Ebernhardson: [WIP] Use discovery dns for cirrus read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) [21:24:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:25:28] (03CR) 10Jbond: "ready for review with open question in comments" [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [21:25:47] (03CR) 10CI reject: [V: 04-1] cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:25:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:25:56] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:838210|Revert "Revert "Add wordmark and tagline for Bengali Wikibooks""]] (duration: 05m 06s) [21:25:56] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10BBlack) {F35546970} [21:26:12] (03CR) 10CI reject: [V: 04-1] [WIP] Use discovery dns for cirrus read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:26:41] Aishik: your change should be live now [21:26:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:27:48] Its perfectly working...... 
Thanks again ............. (: [21:28:04] np! [21:28:07] !log end of UTC late backport window [21:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:26] (03PS1) 10Ebernhardson: beta: Set shard count for commonswiki_file to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838272 (https://phabricator.wikimedia.org/T316711) [21:36:45] (03PS2) 10Ebernhardson: [WIP] Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) [21:37:48] (03CR) 10CI reject: [V: 04-1] [WIP] Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [21:51:52] (03PS12) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [22:07:59] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:13:23] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1034, phab1004, releases1002, releases2002, stat1004, stat1005, stat1007, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_ru [22:13:23] s [22:18:11] (03PS3) 10Andrew Bogott: wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) [22:19:20] (03PS4) 10Andrew Bogott: wmcs haproxy: prepare for IP and user-agent blocking [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) [22:23:11] (03CR) 
10Andrew Bogott: "pcc output: https://puppet-compiler.wmflabs.org/pcc-worker1001/37444/cloudcontrol1005.wikimedia.org/index.html" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [22:23:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:24:29] (03CR) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [22:25:10] (03CR) 10Andrew Bogott: "pcc output: https://puppet-compiler.wmflabs.org/pcc-worker1001/37444/cloudcontrol1005.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/838265 (https://phabricator.wikimedia.org/T319313) (owner: 10Andrew Bogott) [22:30:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-ro.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-swift-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:13] PROBLEM - Confd template for /var/lib/gdnsd/discovery-inference.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-inference.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:13] PROBLEM - Confd template for /var/lib/gdnsd/discovery-sessionstore.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-sessionstore.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:13] PROBLEM - Confd template for 
/var/lib/gdnsd/discovery-netbox.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-netbox.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:13] PROBLEM - Confd template for /var/lib/gdnsd/discovery-apt.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-apt.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:13] PROBLEM - Confd template for /var/lib/gdnsd/discovery-proton.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-proton.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:14] PROBLEM - Confd template for /var/lib/gdnsd/discovery-apple-search.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-apple-search.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:14] PROBLEM - Confd template for /var/lib/gdnsd/discovery-shellbox-syntaxhighlight.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-shellbox-syntaxhighlight.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:15] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventstreams-internal.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventstreams-internal.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:15] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wcqs.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-wcqs.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:16] PROBLEM - Confd template for /var/lib/gdnsd/discovery-thanos-swift.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-thanos-swift.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:19] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns4003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist 
https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [22:30:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-mwdebug.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-mwdebug.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-toolhub.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:21] PROBLEM - Check systemd state on dns4003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens3f0np0.service,prometheus-nic-firmware-textfile.service,prometheus_gdnsd_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wikifeeds.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-wikifeeds.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventstreams.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventstreams.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-rw.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-appservers-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-shellbox-media.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-shellbox-media.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-puppetdb-api.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-puppetdb-api.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:33] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-staging.state on dns4003 
is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-staging.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:33] PROBLEM - Confd template for /var/lib/gdnsd/discovery-similar-users.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-similar-users.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift-rw.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-swift-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-releases.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-releases.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-echostore.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-echostore.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-kartotherian.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-kartotherian.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:49] PROBLEM - Auth DNS on dns4003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [22:30:51] PROBLEM - Confd template for /var/lib/gdnsd/discovery-citoid.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-citoid.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:51] PROBLEM - Confd template for /var/lib/gdnsd/discovery-push-notifications.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-push-notifications.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:51] PROBLEM - Confd template for /var/lib/gdnsd/discovery-zotero.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-zotero.state 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-shellbox.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-shellbox.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:59] PROBLEM - AuthDNS-over-TLS Works on dns4003 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [22:30:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-restbase.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-restbase.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-mathoid.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-mathoid.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-api-gateway.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-api-gateway.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:30:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-tegola-vector-tiles.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-tegola-vector-tiles.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:05] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventgate-analytics.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventgate-analytics.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:07] PROBLEM - Confd template for /var/lib/gdnsd/discovery-cxserver.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-cxserver.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:07] PROBLEM 
- Confd template for /var/lib/gdnsd/discovery-recommendation-api.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-recommendation-api.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-docker-registry.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-docker-registry.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:23] PROBLEM - Confd template for /var/lib/gdnsd/discovery-search.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-search.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:23] PROBLEM - gdnsd daemon runs exactly once on dns4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 497 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [22:31:25] PROBLEM - Confd template for /var/lib/gdnsd/discovery-api-rw.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-api-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:29] PROBLEM - Confd template for /var/lib/gdnsd/discovery-videoscaler.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-videoscaler.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:29] PROBLEM - Confd template for /var/lib/gdnsd/discovery-ores.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-ores.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-shellbox-timeline.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-shellbox-timeline.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventgate-analytics-external.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventgate-analytics-external.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:39] PROBLEM - Confd 
template for /var/lib/gdnsd/discovery-jobrunner.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-jobrunner.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-restbase-async.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-restbase-async.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:39] PROBLEM - Confd template for /var/lib/gdnsd/discovery-blubberoid.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-blubberoid.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:40] PROBLEM - Confd template for /var/lib/gdnsd/discovery-linkrecommendation.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-linkrecommendation.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:40] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs-internal.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs-internal.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:41] PROBLEM - Confd template for /var/lib/gdnsd/discovery-swift.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-swift.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:41] PROBLEM - Confd template for /var/lib/gdnsd/discovery-puppetboard.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-puppetboard.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:42] PROBLEM - Confd template for /var/lib/gdnsd/discovery-apertium.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-apertium.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:45] PROBLEM - Confd template for /var/lib/gdnsd/discovery-thanos-query.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-thanos-query.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:31:45] PROBLEM - Confd template 
for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-ro.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-appservers-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-parsoid-php.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-parsoid-php.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-termbox.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-termbox.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-schema.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-schema.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-mobileapps.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-mobileapps.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:10] PROBLEM - Confd template for /var/lib/gdnsd/discovery-shellbox-constraints.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-shellbox-constraints.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:10] PROBLEM - Confd template for /var/lib/gdnsd/discovery-eventgate-main.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventgate-main.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:11] PROBLEM - gdnsd checkconf on dns4003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [22:32:11] PROBLEM - Confd template for 
/var/lib/gdnsd/discovery-eventgate-logging-external.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-eventgate-logging-external.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:12] PROBLEM - Confd template for /var/lib/gdnsd/discovery-api-ro.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-api-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:12] PROBLEM - Confd template for /var/lib/gdnsd/discovery-wdqs.state on dns4003 is CRITICAL: File not found: /var/lib/gdnsd/discovery-wdqs.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:32:42] (03PS2) 10Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) [22:32:44] (03PS3) 10Ebernhardson: [WIP] Use discovery dns for elasticsearch read traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553) [22:32:46] (03PS1) 10Ebernhardson: cirrus: Drop client side connect timeout config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838276 (https://phabricator.wikimedia.org/T143553) [22:34:10] sukhe: ^ if you happen to still be around, is that expected? 
[22:35:18] (not sure if an expiring downtime or what) [22:50:33] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (22) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephosd1034, cloudnet1006, phab1004, releases1002, releases2002, stat1004, stat1005, stat1007, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23c [22:50:33] pet_run_changes [22:50:54] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Jclark-ctr) a:03Jclark-ctr @Andrew updated firmware to 21.85 [22:53:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1023.eqiad.wmnet with OS bullseye [23:01:17] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:05:04] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Andrew) Same behavior as before: ` PXE-E51: No DHCP or proxyDHCP offers were received. 
` [23:09:00] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:09:53] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1023.eqiad.wmnet with OS bullseye [23:11:35] rzl: yeah dns4003 is in a broken state, it's a new host, sorry [23:12:14] I'm guessing downtime expired from the failed imaging [23:12:18] will re-downtime [23:13:59] cool, thanks 👍 I was gonna guess the same but glad to have confirmation [23:14:29] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:20:17] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:33:06] (CR) BCornwall: [C: +1] sites.yaml: add dns4003 to anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/838239 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh) [23:34:03] (PS1) Jdlrobson: Enable Special:Contribute on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/838281 (https://phabricator.wikimedia.org/T319240) [23:40:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [23:44:01] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:45:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to 
CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold