[00:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:19:25] (PS1) Zabe: beta: start reading from rev_comment_id [mediawiki-config] - https://gerrit.wikimedia.org/r/881729 (https://phabricator.wikimedia.org/T299954)
[01:20:17] (CR) Zabe: [C: +2] beta: start reading from rev_comment_id [mediawiki-config] - https://gerrit.wikimedia.org/r/881729 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[01:20:59] (Merged) jenkins-bot: beta: start reading from rev_comment_id [mediawiki-config] - https://gerrit.wikimedia.org/r/881729 (https://phabricator.wikimedia.org/T299954) (owner: Zabe)
[01:33:05] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (wiki_willy) a:Jclark-ctr
[01:49:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:47] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:10] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:11] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following
units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:59] (PS5) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[03:46:59] (CR) CI reject: [V: -1] flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[03:48:45] (PS6) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[04:15:44] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (arclamp2001), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:16:22] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:21:36] (PS2) KartikMistry: Update cxserver to 2023-01-20-051603-production [deployment-charts] - https://gerrit.wikimedia.org/r/881051 (https://phabricator.wikimedia.org/T323840)
[06:07:46] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:16] (CR) Hashar: gerrit: split user and application directories (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: Hashar)
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230120T0700)
[07:12:33] SRE, LDAP-Access-Requests: Grant Access to wmf and ops for Jennifer Ebe - https://phabricator.wikimedia.org/T327255 (JEbe-WMF) >>! In T327255#8542025, @Eevans wrote: >>>!
In T327255#8539372, @JEbe-WMF wrote: >> >> [ ... ] >> >> I am not exactly certain. Because I am new, I am not sure what I need and d...
[07:15:02] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:29:24] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:58:35] !log `apt-get clean` on doh4001 to free space (root partition almost filled)
[07:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:45] cc: sukhe: --^
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230120T0800)
[08:10:33] !log restart kubelet on kubernetes2007 - node reported issues with it, marked as "notready" by the control plane
[08:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:25] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881702 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:19:40] (CR) Muehlenhoff: [C: +1] "LGTM (and yes, that means that noone will be able to login anymore)."
[puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:20:59] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881699 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:21:45] (CR) Muehlenhoff: "Note the grants will need to be deployed by one of the DBAs" [puppet] - https://gerrit.wikimedia.org/r/881701 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[08:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:36:58] (CR) Jelto: [C: -1] P:gitlab: manage gitlab with gitlab module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/684487 (owner: Jbond)
[08:59:06] !log installing ping2003 T273509
[08:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:10] T273509: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509
[09:28:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:30:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.663 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:32:40] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[09:36:20] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:52] RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[09:51:39] Some GitLab alerts may pop up here for the replicas, they are expected
[09:57:46] (JobUnavailable) firing: (11) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:02] (PS1) Muehlenhoff: Fix MAC [puppet] - https://gerrit.wikimedia.org/r/881829 (https://phabricator.wikimedia.org/T273509)
[10:00:20] !log jnuche@deploy1002 Installing scap version "4.33.1" for 1 hosts
[10:00:30] !log jnuche@deploy1002 Installation of scap version "4.33.1" completed for 1 hosts
[10:00:52] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:21] (PS1) Elukey: ml-services: add it goodfaith to ml-staging to test kserve 0.9 [deployment-charts] - https://gerrit.wikimedia.org/r/881830 (https://phabricator.wikimedia.org/T325528)
[10:10:52] (CR) Muehlenhoff: [C: +2] Fix MAC [puppet] - https://gerrit.wikimedia.org/r/881829 (https://phabricator.wikimedia.org/T273509) (owner: Muehlenhoff)
[10:11:05] (CR) Elukey: [C: +2] ml-services: add it goodfaith to ml-staging to test kserve 0.9 [deployment-charts] - https://gerrit.wikimedia.org/r/881830 (https://phabricator.wikimedia.org/T325528) (owner: Elukey)
[10:12:39] !log imported jenkins 2.375-2 to thirdparty/ci T326531
[10:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:42] T326531: Upgrade Jenkins to latest LTS 2.375.2 - https://phabricator.wikimedia.org/T326531
[10:13:20] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:13:20] !log installing emacs security updates on bullseye
[10:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:08] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:22:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:27:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:32:46] !log restart kubelet on ml-staging200* nodes (some fs-inotify-related issues with the istio-proxy of newly created containers)
[10:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:06] (CR) Muehlenhoff: "Patch is fine, but one comment on the versioning" [debs/cadvisor] - https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: Ssingh)
[10:37:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:38:54] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:49:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new ping host - jmm@cumin2002"
[10:50:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new ping host - jmm@cumin2002"
[10:52:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:53:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:54:32] (PS1) Muehlenhoff: Move ping offload from ping2002 to ping2003 in codfw [homer/public] - https://gerrit.wikimedia.org/r/881837 (https://phabricator.wikimedia.org/T273509)
[10:58:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:02:34] (PS1) Elukey: Revert "ml-services: add it goodfaith to ml-staging to test kserve 0.9" [deployment-charts] - https://gerrit.wikimedia.org/r/881851
[11:07:33] (CR) Muehlenhoff: puppet: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:13:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:14:14] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:21] (CR) MVernon: [C: +2] swift: make rclone less fussy [puppet] - https://gerrit.wikimedia.org/r/881662 (https://phabricator.wikimedia.org/T327253) (owner: MVernon)
[11:18:26] (CR) Elukey: [C: +2] Revert "ml-services: add it goodfaith to ml-staging to test kserve 0.9" [deployment-charts] - https://gerrit.wikimedia.org/r/881851 (owner: Elukey)
[11:18:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:20:04] SRE-swift-storage, Patch-For-Review: Rclone is fussy about missing
objects - https://phabricator.wikimedia.org/T327269 (MatthewVernon) Open→Resolved Resolved by implementing option 1. We might want to revisit this once T327253 is done.
[11:37:12] (PS1) Effie Mouzeli: maps: do not run the pregeneration job temporarily on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/881842
[11:41:28] (CR) Jelto: [C: +1] "lgtm and like the "best" workaround to keep the cronjob around for manual troubleshooting." [deployment-charts] - https://gerrit.wikimedia.org/r/881842 (owner: Effie Mouzeli)
[11:42:17] SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (Jclark-ctr) Confirmed: Service Request 160647566 was successfully submitted.
[11:55:28] (CR) Effie Mouzeli: [C: +2] maps: do not run the pregeneration job temporarily on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/881842 (owner: Effie Mouzeli)
[12:00:55] (Merged) jenkins-bot: maps: do not run the pregeneration job temporarily on codfw [deployment-charts] - https://gerrit.wikimedia.org/r/881842 (owner: Effie Mouzeli)
[12:02:58] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[12:03:12] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[12:04:20] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2040.codfw.wmnet with OS bullseye
[12:05:42] (CR) Ssingh: Release 0.44.0+ds1-2 (1 comment) [debs/cadvisor] - https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: Ssingh)
[12:17:10] !log installing ping1003 T273509
[12:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:14] T273509: upgrade ping offload servers to bullseye (was: ping servers running out of disk) - https://phabricator.wikimedia.org/T273509
[12:19:30] (PS4) Ssingh: Release 0.44.0+ds1-1~wmf1 [debs/cadvisor] -
https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557)
[12:20:28] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2040.codfw.wmnet with reason: host reimage
[12:23:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2040.codfw.wmnet with reason: host reimage
[12:26:08] (PS1) Arturo Borrero Gonzalez: openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463)
[12:26:26] (CR) CI reject: [V: -1] openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463) (owner: Arturo Borrero Gonzalez)
[12:29:18] (CR) Muehlenhoff: [C: +1] "Looks good!" [debs/cadvisor] - https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: Ssingh)
[12:31:09] (CR) Ssingh: [C: +2] Release 0.44.0+ds1-1~wmf1 [debs/cadvisor] - https://gerrit.wikimedia.org/r/881689 (https://phabricator.wikimedia.org/T325557) (owner: Ssingh)
[12:36:12] (PS1) Jcrespo: dbbackups: Setting up grants for new dbprov hosts [puppet] - https://gerrit.wikimedia.org/r/881868 (https://phabricator.wikimedia.org/T327155)
[12:36:14] (PS2) Arturo Borrero Gonzalez: openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463)
[12:38:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2040.codfw.wmnet with OS bullseye
[12:39:35] (PS2) Jcrespo: dbbackups: Setting up grants for new dbprov hosts [puppet] - https://gerrit.wikimedia.org/r/881868 (https://phabricator.wikimedia.org/T327155)
[12:40:55] (PS3) Arturo Borrero Gonzalez: openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] -
https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463)
[12:42:28] (CR) Jcrespo: "FYI" [puppet] - https://gerrit.wikimedia.org/r/881868 (https://phabricator.wikimedia.org/T327155) (owner: Jcrespo)
[12:44:49] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected https://puppet-compiler.wmflabs.org/output/881866/39190/" [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463) (owner: Arturo Borrero Gonzalez)
[12:45:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new ping host - jmm@cumin2002"
[12:45:59] (CR) Alexandros Kosiaris: [C: -1] flink - fix access to k8s api (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[12:47:35] (PS1) Muehlenhoff: Move ping offload from ping1002 to ping1003 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/881869 (https://phabricator.wikimedia.org/T273509)
[12:57:56] (PS1) Ilias Sarantopoulos: ml-services: disable multi-porcessing on staging [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624)
[12:58:13] (PS2) Ilias Sarantopoulos: ml-services: disable multi-processing on staging [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624)
[12:58:42] (CR) Hnowlan: helmfile.d: add a new test workflow for Lifting to changeprop's staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[13:00:42] !log reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1~wmf1_amd64.changes: T325557
[13:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:47] T325557: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557
[13:01:57] !log installing libxstream-java security updates
[13:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:52] SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (Marostegui) Thank you! Let us know when the DIMM arrives so we can stop the host and power it off for you
[13:03:48] (PS3) Jcrespo: dbbackups: Setting up grants for new dbprov hosts [puppet] - https://gerrit.wikimedia.org/r/881868 (https://phabricator.wikimedia.org/T327155)
[13:03:51] (CR) FNegri: [C: +1] "LGTM, can you add a link to the upstream bug/patch, either in the commit message or as a comment in the file?" [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463) (owner: Arturo Borrero Gonzalez)
[13:04:45] (CR) Hnowlan: "Logic LGTM as far as changeprop is concerned, one style nice-to-have that can be ignored" [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[13:07:24] (PS4) Arturo Borrero Gonzalez: openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] - https://gerrit.wikimedia.org/r/881866 (https://phabricator.wikimedia.org/T327463)
[13:08:41] !log installing node-minimatch security updates
[13:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new ping host - jmm@cumin2002"
[13:09:08] (CR) Jcrespo: [C: +2] dbbackups: Setting up grants for new dbprov hosts [puppet] - https://gerrit.wikimedia.org/r/881868 (https://phabricator.wikimedia.org/T327155) (owner: Jcrespo)
[13:09:43] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: neutron: keepalived_state_change.py to avoid debug log flood [puppet] - https://gerrit.wikimedia.org/r/881866
(https://phabricator.wikimedia.org/T327463) (owner: Arturo Borrero Gonzalez)
[13:18:36] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (akosiaris) >>! In T325004#8541303, @BTullis wrote: >>>! In T325004#8525568, @taavi wrote: >> Re-opening. The developer account `Hxi-ctr` has shell...
[13:44:17] SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (Marostegui) For what is worth, this host crashed mysql yesterday again. Probably the safest thing to do once the new memory arrives is simply reclone it.
[13:47:05] (PS1) Aqu: Project deprecation [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/881873 (https://phabricator.wikimedia.org/T326194)
[13:47:12] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (akosiaris) Diff's at https://puppet-compiler.wmflabs.org/output/881872/39191/bast3005.wikimedia.org/fulldiff.html >>! In T32...
[13:48:21] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (akosiaris) @HXi-WMF, we are going to have to rename your account from hxi-ctr to xihua due to a mistake on my part. Let us kn...
[13:49:08] (PS1) Aklapper: phabricator weekly changes email: Remove unneeded Bugzilla special case [puppet] - https://gerrit.wikimedia.org/r/881874 (https://phabricator.wikimedia.org/T327503)
[13:55:01] (CR) Andrew Bogott: openstack: encapi: create parent directories for files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/881711 (owner: Majavah)
[14:04:10] (PS1) Muehlenhoff: Change Kwaku's account [puppet] - https://gerrit.wikimedia.org/r/881878
[14:06:22] (CR) Muehlenhoff: [C: +2] Change Kwaku's account [puppet] - https://gerrit.wikimedia.org/r/881878 (owner: Muehlenhoff)
[14:07:39] SRE, ops-eqiad, DBA, DC-Ops: db1198 crash due to memory errors - https://phabricator.wikimedia.org/T327107 (jcrespo) >>! In T327107#8543519, @Marostegui wrote: > For what is worth, this host crashed mysql yesterday again. Probably the safest thing to do once the new memory arrives is simply reclo...
[14:08:05] (CR) Clément Goubert: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[14:08:17] (CR) Clément Goubert: "This change is ready for review."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/881876 (owner: Clément Goubert)
[14:15:56] SRE, observability, Patch-For-Review, SRE Observability (FY2022/2023-Q3), User-fgiunchedi: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (lmata)
[14:17:35] (CR) Elukey: ml-services: disable multi-processing on staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:20:27] (PS3) Ilias Sarantopoulos: ml-services: disable multi-processing on staging [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624)
[14:20:44] (CR) Ilias Sarantopoulos: ml-services: disable multi-processing on staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:21:01] (CR) Elukey: ml-services: disable multi-processing on staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:24:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:24:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:26:44] (CR) Elukey: [C: +2] ml-services: disable multi-processing on staging [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:27:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:28:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:29:40] (CR) MVernon: "Sorry for the slow response!"
[puppet] - https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[14:32:01] (Merged) jenkins-bot: ml-services: disable multi-processing on staging [deployment-charts] - https://gerrit.wikimedia.org/r/881870 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:33:43] (CR) Muehlenhoff: [C: +1] "Looks good! Note that the only use of $nginx_tune_for_media is in the removed code block, do so you can also remove that one in a followup" [puppet] - https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall)
[14:34:54] (CR) Filippo Giunchedi: [C: +1] thanos: drain thanos-be[1,2]004 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[14:35:55] (CR) MVernon: [C: +2] thanos: drain thanos-be[1,2]004 [puppet] - https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[14:52:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:40] (PS1) Aklapper: phabricator weekly changes email: List Herald actions on archived tags [puppet] - https://gerrit.wikimedia.org/r/881884 (https://phabricator.wikimedia.org/T327508)
[15:04:27] (PS1) Muehlenhoff: openstack::cinder::user: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881885
[15:10:19] SRE, SRE-Access-Requests: Requesting access to WMF Production for Kavitha Appakayala - https://phabricator.wikimedia.org/T327450 (Aklapper) a:Kappakayala→None Hi and welcome. Please provide your SSH public key (production must be a separate key from Cloud VPS).
[15:12:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/881885 (owner: Muehlenhoff)
[15:20:30] (PS1) Muehlenhoff: openstack::glance::service: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/881887
[15:22:30] (PS1) Ilias Sarantopoulos: ml-services: Upgrade revscoring staging image with bullseye [deployment-charts] - https://gerrit.wikimedia.org/r/881888 (https://phabricator.wikimedia.org/T325657)
[15:27:07] (CR) Elukey: [C: +2] ml-services: Upgrade revscoring staging image with bullseye [deployment-charts] - https://gerrit.wikimedia.org/r/881888 (https://phabricator.wikimedia.org/T325657) (owner: Ilias Sarantopoulos)
[15:32:00] (Merged) jenkins-bot: ml-services: Upgrade revscoring staging image with bullseye [deployment-charts] - https://gerrit.wikimedia.org/r/881888 (https://phabricator.wikimedia.org/T325657) (owner: Ilias Sarantopoulos)
[15:49:54] (PS2) Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/881876
[15:50:14] (PS2) Clément Goubert: mediawiki: Update ecs logging to 1.11.0-2 [deployment-charts] - https://gerrit.wikimedia.org/r/881877
[15:54:25] (PS3) Elukey: changeprop: add liftwing revscoring streams [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302)
[15:54:27] (PS2) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[15:55:25] (CR) Cwhite: [C: -1] httpd-cgi: Bump ecs version to 1.11.0-2 (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/881876 (owner: Clément Goubert)
[15:56:36] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/881895
[15:57:43] (CR) Cwhite: [C: -1] mediawiki: Update ecs logging to
1.11.0-2 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[15:59:10] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/881895 (owner: Muehlenhoff)
[16:00:16] (PS3) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/881877
[16:02:37] (PS3) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[16:03:25] SRE, Traffic, Patch-For-Review, Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (ssingh) ` sukhe@cumin2002:~$ sudo cumin 'A:cp and A:bullseye' 3 hosts will be targeted: cp[2041-2042].codfw.wmnet,cp5032.eqsin.wmnet ` All three current bullseye host...
[16:04:41] (CR) Hnowlan: [C: +1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:04:52] (PS4) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 [deployment-charts] - https://gerrit.wikimedia.org/r/881877
[16:04:55] (CR) Hnowlan: [C: +1] "lgtm, thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:05:10] (CR) Clément Goubert: mediawiki: Update ecs logging to 1.11.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[16:05:26] (PS3) Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/881876
[16:05:45] (CR) Clément Goubert: httpd-cgi: Bump ecs version to 1.11.0 (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/881876 (owner: Clément Goubert)
[16:13:23] (PS4) Elukey: changeprop: add liftwing revscoring streams [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302)
[16:13:25] (PS4) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[16:13:58] (CR) Elukey: changeprop: add liftwing revscoring streams (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:14:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:14:39] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[16:14:49] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[16:14:56] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[16:15:06] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:15:16] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:15:24] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:24:33] (CR) Ottomata: flink - fix access to k8s api (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:25:30] (PS5) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[16:25:42] (PS1) Cwhite: Clarify ecs.version field format in docs [software/ecs] - https://gerrit.wikimedia.org/r/881809 (https://phabricator.wikimedia.org/T292585)
[16:27:57] (PS2) Cwhite: logstash: expand ecs pre and post filter gates [puppet] - https://gerrit.wikimedia.org/r/831949 (https://phabricator.wikimedia.org/T292585)
[16:29:32] (PS9) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[16:29:54] (CR) CI reject: [V: -1] Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[16:30:43] (PS7) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[16:31:15] (PS10) Stevemunene: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[16:31:44] (PS8) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605
(https://phabricator.wikimedia.org/T324576)
[16:32:51] (CR) Ottomata: flink - fix access to k8s api (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[16:33:06] (Abandoned) Herron: slo_dashboards: move to one SLO/SLI per dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/849131 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[16:33:54] (CR) Stevemunene: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39192/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[16:34:11] (PS5) Elukey: changeprop: add liftwing revscoring streams [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302)
[16:34:13] (PS6) Elukey: helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302)
[16:34:23] (CR) BCornwall: [V: +1 C: +2] tlsproxy: Remove ssl_dyn_rec support [puppet] - https://gerrit.wikimedia.org/r/881717 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall)
[16:34:38] (CR) Elukey: "Completed the last bits, should be ready for the last review! (thanks all for the patience)" [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:34:45] (CR) Elukey: "Completed the last bits, should be ready for the last review!
(thanks all for the patience)" [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[16:41:26] (PS1) BCornwall: tlsproxy: Remove nginx_tune_for_media [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730)
[16:44:38] (CR) BCornwall: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39193/console" [puppet] - https://gerrit.wikimedia.org/r/881902 (https://phabricator.wikimedia.org/T228730) (owner: BCornwall)
[16:45:27] SRE, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (Eevans)
[16:46:22] RECOVERY - Maps - OSM synchronization lag - codfw on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.756e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12
[16:48:04] (CR) BCornwall: [C: +2] tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[16:50:39] SRE, Traffic-Icebox, Patch-For-Review: TLS config issue for nginx on Buster - https://phabricator.wikimedia.org/T228730 (BCornwall) Open→Resolved a:BCornwall ssl_dyn_rec has been removed entirely. Thanks for reporting!
[17:04:33] SRE, Traffic-Icebox, Patch-For-Review, User-jbond: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (BCornwall) Open→Stalled
[17:04:55] SRE, Traffic-Icebox, Patch-For-Review, User-jbond: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (BCornwall) a:BBlack @bblack: There are a number of patches kindly offered by @jbond that, on first glance, provide the functionality you mused abou...
[17:08:04] (PS9) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[17:08:45] (PS10) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[17:09:20] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (hnowlan)
[17:09:45] SRE, Thumbor, Thumbor Migration, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (hnowlan)
[17:18:01] (PS11) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[17:18:57] (PS4) Bking: flink-operator: bump version to 1.3.1 [deployment-charts] - https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576)
[17:20:41] (CR) Bking: flink-operator: bump version to 1.3.1 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: Bking)
[17:26:05] (CR) Hnowlan: [C: +1] helmfile.d: add a new test workflow for Lifting to changeprop's staging [deployment-charts] - https://gerrit.wikimedia.org/r/881664 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[17:26:11] (CR) Hnowlan: [C: +1] changeprop: add liftwing revscoring streams [deployment-charts] - https://gerrit.wikimedia.org/r/881594 (https://phabricator.wikimedia.org/T327302) (owner: Elukey)
[17:26:51] (PS12) Ottomata: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576)
[17:33:05] (CR) Ottomata: [C: +2] flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner:
Ottomata)
[17:33:22] (PS1) Bking: flink-kubernetes-operator: bump version to 1.3.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/881907 (https://phabricator.wikimedia.org/T324576)
[17:37:44] (Merged) jenkins-bot: flink - fix access to k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/881605 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[17:46:34] (PS1) Vlad.shapik: Add a wider list of thumbor local configs [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811)
[17:48:38] (CR) Vlad.shapik: "Such a list of thumbor configs will be more useful than the previous short one." [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811) (owner: Vlad.shapik)
[17:53:24] (PS1) Ottomata: sync flink-kubernetes-operator-crds and flink-kubernetes-operator versions [deployment-charts] - https://gerrit.wikimedia.org/r/881910 (https://phabricator.wikimedia.org/T324576)
[18:01:31] (CR) Ottomata: [C: +2] sync flink-kubernetes-operator-crds and flink-kubernetes-operator versions [deployment-charts] - https://gerrit.wikimedia.org/r/881910 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[18:06:00] (CR) Dzahn: [C: +2] debmonitor: remove racktables links [puppet] - https://gerrit.wikimedia.org/r/881702 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[18:07:28] (CR) Dzahn: [C: +2] phabricator weekly changes email: Remove unneeded Bugzilla special case [puppet] - https://gerrit.wikimedia.org/r/881874 (https://phabricator.wikimedia.org/T327503) (owner: Aklapper)
[18:11:45] (PS1) Ottomata: Set chartVersions for flink-kubernetes-operator and -crds in admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/881911
[18:13:13] (CR) Dzahn: [C: +2] idp: remove racktables related settings [puppet] - https://gerrit.wikimedia.org/r/881697
(https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[18:13:26] (PS3) Dzahn: idp: remove racktables related settings [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405)
[18:22:10] (CR) Ottomata: "Merging to try, please revert if this is the wrong thing to do." [deployment-charts] - https://gerrit.wikimedia.org/r/881911 (owner: Ottomata)
[18:22:14] (CR) Ottomata: [C: +2] Set chartVersions for flink-kubernetes-operator and -crds in admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/881911 (owner: Ottomata)
[18:22:32] !log deploying new grants for backups on m1 T327155
[18:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:38] T327155: Setup dbprov1004 an dbprov2004 as an expansion of the dbprov (database provisioning) cluster, in preparation of binlog backups backup implementation - https://phabricator.wikimedia.org/T327155
[18:25:23] (CR) Ottomata: "Hm, it looks like core.database is still in the output. This is because there is another place the default value of sql_alchemy_conn is s" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[18:26:15] (CR) Dzahn: "ran puppet on idp* hosts. Filebucketed /etc/cas/services/racktables-18.json." [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[18:26:53] (Merged) jenkins-bot: Set chartVersions for flink-kubernetes-operator and -crds in admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/881911 (owner: Ottomata)
[18:27:02] (CR) Dzahn: "thanks Jelto for confirming.
sounds to me like we can abandon this, John" [puppet] - https://gerrit.wikimedia.org/r/684487 (owner: Jbond)
[18:27:41] (CR) Dzahn: [C: +2] "yep:) all merged" [puppet] - https://gerrit.wikimedia.org/r/881359 (https://phabricator.wikimedia.org/T323262) (owner: Hashar)
[18:27:45] (CR) Ottomata: "Sorry about the merge conflict! Good luck!" [deployment-charts] - https://gerrit.wikimedia.org/r/881458 (https://phabricator.wikimedia.org/T324576) (owner: Bking)
[18:29:13] (CR) Dzahn: [C: +2] "worked. now it says "Application Not Authorized to Use CAS" [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[18:31:01] (PS1) Cwhite: logstash: enable filters for ecs 1.11.0 [puppet] - https://gerrit.wikimedia.org/r/881812 (https://phabricator.wikimedia.org/T326794)
[18:31:35] (CR) Dzahn: [C: +2] trafficserver/cache::text: remove racktables [puppet] - https://gerrit.wikimedia.org/r/881699 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[18:31:42] (PS2) Dzahn: trafficserver/cache::text: remove racktables [puppet] - https://gerrit.wikimedia.org/r/881699 (https://phabricator.wikimedia.org/T327405)
[18:35:29] (CR) Cwhite: "This needs deployed before I614b48ae3a4705b4a91d7ca952efb8f144142c66 will work correctly." [puppet] - https://gerrit.wikimedia.org/r/881812 (https://phabricator.wikimedia.org/T326794) (owner: Cwhite)
[18:36:50] (CR) Cwhite: "I think inline is the last no such field. Otherwise LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/881877 (owner: Clément Goubert)
[18:37:19] (CR) Cwhite: [C: +1] "Looks good! Thanks!"
[docker-images/production-images] - https://gerrit.wikimedia.org/r/881876 (owner: Clément Goubert)
[18:37:32] PROBLEM - Static CodeReview archive HTTP on miscweb2002 is CRITICAL: connect to address 10.192.16.211 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Static-codereview.wikimedia.org
[18:37:36] PROBLEM - Check systemd state on miscweb2002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:44] PROBLEM - racktables.wikimedia.org requires authentication on miscweb2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:38:17] (CR) Dzahn: "Maybe you could get this done together with claime and rzl." [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[18:38:38] well, the miscweb2002 issues are caused by me
[18:38:44] while decom'ing racktables
[18:38:51] on it now
[18:39:16] PROBLEM - Static CodeReview archive HTTP on miscweb1002 is CRITICAL: connect to address 10.64.32.187 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Static-codereview.wikimedia.org
[18:39:20] PROBLEM - racktables.wikimedia.org requires authentication on miscweb1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:39:24] PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:51] (PS1) Dzahn: Revert "idp: remove racktables related settings" [puppet] - https://gerrit.wikimedia.org/r/881861
[18:44:20] (CR) Dzahn: [V: +2 C: +2] Revert "idp: remove racktables related settings" [puppet] -
https://gerrit.wikimedia.org/r/881861 (owner: Dzahn)
[18:45:38] RECOVERY - Static CodeReview archive HTTP on miscweb1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 610 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Static-codereview.wikimedia.org
[18:45:44] RECOVERY - racktables.wikimedia.org requires authentication on miscweb1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 626 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:45:48] RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:34] (CR) Herron: [C: +1] logstash: enable filters for ecs 1.11.0 [puppet] - https://gerrit.wikimedia.org/r/881812 (https://phabricator.wikimedia.org/T326794) (owner: Cwhite)
[18:47:06] RECOVERY - Static CodeReview archive HTTP on miscweb2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 610 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Static-codereview.wikimedia.org
[18:47:12] RECOVERY - Check systemd state on miscweb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:47:20] RECOVERY - racktables.wikimedia.org requires authentication on miscweb2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 627 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:52:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:53:18] (CR) Dzahn: "Aklapper, this query works but returns an empty set. but that could be normal and expected, right?"
[puppet] - https://gerrit.wikimedia.org/r/881884 (https://phabricator.wikimedia.org/T327508) (owner: Aklapper)
[18:55:18] (PS2) Dzahn: ci: add contint2002 as a rsync destination host [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659)
[18:55:30] (CR) Dzahn: ci: add contint2002 as a rsync destination host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[18:55:47] (CR) Dzahn: [C: +2] ci: add contint2002 as a rsync destination host [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[18:57:45] (CR) Dzahn: [C: +2] "noop on contint1002 AND contint2001.. wait.. what.. why is nothing happening there on puppet run :p" [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:01:50] (CR) Dzahn: [C: +2] "role(ci::master) is on contint servers, but not yet on contint2002. and it would only affect the destination host..
so it does make sense" [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:02:20] (CR) Dzahn: [C: +2] "maybe we want ONLY the migration profile on this new host at first" [puppet] - https://gerrit.wikimedia.org/r/867711 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:05:11] (PS1) Ottomata: Revert "Set chartVersions for flink-kubernetes-operator and -crds in admin_ng" [deployment-charts] - https://gerrit.wikimedia.org/r/881862
[19:11:46] (CR) Ottomata: [C: +2] Revert "Set chartVersions for flink-kubernetes-operator and -crds in admin_ng" [deployment-charts] - https://gerrit.wikimedia.org/r/881862 (owner: Ottomata)
[19:15:43] (Merged) jenkins-bot: Revert "Set chartVersions for flink-kubernetes-operator and -crds in admin_ng" [deployment-charts] - https://gerrit.wikimedia.org/r/881862 (owner: Ottomata)
[19:25:37] (PS2) Vlad.shapik: Add a wider list of thumbor local configs and fix make online-test command [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811)
[19:26:39] (CR) Dzahn: [C: +2] ci: add contint2002 to zuul_merger firewall, ferm_srange [puppet] - https://gerrit.wikimedia.org/r/867710 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:26:44] (PS2) Dzahn: ci: add contint2002 to zuul_merger firewall, ferm_srange [puppet] - https://gerrit.wikimedia.org/r/867710 (https://phabricator.wikimedia.org/T324659)
[19:26:46] (PS3) Vlad.shapik: Add a longer list of thumbor local configs and fix make online-test command [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/881909 (https://phabricator.wikimedia.org/T325811)
[19:31:02] (CR) Dzahn: "firewall rules have been adjusted on contint2001 and contint1002" [puppet] - https://gerrit.wikimedia.org/r/867710 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:31:22] (CR) Dzahn: [C: +2]
ci: add contint2002 to firewall, jenkins and zuul-merger [puppet] - https://gerrit.wikimedia.org/r/867703 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:32:42] (CR) Dzahn: [C: +2] "on contint1002 and contint2001- +&R_SERVICE(tcp, 4730, (208.80.153.15 208.80.154.132 208.80.153.39));" [puppet] - https://gerrit.wikimedia.org/r/867703 (https://phabricator.wikimedia.org/T324659) (owner: Dzahn)
[19:34:22] (PS1) Sharvaniharan: New config entries for migrated android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/881918
[19:34:52] (PS2) Sharvaniharan: New config entries for migrated android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/881918 (https://phabricator.wikimedia.org/T324167)
[19:35:10] (CR) Dzahn: "yea. so.. turns out we have 2 x +1 and one -1 here. I am not sure entirely if there is still a counter proposal how to solve this differen" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[19:54:13] (CR) Dzahn: [C: +2] "testing on inactive phab server with puppet disabled on active server" [puppet] - https://gerrit.wikimedia.org/r/869853 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn)
[19:56:39] (CR) Dzahn: "not working on phab2002, without running phd, gets 500."
[puppet] - https://gerrit.wikimedia.org/r/879137 (https://phabricator.wikimedia.org/T324311) (owner: Dzahn)
[19:59:19] SRE, LDAP-Access-Requests: Grant Access to Wmf group for MShilova - https://phabricator.wikimedia.org/T327546 (MShilova_WMF)
[20:03:41] (PS1) Dzahn: Revert "phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit" [puppet] - https://gerrit.wikimedia.org/r/881863
[20:05:39] (CR) Dzahn: [C: +2] Revert "phabricator: rewrite https://phabricator.wikimedia.org/r/ to gerrit" [puppet] - https://gerrit.wikimedia.org/r/881863 (owner: Dzahn)
[20:19:06] (PS1) Jclark-ctr: new servers druid10[09-11] adding basic install info Bug:T314335 [puppet] - https://gerrit.wikimedia.org/r/881923
[20:19:26] (CR) CI reject: [V: -1] new servers druid10[09-11] adding basic install info Bug:T314335 [puppet] - https://gerrit.wikimedia.org/r/881923 (owner: Jclark-ctr)
[20:34:43] (Abandoned) Jclark-ctr: new servers druid10[09-11] adding basic install info Bug:T314335 [puppet] - https://gerrit.wikimedia.org/r/881923 (owner: Jclark-ctr)
[20:40:25] (PS1) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881924 (https://phabricator.wikimedia.org/T314335)
[20:40:45] (CR) CI reject: [V: -1] new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881924 (https://phabricator.wikimedia.org/T314335) (owner: Jclark-ctr)
[20:47:26] (Restored) Jclark-ctr: new servers druid10[09-11] adding basic install info Bug:T314335 [puppet] - https://gerrit.wikimedia.org/r/881923 (owner: Jclark-ctr)
[20:48:16] SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (BCornwall) a:BCornwall
[20:51:35] SRE, Domains, Traffic-Icebox: Register wiki(m|p)edia.ro -
https://phabricator.wikimedia.org/T222080 (BCornwall) a:BCornwall I've contacted @CRoslof via email asking for advisement.
[21:00:55] (PS2) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881924 (https://phabricator.wikimedia.org/T314335)
[21:07:41] (CR) Papaul: [C: +2] new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881924 (https://phabricator.wikimedia.org/T314335) (owner: Jclark-ctr)
[21:08:25] SRE, Traffic-Icebox: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (BCornwall) Open→Stalled Marking as stalled since upstream hasn't worked on it yet.
[21:16:28] SRE, Traffic-Icebox, Patch-For-Review, User-MoritzMuehlenhoff: Create a generic network performance profile - https://phabricator.wikimedia.org/T274230 (BCornwall) Open→Stalled @jbond is this still desirable? If so, was the failing test holding you back from poking the tagged reviewers? O...
[21:20:06] (Abandoned) Jclark-ctr: new servers druid10[09-11] adding basic install info Bug:T314335 [puppet] - https://gerrit.wikimedia.org/r/881923 (owner: Jclark-ctr)
[21:24:52] (PS1) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[21:30:18] (CR) Cwhite: [C: +2] role, profile: remove logstash(7) role and hiera config [puppet] - https://gerrit.wikimedia.org/r/879887 (owner: Cwhite)
[21:31:53] SRE, DNS, Traffic-Icebox: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (BCornwall) Stalled→Resolved a:BCornwall As wikimedia.ee is now managed by Wikimedia Estonia (The nameservers were pointed to their hosting provider in T20405...
[21:32:25] SRE, Traffic-Icebox: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (Vgutierrez) I believe we could close this one now that ATS doesn't handle TLS termination anymore
[21:32:38] (PS2) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[21:34:23] SRE, Traffic-Icebox: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (BCornwall) Stalled→Invalid Closing as invalid since ATS doesn't handle TLS termination anymore.
[21:44:34] (PS1) Cwhite: conftool-data: add logstash[12]032 to kibana7 backend [puppet] - https://gerrit.wikimedia.org/r/881813
[21:45:38] SRE, Traffic-Icebox: Indexing of https://www.wikidata.org in the Yandex Search Engine - https://phabricator.wikimedia.org/T217407 (BCornwall) Open→Resolved a:BCornwall This ticket is quite old and a few answers have been given: > There is no hard and fast limit on read requests, but be consi...
[21:45:52] (CR) Cwhite: role: remove kibana7_ecs role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/879888 (owner: Cwhite)
[21:48:32] SRE, Traffic-Icebox: clean up deprecated TLS certificates from the puppet repo - https://phabricator.wikimedia.org/T211697 (BCornwall) Open→Resolved a:BCornwall As @Vgutierrez kindly merged in patches that addressed the ticket description, this can be closed. Any further domains/revisiting of...
[21:52:01] (CR) Cwhite: "We'll also have to run a phatality deploy."
[puppet] - https://gerrit.wikimedia.org/r/881813 (owner: Cwhite)
[22:01:56] SRE: Update Media dashboard in Grafana to use Prometheus metrics - https://phabricator.wikimedia.org/T193445 (BCornwall)
[22:05:13] (CR) Papaul: [V: -1] new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335) (owner: Jclark-ctr)
[22:06:10] (CR) Papaul: [V: -1] "missing partman/standard.cfg" [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335) (owner: Jclark-ctr)
[22:08:59] SRE, Traffic-Icebox: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (BCornwall) >>! In T266746#6640849, @BBlack wrote: > Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream. >...
[22:16:13] (CR) Zabe: [C: +1] Pin CheckUserEventTablesMigrationStage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/881390 (https://phabricator.wikimedia.org/T324907) (owner: Dreamy Jazz)
[22:26:36] (PS3) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[22:33:38] (PS4) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[22:52:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:29] (PS5) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[22:57:38] SRE, Domains, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (Dzahn) @BCornwall This should go via MarkMonitor. We have a rep there, or at least we used to. It's probably advised to keep it within MM and not use another registrar, unless things have changed of c...
[22:58:03] (PS6) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[22:59:12] (PS7) Jclark-ctr: new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335)
[23:02:19] (PS1) Dzahn: idp: remove config for racktables [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405)
[23:04:48] (CR) Papaul: [C: +2] new servers druid10[09-11] adding basic install info [puppet] - https://gerrit.wikimedia.org/r/881929 (https://phabricator.wikimedia.org/T314335) (owner: Jclark-ctr)
[23:09:04] (PS2) Dzahn: idp: remove config for racktables [puppet] - https://gerrit.wikimedia.org/r/881938 (https://phabricator.wikimedia.org/T327405)
[23:12:24] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (Jclark-ctr)
[23:15:14] (PS1) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[23:16:00] (CR) Dzahn: [C: +2] "so if merge this then the "50-racktables.wikimedia.org" apache site gets removed and then apache can't be restarted anymore, because " [au" [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[23:19:49] (CR) Dzahn: [C: +2] idp: remove racktables related settings (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[23:24:48] (CR) Dzahn: [C: +2] "the mods-enabled/auth_cas.conf needs to be removed as well, not just auth_cas.load and the racktables site to fix it" [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[23:25:06] (PS2) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[23:25:45] (PS3) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[23:27:44] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39199/console" [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[23:27:54] PROBLEM - racktables.wikimedia.org requires authentication on miscweb2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:29:30] RECOVERY - racktables.wikimedia.org requires authentication on miscweb2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 626 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:29:31] (CR) Dzahn: [C: +2] "then if we manually also delete the mod config..
we still have this monitoring alert going off: PROBLEM - racktables.wikimedia.org require" [puppet] - https://gerrit.wikimedia.org/r/881697 (https://phabricator.wikimedia.org/T327405) (owner: Dzahn)
[23:34:01] (PS4) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[23:34:20] (CR) CI reject: [V: -1] centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[23:35:44] (PS5) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778)
[23:36:38] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39201/console" [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[23:39:24] (CR) Andrea Denisse: centrallog: apply role::syslog::centralserver on centrallog instances [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)
[23:40:35] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/881939/39201/" [puppet] - https://gerrit.wikimedia.org/r/881939 (https://phabricator.wikimedia.org/T318778) (owner: Andrea Denisse)