[00:02:01] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:23] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:30:23] (PS3) Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692)
[00:31:35] (PS11) Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692)
[00:32:27] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:08] SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul)
[00:33:31] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[00:34:33] SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul) Open→Resolved Row D maintenance complete
[00:35:19] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[00:36:25] SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task -
https://phabricator.wikimedia.org/T309956 (Papaul)
[00:37:02] (PS1) Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620)
[00:39:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-drop-webrequest-sequence-stats-partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:15] SRE-OnFire, SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (lmata) >In T309033#8140576, @herron wrote: > Please see https://phabricator.wikimedia.org/T313229#8130640 Thank you!
[00:41:47] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:35] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:05] (PS1) Andrea Denisse: netmon: Create the OpenSSH directory inside the rancid home directory [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936)
[00:57:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
[00:57:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
[00:58:01] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
[00:58:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
[00:58:09] !log sukhe@puppetmaster1001 conftool action :
set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
[01:02:51] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:50] (PS1) Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/822197 (https://phabricator.wikimedia.org/T308620)
[01:08:26] (Abandoned) Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[01:12:03] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:15:51] (CR) Andrea Denisse: "Hello team, here are the PCC results for this patch: https://puppet-compiler.wmflabs.org/pcc-worker1001/36691/" [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[01:19:12] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Hello team, after further testing, the least disruptive and simplest approach is to create the `.ssh` directory using Puppet. It nee...
[01:19:23] !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
[01:19:58] !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw
[01:32:57] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:01] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:34:10] (CR) Tim Starling: [C: +2] Microsecond timestamp resolution in UDP logs [mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling)
[01:34:54] (Merged) jenkins-bot: Microsecond timestamp resolution in UDP logs [mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling)
[01:37:25] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:48] !log tstarling@deploy1002 Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
[01:38:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:39:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:39:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:40:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:42:17] RECOVERY - Check systemd state
on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:07] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:29] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:05] (PS13) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:25] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:29] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:51] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:58] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:11] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:26:07] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:01] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:49] RECOVERY - Check systemd state on
webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:35] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:34:17] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:40:57] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:12] (PS12) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[02:46:23] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:57:00] hello :)
[02:57:02] upstream connect error or disconnect/reset before headers.
reset reason: overflow
[02:57:10] I was wondering if it was just me
[02:57:19] went away now though
[02:57:25] same
[02:57:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:57:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:58:12] hey, looking
[02:58:18] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:48] online as well
[03:00:33] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:09] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:02:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:03:18] (ProbeDown) resolved: Service
text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:33] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:17] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:55] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:31] SRE, MediaWiki-General, Traffic-Icebox, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) I implemented option 3, and created T314868 for tracking the roll-out.
[03:26:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:29] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:45] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:21] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:04] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I did a pywikibot edit on testwiki from my Dallas test instance. The time between the completion of the last codfw sessionstore write and the eqia...
[03:34:32] SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[03:35:42] (CR) Mary Yang: Use proxy for wikifunctions beta blackbox probe.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[03:41:49] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:44:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:11] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:27] SRE, Infrastructure-Foundations, Observability-Metrics, netops, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (ayounsi) Looks like permission issues: `name=netmon1003 ayounsi@...
[03:51:03] !log chown librenms /srv/librenms/rrd/* on netmon1003 T314972
[03:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:07] T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972
[03:52:31] Thanks cwhite, I think we should run the 'chown' recursively. Looking at the puppet repo to send a patch that fixes this.
[03:53:27] denisse|m: good catch
[03:53:35] reran with -R
[03:55:17] !log chown -R librenms /srv/librenms/rrd/ on netmon1003 T314972
[03:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:18] (CR) Cwhite: [C: +1] "LGTM!"
[docker-images/production-images] - https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[04:02:58] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:24] (CR) Cwhite: [C: +1] netmon: Add the netmon1003 host to the alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[04:07:57] (CR) Cwhite: [C: +1] netmon: Use netmon1003's IP address for the librenms endpoint [puppet] - https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[04:11:29] Looking at the code, that folder belongs to the 'www-data' user. https://github.com/wikimedia/puppet/blob/production/modules/librenms/manifests/init.pp#L107
[04:11:29] So I guess it's overridden to 'deploy-librenms' during the 'rsync::quickdatacopy' process...
[04:12:19] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:01] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:57] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:21] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:17] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:58] (CR) Ayounsi: [C: +1] "LGTM on the overall logic (and PCC)." [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[04:18:58] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (ayounsi) > @ayounsi do you anticipate any fallout from this? I agree that it's better to check host keys, so +1 as long as: * there is some kind of al...
[04:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:33:33] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:15] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:59] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:59] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:02] (PS1) Andrea Denisse: netmon: Set correct owner for the LibreNMS rrd directory. [puppet] - https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972)
[04:53:48] SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse) I think that the owner is overridden to '`deploy-librenms`' during the [[ ht...
[04:55:25] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:25] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36692/" [puppet] - https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: Andrea Denisse)
[05:00:05] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:11] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:15] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:15:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:18:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
[05:18:49] T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[05:19:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on
25 hosts with reason: Primary switchover s2 T314368
[05:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1122 with weight 0 T314368', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
[05:22:05] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:32:27] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:14] (PS2) Ladsgroup: mariadb: Promote db1122 to s2 master [puppet] - https://gerrit.wikimedia.org/r/819525 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[05:39:19] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1122 to s2 master [puppet] - https://gerrit.wikimedia.org/r/819525 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[05:39:22] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (tstarling) >>! In T40010#8144396, @Arthur2e5 wrote: > I am… getting impatient enough to ask: how hard is it to, really, just make our own statically-compiled rsvg-convert bina...
[05:41:57] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:17] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:13] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:33] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T0600). Please do the needful.
[06:00:14] o/
[06:00:15] let's go
[06:00:19] !log Starting s2 eqiad failover from db1162 to db1122 - T314368
[06:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:23] T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[06:00:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T314368', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
[06:01:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write T314368', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
[06:01:15] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:02:19] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:03] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:32] (PS2) Ladsgroup: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/819546 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[06:04:05] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/819546 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[06:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1162 (T314368 T298555 T312863 T310011 T309311 T60674 T298560 T303603 T310485)', diff saved to https://phabricator.wikimedia.org/P32352 and previous
config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
[06:06:39] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[06:06:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674
[06:06:40] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[06:06:40] T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[06:06:40] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:06:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[06:06:41] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:06:41] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:12:31] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:11] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:31] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
[06:17:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
[06:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160
(T312863)', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
[06:17:37] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:22:29] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[06:28:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[06:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
[06:47:27] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w
[06:47:27] wikimedia.org/wiki/Services/Monitoring/restbase
[06:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
[06:49:19] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:53:03] PROBLEM - Check systemd state on webperf2004 is
CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:25] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:58:39] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Amir1, apergos, jnuche, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T0700). [07:00:18] morning! [07:00:28] there are no trainees signed up and no patches in the window. [07:01:03] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:41] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:21] apergos: today or early next week I'm planning to the patches for reload db config on the fly, that will impact WikiExporter [07:02:43] planning to what the patches, sorry? 
a verb missing there [07:02:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json [07:02:53] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [07:02:56] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [07:03:02] (03PS1) 10Giuseppe Lavagetto: Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154 [07:03:05] apergos: deploy, sorry [07:03:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [07:03:09] (03PS2) 10David Caro: p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) [07:03:11] (03PS2) 10David Caro: p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) [07:03:13] (03PS2) 10David Caro: p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) [07:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json [07:03:15] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 
10https://gerrit.wikimedia.org/r/822154 (owner: 10Giuseppe Lavagetto) [07:03:16] context: T298485 [07:03:18] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [07:03:24] (03PS2) 10Giuseppe Lavagetto: Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154 [07:03:31] (03CR) 10Giuseppe Lavagetto: [V: 03+2] Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154 (owner: 10Giuseppe Lavagetto) [07:03:35] mind waiting until early next week? I plan on following some wikimania sessions today and tomorrow [07:03:48] just in case there's an issue [07:04:13] sure [07:05:06] sounds good [07:05:21] I'm subscribed on that task and have been following the patch of course [07:08:03] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:03] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2135 [puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628) [07:10:24] (03CR) 10Ladsgroup: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628) (owner: 10Jcrespo) [07:11:03] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:17] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:47] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:11] <_joe_> !log pooling all services in codfw [07:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:53] (03PS1) 10David Caro: pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 [07:23:48] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=inference [07:24:07] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline [07:24:56] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad [07:25:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:26:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:26:37] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:26:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10ayounsi) Thanks for the explanations! I think it would be nice to have them on Wikitech to find them more easily in the future. Based on... [07:27:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:31] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:33] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:33] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:55] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:59] PROBLEM - Confd template for 
/var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:27:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:28:03] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:28:11] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:53] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, logstash2003, mc2024, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe200 [07:41:39] s-fe2002, thanos-fe2003 
https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [07:42:59] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:18] (03PS1) 10Vgutierrez: Revert "lvs: move ulsfo, eqsin off of conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/822155 [07:47:34] (03CR) 10FNegri: [C: 03+1] p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [07:50:19] _joe_: is that you? ^ [07:50:34] the gdnsd/confd noise [07:50:37] <_joe_> vgutierrez: yes sigh [07:50:41] <_joe_> will fix it [07:50:45] thx <3 [07:50:53] <_joe_> nothing bad happened btw [07:51:05] (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: move ulsfo, eqsin off of conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/822155 (owner: 10Vgutierrez) [07:51:27] !log rolling restart of pybal in eqsin and ulsfo [07:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:25] (03CR) 10FNegri: [C: 03+1] p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [07:52:43] (03CR) 10FNegri: p:ceph::osd: bring the cluster interface up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [07:55:33] (03PS1) 10David Caro: icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 [07:55:40] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw [07:56:23] 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10jcrespo) After 38 hours of checking and 7 million rows compared to eqiad's es1021, I can confidently say that data was in a
good state after the crash. [07:56:56] (03PS1) 10David Caro: wmcs.neutron: Add alert to open a task when agent down [alerts] - 10https://gerrit.wikimedia.org/r/822319 [07:57:48] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1 [07:58:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1 [07:58:43] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:28] <_joe_> vgutierrez: we should get recoveries soon [07:59:28] (03PS3) 10David Caro: p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) [07:59:30] (03PS3) 10David Caro: p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) [07:59:32] (03CR) 10David Caro: p:ceph::osd: bring the cluster interface up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [07:59:36] (03PS3) 10David Caro: p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) [07:59:43] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:46] _joe_: nice [08:00:59] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, logstash2003, mc2024, ms-be2067, ms-fe1010, ms-fe1011, 
ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe200 [08:00:59] s-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:01:00] (03CR) 10FNegri: [C: 03+1] p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:03:47] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:04:09] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:15] PROBLEM - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:06:08] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1 [08:06:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1 [08:09:57] (03CR) 10Filippo Giunchedi: [C: 03+2] netmon: Use netmon1003's IP address for the librenms endpoint [puppet] - 10https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [08:10:35] RECOVERY - PyBal connections to etcd on lvs5001 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [08:11:09] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications for db2135 [puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628) (owner: 10Jcrespo) [08:11:15] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:32] (03CR) 10Filippo Giunchedi: [C: 03+2] netmon: Add the netmon1003 host to the alertmanager API rw [puppet] - 10https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [08:11:56] jynus: merged your patch too! [08:12:01] thanks, I was about to ask [08:12:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:13:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [08:15:28] (03CR) 10Filippo Giunchedi: logstash route k8s logs from proxy,httpd containers to webrequest partition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [08:16:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you! LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro) [08:16:13] (03CR) 10Filippo Giunchedi: [C: 03+1] pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 (owner: 10David Caro) [08:21:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[08:23:54] (03PS1) 10Elukey: Add the experimental k8s ml-serve configuration [labs/private] - 10https://gerrit.wikimedia.org/r/822321 [08:24:29] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add the experimental k8s ml-serve configuration [labs/private] - 10https://gerrit.wikimedia.org/r/822321 (owner: 10Elukey) [08:25:30] 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi) [08:26:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:31:10] (03CR) 10David Caro: [C: 03+2] p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:31:13] (03CR) 10David Caro: [C: 03+2] p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:31:17] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10jbond) [08:31:19] (03CR) 10David Caro: [C: 03+2] p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [08:31:35] (03PS1) 10Elukey: Add 'experimental' user/ns config for k8s ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/822324 [08:32:15] (03CR) 10Vgutierrez: Enable query sorting for all testwiki requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [08:32:31] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: 
ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:01] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10fgiunchedi) My apologies! I ran the quickdatacopy the other day ahead of the failover and... [08:34:16] (03CR) 10Elukey: [C: 03+2] Add 'experimental' user/ns config for k8s ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/822324 (owner: 10Elukey) [08:37:51] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) Well, I tried our usual procedure https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings and the first two commands work OK, but attempting t... [08:41:57] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:58] (03PS1) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 [08:47:05] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:47:34] (03PS2) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 [08:47:45] (03CR) 10Filippo Giunchedi: "Thank you Andrea for looking into this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [08:55:05] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 112.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [08:59:01] (03CR) 10Svantje Lilienthal: [C: 03+1] Enable editor line numbering on all namespaces, for twwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight) [09:00:07] !log update unzip [09:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:27] PROBLEM - k8s requests count to the API on ml-serve-ctrl1002 is CRITICAL: 113.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [09:03:21] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:19] !log update gnutls28 on bullseye systems [09:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:47] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:16] (03PS3) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 [09:12:43] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10fgiunchedi) I looked into why quickdatacopy didn't do the right thing: * the rsync server... 
[09:14:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10fgiunchedi) Agreed on the short term fix to create the `.ssh` directory. However if we were not checking host keys to begin with I think we should keep... [09:15:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:17:33] XioNoX: topranks: fyi ^^ [09:19:43] jbond: interesting, I don't know where the rancid passphrase is :) [09:19:55] but I guess it got armed on netmon1003 so someone should have it? [09:20:08] yes there was some chat in sre yesterday one sec [09:20:48] XioNoX: apparently ~/Wikimedia/pw/network-monitoring-keys-passphrase [09:21:09] (03PS1) 10FNegri: Use broader network for Ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) [09:22:04] jbond: done [09:22:08] jbond: note that "access: @ops" [09:22:28] only the homer key is restricted to netops (as it has write access) [09:22:40] thanks and ack [09:22:41] (03CR) 10David Caro: [C: 03+2] pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 (owner: 10David Caro) [09:22:52] (03CR) 10David Caro: [C: 03+2] icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro) [09:23:40] (03CR) 10AikoChou: ml-services: add the experimental helmfile config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (owner: 10Elukey) [09:25:16] (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36694/console" [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:25:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s)
on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:27:25] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:27:31] (03CR) 10FNegri: [V: 03+1 C: 03+2] Use broader network for Ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [09:27:35] (03CR) 10AikoChou: [C: 03+1] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (owner: 10Elukey) [09:31:27] (03PS4) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) [09:31:29] (03CR) 10Elukey: ml-services: add the experimental helmfile config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey) [09:32:18] !log arm keyholder on netmon2001 [09:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:45] thank you stashbot [09:33:16] (03CR) 10Elukey: [C: 03+1] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [09:34:39] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:33] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:38:15] PROBLEM - Check systemd state on netboxdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@13-main.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:41:03] 10SRE, 10Infrastructure-Foundations, 10netops: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (10ayounsi) 05Open→03Resolved a:03ayounsi [09:43:57] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [09:48:18] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 (owner: 10Gergő Tisza) [09:48:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:48:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:49:59] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10fgiunchedi) re: the original failed disk I can confirm that `slot 0` (where the disk was) isn't currently listed: ` root@ms-be2067:~# megacli -pdlist -aALL | grep 'Slot Number' Slot Number: 1 S... 
[09:51:06] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey) [09:51:40] (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [09:52:28] (03PS3) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) [09:55:13] (03CR) 10CI reject: [V: 04-1] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [09:55:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:55:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:56:08] (03CR) 10Jbond: [C: 03+2] P:base::firewall: use either etc or abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [09:56:51] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe [09:56:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:56:55] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [09:56:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:57:24] (03PS4) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 
(https://phabricator.wikimedia.org/T314922) [10:00:05] mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1000). [10:00:36] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10fgiunchedi) Thank you for vopsbot, looks really good and useful! A perhaps silly/minor thing: I think we should be using `-` instead of `_` as a delimiter for commands, a... [10:00:40] Beta cluster 503ing [10:01:32] bad scap runs x3 [10:02:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:02:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:02:34] (03CR) 10CI reject: [V: 04-1] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey) [10:02:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [10:03:23] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 294, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:04:57] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:11] PROBLEM - confd service on sretest1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:07:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared 
to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [10:12:10] !log upload cond package to bullseye-wikimedia [10:13:01] !log (correction) upload *confd* package to bullseye-wikimedia [10:14:13] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:58] 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe I'm resolving the task because I think the current changes are enough for the current goal. I'll come back to look at what @Rhi... [10:16:03] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [10:31:45] RECOVERY - Check systemd state on netboxdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:09] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:28] (03PS1) 10Ayounsi: Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) [10:37:30] (03PS1) 10Jbond: O:sretest: enable confd based abuse filter on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822340 (https://phabricator.wikimedia.org/T313825) [10:38:43] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:09] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:30] (03CR) 10CI reject: [V: 
04-1] Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:49:47] (03PS1) 10Jbond: C:postgresql::slave: update recovery configueration [puppet] - 10https://gerrit.wikimedia.org/r/822342 [10:50:27] (03CR) 10Jbond: [C: 03+2] O:sretest: enable confd based abuse filter on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822340 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [10:58:41] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:54] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:09] (03CR) 10CI reject: [V: 04-1] C:postgresql::slave: update recovery configueration [puppet] - 10https://gerrit.wikimedia.org/r/822342 (owner: 10Jbond) [11:06:51] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:09:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:10:01] RECOVERY - confd service on sretest1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:10:17] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:24] (03PS2) 10Ayounsi: Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) [11:14:55] (LogstashIngestSpike) resolved: (2) Logstash rate of 
ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:20:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:20:39] (03PS1) 10Jbond: C:ferm: allow useres to read config files, needed for nrpe [puppet] - 10https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825) [11:20:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [11:26:36] (03PS2) 10Jbond: C:ferm: allow useres to read config files, needed for nrpe [puppet] - 10https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825) [11:32:57] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:48] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [11:39:57] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:51] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder) [11:45:36] hi there! just a FYI, the WMCS cluster is currently having some issues. 
I have disabled the sync of WM code to beta for the time being until things stabilize a bit [11:52:41] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey) [11:55:53] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:58:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [11:59:43] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:02:51] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:03:17] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:03] (03CR) 10Elukey: [C: 03+2] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey) [12:05:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:09:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[12:10:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:11:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:12:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:12:29] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) @fgiunchedi yeah that may be an option. I'm not sure how easy it is to change Rancid to add that to the command when running ssh, but I'm sur... [12:13:55] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet [12:14:12] (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi) [12:14:23] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:16:53] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[12:17:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:17:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:17:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:17:38] 10SRE, 10Traffic, 10observability, 10Patch-For-Review, 10Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Me and @Vgutierrez have fixed the existing histograms and I've added a test for `buckets -1` so we d... [12:23:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) Another oddity here with rancid from netmon1003. The permission change has removed the problem for most of our estate (all the Juniper device... [12:23:18] (03PS1) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) [12:23:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [12:26:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet [12:26:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet [12:27:36] if someone saw my message above, I've re-enabled beta sync [12:32:15] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10fgiunchedi) [12:32:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) Logs suggest a timeout: ` scs-oe16-esams.mgmt.esams.wmnet oglogin error: Error: TIMEOUT reached scs-oe16-esams.mgmt.esams.wmnet: missed cmd(s... 
[12:33:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "That's correct Mary, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [12:33:47] (03CR) 10Filippo Giunchedi: [C: 03+2] Add proxy_url to prometheus::blackbox::check:http as a parameter. [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [12:34:55] PROBLEM - Host logstash2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:40] known ^, host is a lemon [12:36:34] (03PS2) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) [12:37:11] (03PS3) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) [12:39:53] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) p:05Triage→03Medium a:03Papaul [12:46:26] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:49:05] (03PS4) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) [12:49:30] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:51:24] (03PS5) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) [12:55:37] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [12:55:41] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [12:56:47] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye [12:56:56] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1060.eqiad.wmnet with OS bullseye [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1300) [13:00:05] RoanKattouw, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1300). [13:00:05] awight, koi, MatmaRex, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] yup hi [13:00:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:24] o/ [13:00:34] I can self-deploy, and happy to do anyone else's patches. 
[13:01:03] (03PS2) 10Awight: Enable editor line numbering on all namespaces, for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) [13:01:10] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight) [13:01:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:01:36] o/ [13:01:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) I believe the issue is that the expect script Rancid is running for these is not saying "yes" to accept the host key. This did not happen in... [13:02:06] (03Merged) 10jenkins-bot: Enable editor line numbering on all namespaces, for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight) [13:02:10] hrm logspam-watch is broken [13:02:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:03:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:04:43] looks okay on debug, deploying [13:05:57] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:06:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36697/console" [puppet] - 10https://gerrit.wikimedia.org/r/822342 (owner: 10Jbond) [13:06:48] 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - 
https://phabricator.wikimedia.org/T314981 (10BTullis) Since our meeting, I have been reading the docs around benthos and I've got to say, I find it really compelling! This looks to m... [13:07:16] (03PS3) 10Awight: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang) [13:08:44] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822073|Enable editor line numbering on all namespaces, for twwiki (T302852)]] (duration: 03m 42s) [13:08:48] T302852: Enable line numbering in all namespaces for more wikis (collection of requests) - https://phabricator.wikimedia.org/T302852 [13:08:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [13:09:00] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang) [13:09:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:47] (03Merged) 10jenkins-bot: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang) [13:10:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:10:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:26] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage [13:11:34] koi: the 500k logo change is ready on mwdebug1001 if you wish to test it [13:11:40] looking [13:11:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:12:17] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [13:12:19] awight: LGTM [13:12:26] ack [13:12:34] (03PS1) 10Jgreen: Enable icinga monitoring for frlog1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/822369 (https://phabricator.wikimedia.org/T312581) [13:14:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage [13:14:24] I think I can deploy wmf-config first, but a bit worried that there might be a race condition with logos/ [13:15:33] Generally, patches should be split up so that each one is safe regardless of the order in which each file is synced. [13:15:52] (03CR) 10Jgreen: [C: 03+2] Enable icinga monitoring for frlog1002. [puppet] - 10https://gerrit.wikimedia.org/r/822369 (https://phabricator.wikimedia.org/T312581) (owner: 10Jgreen) [13:16:17] noticed, will think about that in the future [13:16:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:59] Deployers: can I run createExtensionTables.php safely, or is that an Ops thing? [13:17:12] urbanecm: ^ [13:17:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:17:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:18:01] !log awight@deploy1002 Synchronized wmf-config/: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 1) (duration: 03m 13s) [13:18:31] T313173#8101583 [13:18:31] T313173: add WikiLove extension in ptwikinews - https://phabricator.wikimedia.org/T313173 [13:18:39] awight: previous task might help ^ [13:18:44] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:18:50] (03CR) 10Cathal Mooney: [C: 03+2] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [13:18:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:19:04] awight: deployers can run that script [13:19:20] koi: very helpful, thanks! 
[13:19:25] urbanecm: great [13:19:44] !log merging CR821781 to expose additional network info in puppet facts [13:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:20] !log awight@deploy1002 Synchronized logos/: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 2) (duration: 03m 09s) [13:24:02] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [13:25:08] (03PS2) 10Awight: trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang) [13:25:14] !log awight@deploy1002 Synchronized static/images: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 3) (duration: 03m 09s) [13:26:12] (03CR) 10Awight: [C: 03+2] trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang) [13:27:11] (03Merged) 10jenkins-bot: trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang) [13:27:55] koi: wikilove can be tested on trwikiquote using mwdebug1001 [13:28:00] looking [13:28:50] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [13:29:38] (03PS1) 10Vgutierrez: smokeping: Use asw1-b12-drmrs instead of lvs6001 [puppet] - 10https://gerrit.wikimedia.org/r/822373 [13:29:42] awight: LGTM [13:30:04] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:54] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:32:06] koi: +1 ty [13:33:58] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet [13:34:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: 
apply [13:35:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:36:00] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822130|trwikiquote: Install WikiLove extension (T314895)]] (duration: 03m 30s) [13:36:03] T314895: Enable the WikiLove extension on trwikiquote - https://phabricator.wikimedia.org/T314895 [13:36:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:16] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 20.30 ms [13:36:18] MatmaRex: Would you like to self-deploy, or shall I? [13:36:35] awight: please do, i don't have access [13:36:41] sure! [13:36:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye [13:37:04] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1060.eqiad.wmnet with OS bullseye completed: - elastic1060 (... [13:38:07] (03CR) 10Awight: "Many of the .html files have conflict markers--is this a problem?" [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: 10Bartosz Dziewoński) [13:38:15] MatmaRex: Can you check that? ^ [13:38:40] Seems like the tests should have failed... 
[13:39:14] (03CR) 10Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (031 comment) [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: 10Bartosz Dziewoński) [13:39:21] yeah [13:39:36] the tests parse the HTML, maybe they look enough like HTML tags [13:39:41] let me try to rebuild the tests [13:40:14] kk, I'll move on to phuedx's patches in the meantime [13:41:07] (03PS3) 10Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) [13:41:43] (03PS3) 10Awight: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820666 (owner: 10Phuedx) [13:41:55] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820666 (owner: 10Phuedx) [13:42:06] oh wait, i see [13:42:12] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:12] those files aren't used. 
oops [13:42:47] (03Merged) 10jenkins-bot: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820666 (owner: 10Phuedx) [13:44:01] phuedx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820666 is on mwdebug1001 [13:44:12] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:18] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:44:20] (03PS4) 10Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) [13:45:07] (03CR) 10Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (031 comment) [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: 10Bartosz Dziewoński) [13:45:35] awight: the test files with conflict markers weren't actually used. fixed now [13:45:36] phuedx: I'm not sure how to test, all I can say is that I don't see js console errors and the site still works when I mouse around. [13:45:40] MatmaRex: ty [13:45:57] (03CR) 10MVernon: [C: 03+2] Hieradata: move restbase prod to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/819578 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [13:46:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:46:22] awight: Testing now. 
I'm verifying that the stream config for the stream is only sent to the client on testwiki and not on, say, enwiki [13:46:48] phuedx: good thing you're testing ;-), I was accidentally on enwiki [13:46:54] awight: LGTM [13:47:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:47:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:47:58] ack [13:48:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:50:30] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [13:50:59] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:820666|Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream""]] (duration: 03m 10s) [13:51:41] (03CR) 10Awight: [C: 03+2] "Deploying. This historical block deserves a celebration of newfound emptiness!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:52:03] !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002 [13:52:07] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [13:53:24] (03PS3) 10Ori: Enable query sorting for all testwiki requests [puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) [13:54:13] (03CR) 10Awight: [C: 03+2] "Deploying." [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: 10Bartosz Dziewoński) [13:55:00] (03PS2) 10Awight: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:55:15] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:55:17] (03CR) 10Ori: Enable query sorting for all testwiki requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [13:55:40] vgutierrez: ^ [13:56:03] (03Merged) 10jenkins-bot: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [13:56:29] (03PS1) 10Ladsgroup: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) [13:56:40] (03CR) 10Ori: [C: 03+1] Use proxy for wikifunctions beta blackbox probe. [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [13:57:33] phuedx: the second event patch is ready on mwdebug1001 [13:57:46] (03PS2) 10Jbond: C:postgresql::slave: update recovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/822342 [13:58:25] (pushing the deployment window a few minutes beyond 14:00) [13:59:06] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:17] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:01:34] (03CR) 10Jbond: [C: 03+2] C:postgresql::slave: update recovery configuration [puppet] - 10https://gerrit.wikimedia.org/r/822342 (owner: 10Jbond) [14:01:51] (03Merged) 10jenkins-bot: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: 10Bartosz Dziewoński) 
[14:02:30] awight: Something's up. Don't proceed with that patch. The default WikibaseTermboxInteraction set in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/814192/1/extension-repo.json#b607 isn't being propagated to the client [14:02:40] phuedx: okay, reverting! [14:03:05] (03PS1) 10Awight: Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822394 [14:03:13] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:03:15] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822394 (owner: 10Awight) [14:03:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:03:25] We'll try again next week! [14:04:02] (03Merged) 10jenkins-bot: Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822394 (owner: 10Awight) [14:04:16] MatmaRex: DiscussionTools patch is ready to test on mwdebug1001 [14:04:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:04:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:04:49] jouncebot: nowandnext [14:04:49] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [14:04:49] In 1 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1600) [14:04:58] looking [14:05:06] phuedx: If you don't mind, can you confirm that the event is back to normal? 
[14:05:10] (mwdebug1001) [14:05:16] Amir1: I should be done in < 10 min [14:05:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:05:22] awight: let me know once you're done, I have some evil patches to push [14:05:26] Thanks <3 [14:05:30] :-D I expect no less [14:05:37] (no less than evil ;-) [14:06:04] awight: looks good [14:06:07] ty! [14:06:15] (03CR) 10Andrew Bogott: [C: 03+2] Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [14:06:25] (03PS5) 10Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) [14:06:33] awight: I was testing on the wrong wiki. Of course... *facepalm* [14:06:50] Anyway, everything's in a good state [14:07:10] phuedx: hehe okay +1 since this is just cleanup, AFAICT, I won't de-revert. [14:07:27] MatmaRex: deploying... [14:08:13] (03CR) 10Jbond: [C: 03+2] C:ferm: allow useres to read config files, needed for nrpe [puppet] - 10https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [14:08:27] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [14:08:50] (03CR) 10CI reject: [V: 04-1] Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [14:09:04] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org [14:09:58] awight: No worries. 
I'll queue up the de-revert for next week [14:10:14] !log awight@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:822149|CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707)]] (duration: 03m 31s) [14:10:19] T314707: New topic tool and topic subscriptions don't work when reply tool is disabled and the page would have reply links - https://phabricator.wikimedia.org/T314707 [14:10:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:10:36] (03PS1) 10Phuedx: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 [14:11:10] !log EU backport window complete [14:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:12] Amir1: ^ [14:11:16] (03PS6) 10Andrew Bogott: Remove puppet refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) [14:11:20] thanks awight [14:11:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:11:25] awesome [14:11:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:11:27] My pleasure! 
[14:11:31] (03PS2) 10Ladsgroup: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) [14:11:35] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to the old templatelinks fields in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [14:12:20] (03Merged) 10jenkins-bot: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [14:12:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:13:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json [14:13:25] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:14:06] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Glrx) >>! In T265549#8144450, @tstarling wrote: >>>! In T40010#8144396, @Arthur2e5 wrote: >> I am… getting impatient enough to ask: how hard is it to, really, just make our ow... [14:15:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:15:27] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - 10https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [14:15:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [14:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:16:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:16:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:17:16] (03PS1) 10Ssingh: hiera: enable ATS9 on cp1089 [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) [14:17:17] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822375|Stop writing to the old templatelinks fields in s2 (T312865)]] (duration: 03m 25s) [14:17:21] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [14:17:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:18:12] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Glrx) >>! In T40010#7996397, @TheDJ wrote: > I would like to note that this can all easily be implemented for non-wmf wikis. If someone just spent some time on adapting SVGHan... 
[14:18:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:18:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:18:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:18:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org [14:19:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:19:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:19:30] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36698/console" [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:19:45] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org [14:20:23] (03PS1) 10Ssingh: hiera: enable ATS9 on cp1090 [puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) [14:21:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36699/console" [puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:22:19] (03PS1) 10Ssingh: hiera: enable ATS9 on cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) [14:22:23] (03PS1) 10Andrew Bogott: cloudcontrols: adjust fernet key rotation times [puppet] - 10https://gerrit.wikimedia.org/r/822385 [14:22:51] PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table ptwiki.templatelinks: Duplicate entry 6941876-0- for key tl_from, 
Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1156-bin.001819, end_log_pos 231590299 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:23:05] PROBLEM - Check systemd state on kubernetes1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36700/console" [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:23:39] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:23:39] (03PS1) 10Ssingh: hiera: enable ATS9 on cp3065 [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) [14:23:42] 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10Ottomata) I agree benthos looks really fun! I think there is a real need for easy to use stream processors. We evaluated Knative Event... 
[14:24:01] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [14:24:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36701/console" [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:25:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:57] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01266 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:27:47] * jbond looking [14:28:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org [14:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json [14:29:14] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcontrol100[34] - https://phabricator.wikimedia.org/T313268 (10Andrew) a:05Andrew→03Cmjohnson [14:30:02] (03PS2) 10Phuedx: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303) [14:30:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:31:19] (03PS2) 10Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) [14:32:02] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 132 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:32:14] jbond: did you get a page for this? Because I didn't [14:32:52] i don't see a page listed in klaxon [14:33:06] Amir1: no i have an irc highlight for that specific alert [14:33:19] aha, amazing. Thanks [14:33:26] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:33:27] :) no probs [14:33:30] PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 949.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:33:59] the s2 lag is known ^ i already mentioned it to Amir1 [14:34:08] I'm working on it [14:34:39] wanted to make sure everyone else knew :) [14:35:30] well, pinging me at middle of the debug just slows me down [14:36:53] i should have dropped the 1 or put a . 
somewhere [14:37:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:38:38] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 6 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:38:43] (03PS1) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) [14:39:25] it should be catching up now, it seems it had a drift in schema, an extra unique index and just only on ptwiki, I checked some other s2 wikis and they were fine but I need to check each one I think [14:39:35] (03PS1) 10Jdlrobson: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.24) - 10https://gerrit.wikimedia.org/r/822396 (https://phabricator.wikimedia.org/T314952) [14:39:44] RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:39:44] RECOVERY - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:39:55] (03PS1) 10Mforns: analytics:refinery:job:data_purge: Improve drop-webrequest-sequence-stats [puppet] - 10https://gerrit.wikimedia.org/r/822408 (https://phabricator.wikimedia.org/T270433) [14:40:01] Amir1: replag.toolforge.org looks caught up [14:40:51] it's not in any other wiki of s2 [14:40:58] RECOVERY - k8s requests count to the API on ml-serve-ctrl1002 is OK: (C)100 ge (W)50 ge 34.93 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [14:41:21] (03CR) 10Filippo Giunchedi: [C: 03+2] Use proxy for wikifunctions beta blackbox probe. [puppet] - 10https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [14:41:46] (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:42:21] (03PS1) 10Jbond: C:ferm: add o+x permissions to ferm directory [puppet] - 10https://gerrit.wikimedia.org/r/822409 [14:42:24] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36703/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:42:45] (03CR) 10Jbond: [C: 03+2] C:ferm: add o+x permissions to ferm directory [puppet] - 10https://gerrit.wikimedia.org/r/822409 (owner: 10Jbond) [14:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json [14:47:30] (03PS2) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) [14:48:52] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135 [14:48:57] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [14:49:40] RECOVERY - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1002 is OK: No errors 
detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:49:53] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36704/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:50:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:50:30] PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:50:37] (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:50:50] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005842 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:51:10] RECOVERY - MariaDB Replica SQL: s2 on db1155 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:52:45] (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:44] (03CR) 10FNegri: [C: 04-1] "The diff here doesn't look right https://puppet-compiler.wmflabs.org/pcc-worker1002/36704/cloudcephosd1025.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:53:50] RECOVERY - Check systemd state on kubernetes1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:18] 
(CertAlmostExpired) firing: Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:44] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:55:38] RECOVERY - Check systemd state on poolcounter1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:45] !log bking@cumin1001 running puppet agent across eqiad elastic hosts [14:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json [15:01:22] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:03:20] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:08] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:05:48] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:05:56] RECOVERY - 
Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:06:06] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:06:18] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:06:24] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:06:30] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:06:30] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:06:50] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:20] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:22] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:30] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:46] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:48] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:56] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:08:00] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:09:32] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:09:57] <3 _joe_ [15:15:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:47] (03PS3) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) [15:16:50] (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:18:16] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36705/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:20:34] RECOVERY - DNS on db1191.mgmt is OK: DNS OK: 0.012 seconds response time. 
db1191.mgmt.eqiad.wmnet returns 10.65.3.4 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:44] (03PS4) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) [15:22:46] (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:22:48] (03CR) 10David Caro: ceph: use many cluster and public networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:23:16] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:23:18] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36706/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:24:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:27:04] RECOVERY - DNS on db1193.mgmt is OK: DNS OK: 0.013 seconds response time. 
db1193.mgmt.eqiad.wmnet returns 10.65.3.9 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:28:54] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [15:29:37] (03CR) 10David Caro: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:29:59] (03CR) 10Btullis: [C: 03+2] Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [15:30:14] (03CR) 10David Caro: [V: 03+1] "The diffs look good now, only changing the two lines with the public and cloud networks configuration file (and params for those in the cl" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:31:16] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms [15:32:29] (03Merged) 10jenkins-bot: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [15:38:33] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [15:43:11] (03CR) 10FNegri: ceph: use many cluster and public networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:43:48] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:43:54] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp3065 [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:43:56] RECOVERY - DNS on db1192.mgmt is OK: DNS OK: 0.011 seconds response time. 
db1192.mgmt.eqiad.wmnet returns 10.65.3.5 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:01] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp1090 [puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:44:07] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp1089 [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:45:08] (03PS1) 10Ayounsi: Add names to flow collectors [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) [15:45:12] RECOVERY - DNS on db1187.mgmt is OK: DNS OK: 0.020 seconds response time. db1187.mgmt.eqiad.wmnet returns 10.65.3.0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:48] RECOVERY - DNS on db1185.mgmt is OK: DNS OK: 0.021 seconds response time. db1185.mgmt.eqiad.wmnet returns 10.65.2.254 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:48:45] (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrols: adjust fernet key rotation times [puppet] - 10https://gerrit.wikimedia.org/r/822385 (owner: 10Andrew Bogott) [15:49:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:50:00] (03CR) 10Ayounsi: "Local test returns:" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [15:51:34] (03CR) 10Ayounsi: "Passes junoser too: `junoser -c output/cr3-ulsfo.wikimedia.org.out`" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi) [15:53:54] RECOVERY - DNS on db1190.mgmt is OK: DNS OK: 0.012 seconds response time. 
db1190.mgmt.eqiad.wmnet returns 10.65.3.3 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:59:06] RECOVERY - DNS on db1195.mgmt is OK: DNS OK: 0.011 seconds response time. db1195.mgmt.eqiad.wmnet returns 10.65.3.12 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:05] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:53] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) T308677 shows an example where the installer destroys a filesystem. [16:05:52] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) Also related is that following T309027, all the SSDs on ms-* reliably appear as non-rotational, so could in theory be... [16:10:19] TheresNoTime: your new message got used ^ [16:10:40] (03PS1) 10Giuseppe Lavagetto: Add stub data for profile::vopsbot [labs/private] - 10https://gerrit.wikimedia.org/r/822417 (https://phabricator.wikimedia.org/T314840) [16:12:40] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic1100 [16:12:42] RECOVERY - DNS on db1189.mgmt is OK: DNS OK: 0.012 seconds response time. db1189.mgmt.eqiad.wmnet returns 10.65.3.2 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:13:49] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet [16:14:04] RECOVERY - DNS on db1186.mgmt is OK: DNS OK: 0.013 seconds response time. 
db1186.mgmt.eqiad.wmnet returns 10.65.2.255 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:14:50] RECOVERY - Host ps1-c8-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms [16:15:04] RECOVERY - DNS on db1194.mgmt is OK: DNS OK: 0.016 seconds response time. db1194.mgmt.eqiad.wmnet returns 10.65.3.10 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:34] (03CR) 10David Caro: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:17:39] 10SRE, 10Observability-Logging, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [16:22:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) [16:22:58] (03Abandoned) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 (owner: 10Thiemo Kreuz (WMDE)) [16:26:15] !log bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues [16:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:47] (03PS1) 10Cwhite: tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) [16:27:49] (03PS1) 10Cwhite: tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) [16:28:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [16:29:38] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810 [16:29:41] T309810: Service implementation for 
elastic1[084-102].eqiad.wmnet - https://phabricator.wikimedia.org/T309810 [16:29:52] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810 [16:30:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002 [16:30:24] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [16:33:01] (03PS1) 10Cwhite: tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) [16:35:29] !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002 [16:35:33] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [16:38:36] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:44:02] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [16:45:18] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:45:23] (03PS5) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) [16:50:50] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:55:01] (03CR) 10Ottomata: [C: 03+2] analytics:refinery:job:data_purge: Improve drop-webrequest-sequence-stats [puppet] - 10https://gerrit.wikimedia.org/r/822408 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [17:00:05] bd808: My dear minions, it's time we take the moon! Just kidding. 
Time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1700). [17:00:35] (03PS3) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [17:01:54] * bd808 checks for things to deploy [17:02:48] (03PS3) 10Andrew Bogott: openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) [17:04:21] meh. not worth a deploy for the amount of new translations for dev portal. [17:08:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [17:13:20] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) 05Open→03Resolved @fgiunchedi it was a cable issue. Now fixed [17:14:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:15:34] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet [17:18:25] !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet [17:19:00] !log bking@cumin1001 conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet [17:19:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:21:47] !bash Krinkle: 
when in doubt, add another index [17:21:47] Amir1: Stored quip at https://bash.toolforge.org/quip/8I3tjYIBa_6PSCT9Ln_v [17:22:36] hehe [17:22:54] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp1089 [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:23:20] * Krinkle now knows what it feels like to be quoted out of context [17:26:56] (03PS2) 10Cwhite: tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) [17:27:55] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul) [17:28:00] !log testing ATS 9.1.3-1wm1 on cp1089: T309651 [17:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:04] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [17:28:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) (owner: 10Andrew Bogott) [17:31:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp3065 [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:33:08] !log testing ATS 9.1.3-1wm1 on cp3065: T309651 [17:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:12] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [17:34:25] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host netmon2002 [17:35:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002 [17:36:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:36:37] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp1090 
[puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:38:44] !log testing ATS 9.1.3-1wm1 on cp1090: T309651 [17:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:48] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [17:40:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED [17:41:53] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [17:44:38] (03PS1) 10Majavah: Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) [17:45:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10BCornwall) 05In progress→03Stalled @MRaishWMF as asked, are you just wanting us to add your SSH key to your account? Seeing as you're already part o... [17:46:53] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) 05In progress→03Stalled Hi, @soworu, are you still wanting this access? 
If so, it'd be useful to answer the questions posed by @Ottomata and @Vgutierrez [17:46:59] !log testing ATS 9.1.3-1wm1 on cp3064: T3096515 [17:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED [17:51:50] (03CR) 10Ladsgroup: [C: 03+1] Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah) [17:52:05] jouncebot: nowandnext [17:52:05] For the next 0 hour(s) and 7 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1700) [17:52:05] In 2 hour(s) and 7 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T2000) [17:52:26] !log testing ATS 9.1.3-1wm1 on cp3064: T309651 [17:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:30] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [17:52:46] i'm quickly deploying a mw patch [17:52:57] (03CR) 10Majavah: [C: 03+2] Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah) [17:53:40] (03Merged) 10jenkins-bot: Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah) [17:55:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:57:50] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:57:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:57:54] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10Ottomata) Are there actionables on this task? I'm considering re... [17:58:37] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428|Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s) [17:58:42] T310795: Revive Labtestwikitech (formerly: Abolish labtestwikitech) - https://phabricator.wikimedia.org/T310795 [17:58:43] * taavi done [17:58:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:59:07] 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking) [18:00:21] 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking) [18:01:03] 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking) [18:02:18] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:33] 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) p:05Triage→03Medium [18:04:15] 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) 05Open→03Resolved a:03BCornwall Hi, @mfossati, you've 
been given access so I'm going to close this ticket. Feel free to reopen if the issue isn't solved! [18:04:31] 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) a:05BCornwall→03Gehel [18:06:40] 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) 05Open→03Resolved Thanks for handling this! Since access has been granted, I'm going to close this ticket. Feel free to re-open if there's more... [18:07:24] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10herron) 05Resolved→03Open Hey @Papaul unfortunately I'm still seeing timeouts when connecting to this host: ` --- logstash2003.mgmt.codfw.wmnet ping statistics --- 3 packet... [18:15:53] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) @herron the first issue was "The host hasn't come back and I can't reach its mgmt " for the timeout issue i will check the firmware version if it is old i will upgrade... [18:16:42] 10SRE, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10RKemper) [18:17:11] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul) [18:19:19] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10jcrespo) @Ottomata: The actionables of the task pending is to und... 
[18:20:26] 10ops-codfw, 10Gerrit, 10decommission-hardware, 10serviceops-radar, 10Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (10Dzahn) [18:20:50] 10ops-codfw, 10Gerrit, 10decommission-hardware, 10serviceops-radar, 10Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (10Dzahn) [18:25:26] 10ops-codfw, 10Gerrit, 10decommission-hardware, 10serviceops-radar, 10Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (10Dzahn) a:03Papaul @Papaul This is WMF6408 in rack D5 U11. The decom cookbook has fi... [18:26:01] 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10herron) Thank you, although re: the first issue I still cannot reach the mgmt, or the host interface of logstash2003. Ssh and ping both time out, and the host is flagged as dow... [18:30:08] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) This is now handed over to dcops for physical decom steps and continues still at T315040. 
[18:32:34] (03CR) 10Dzahn: [C: 03+1] "lgtm, compiler output and overall" [puppet] - 10https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [18:33:18] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:35:45] (03PS2) 10Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:36:19] (03CR) 10Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:36:29] (03PS3) 10Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:37:25] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "NOT the prod key but a real key" [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:38:58] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "@dduvall see how I adjusted the key comment, i think keyholder relies on the comment string" [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:44:09] (03PS6) 10Dzahn: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [18:45:41] (03PS1) 10Andrew Bogott: clouddb2002-dev: make a db node [puppet] - 10https://gerrit.wikimedia.org/r/822432 (https://phabricator.wikimedia.org/T306854) [18:46:59] (03CR) 10Majavah: "um, what's the use case of this?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:48:28] (03CR) 10Andrew Bogott: [C: 03+2] clouddb2002-dev: make a db node [puppet] - 10https://gerrit.wikimedia.org/r/822432 (https://phabricator.wikimedia.org/T306854) (owner: 10Andrew Bogott) [18:48:37] (03CR) 10Ottomata: "We need the .deb to be installable first, in order to use this?" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [18:50:04] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "deploying within a cloud vps project from a local deployment server to the test instance" [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:50:19] 10SRE, 10ops-codfw, 10Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (10ssingh) [18:50:49] 10SRE, 10ops-codfw, 10Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (10ssingh) p:05Triage→03Medium [18:52:40] (03CR) 10Majavah: scap: Provide a working SSH key pair for the scap keyholder agent (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: 10Dduvall) [18:53:00] (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:53:54] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [18:57:13] (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:58:50] (03CR) 10Dzahn: "The part I don't understand yet is why we remove the entire "phabricator::redirector" and "file {"${phabdir}/robots.txt". Is that really i" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:02:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) Hi @BCornwall, sorry for the delay and thanks for the ping. Yes, I had intended to add an SSH key to my account to facilitate some analytics... [19:06:40] 10SRE, 10Traffic, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10Gehel) [19:06:51] 10SRE, 10Traffic, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10Gehel) We check the ferm rules, which seem to open those ports as expected. I suspect there is something going on at a lower networ... [19:11:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10BCornwall) 05Stalled→03In progress @MRaishWMF Thanks for replying! Please note that we follow the [[ https://en.wikipedia.org/wiki/Principle_of_leas... [19:11:28] (03CR) 10Dzahn: "I understand better now after looking at define phabricator::redirector. Because those all write into the phab conf dir. Compiling it.." 
[puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:12:20] (03CR) 10Cwhite: [C: 03+2] logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [19:12:45] (JobUnavailable) resolved: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:33] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36708/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:14:35] (03PS1) 10Papaul: Add new PDU model for ps1-c8 [puppet] - 10https://gerrit.wikimedia.org/r/822436 (https://phabricator.wikimedia.org/T310145) [19:16:04] (03PS2) 10BCornwall: admin: Add SSH key to mraish user [puppet] - 10https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: 10Vgutierrez) [19:16:25] (03CR) 10Papaul: [C: 03+2] Add new PDU model for ps1-c8 [puppet] - 10https://gerrit.wikimedia.org/r/822436 (https://phabricator.wikimedia.org/T310145) (owner: 10Papaul) [19:17:50] (Device rebooted) firing: Alert for device ps1-c8-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:18:18] (03PS3) 10BCornwall: admin: Add SSH key to mraish user [puppet] - 10https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: 10Vgutierrez) [19:19:50] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DBA, 10Patch-For-Review: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) [19:20:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) 
for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002 [19:20:55] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [19:22:10] 10SRE, 10Infrastructure-Foundations, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10RKemper) [19:22:50] (Device rebooted) resolved: Device ps1-c8-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [19:26:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:28:34] RECOVERY - Host cp2042 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [19:28:48] ^ mutante :D [19:30:19] sukhe: :) good old powercycle fixes it [19:30:32] but yea.. they are a bit mysterious then [19:30:54] I recall those cases and then there was nothing in syslog.. just ..it was doing things..and then it got rebooted [19:30:59] yeah... I am still curious why it happened at all [19:31:01] 10SRE, 10ops-codfw, 10Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (10ssingh) 05Open→03Resolved a:03ssingh Thanks to recommendations by @Dzahn, I did the following: ` racadm serveraction powercycle ` This... 
[19:31:10] but this is around the time of the PDU upgrade, so I am guessing something because of that [19:31:41] so the issue is still that you could not use IPMI /ipmitool , right [19:31:50] this seems like it needs DRAC reset [19:32:01] there are like 3 levels, soft, hard and factory reset afaik [19:32:05] I could use it but I got the weird message I shared above [19:32:12] soft and hard you can do without resetting the password [19:32:16] but yeah, it didn't work if that's what you meant but clearly it did connect (?) [19:32:47] now that the host is back you can do this https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_locally? [19:33:02] it is "does IPMI work locally" and then "does it work remotely" [19:33:22] and then since you are directly on the DRAC via SSH, you can reset the DRAC and it might fix IPMI [19:33:49] IPMI seems fine [19:33:49] sukhe@cp2042:~$ sudo ipmi-chassis --get-chassis-status [19:33:50] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:33:53] System Power : on [19:33:54] Power overload : false [19:33:54] Interlock : inactive [19:34:04] ok, that's good [19:35:00] sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status [19:35:03] from a remote host [19:35:07] cumin host? [19:35:12] yep [19:35:47] following the recommendation just above, https://wikitech.wikimedia.org/wiki/Management_Interfaces#How_to_execute_remote_IPMI_commands [19:35:54] (03PS1) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [19:36:08] well, if it all works it's just one of those cases where we powercycle and it's back like nothing happened.. 
tag it "fluke" [19:36:27] (03PS1) 10Esanders: Enable DiscussionTools visual enhancements as beta everywhere except en/de/jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822440 (https://phabricator.wikimedia.org/T312672) [19:36:56] (03CR) 10CI reject: [V: 04-1] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [19:36:58] if it happens more than once.. it would go back to Dell and they might ask to upgrade firmware :p [19:38:07] there's probably a couple tickets for cp hosts doing this but not often enough [19:41:18] (03PS2) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [19:41:23] !log disabling puppet on C:profile::phabricator::main [19:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [19:44:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10ayounsi) [19:44:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10ayounsi) p:05Triage→03High [19:46:00] mutante: yeah I am going to ascribe it to a one-off or PDU upgrade for now and if it happens again, we will see :) [19:46:22] for the cp hosts: mostly it's DIMM errors that racadm reports but this one is pretty new, at least for me [19:48:11] (03PS1) 10Dzahn: phabricator: add /etc/phabricator/config.yaml for scap [puppet] - 
10https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950) [19:48:43] sukhe: yea, agreed. we have had both types before afair [19:49:08] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:49:09] (03CR) 10CI reject: [V: 04-1] phabricator: add /etc/phabricator/config.yaml for scap [puppet] - 10https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950) (owner: 10Dzahn) [19:51:39] (03CR) 10Dzahn: [C: 03+2] "disabled puppet on prod phab hosts and testing on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [19:55:13] (03CR) 10Dzahn: [C: 03+2] "I am concerned about the change to scap::target that affects a lot more hosts than just phabricator hosts and it wasn't compiled on those." [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:00:05] brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T2000). [20:00:05] koi and Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
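A minimal Python sketch of the local-then-remote IPMI check discussed above (19:32 to 19:35). The two command lines and the "$HOST.mgmt.$DC.wmnet" naming pattern are taken from the log. The helper names (mgmt_fqdn, local_status_cmd, remote_power_status_cmd, run) are hypothetical and exist only for illustration. The sketch defaults to a dry run, so it only prints what would be executed.

```python
import shlex
import subprocess
from typing import List


def mgmt_fqdn(host: str, dc: str) -> str:
    """Build the management-interface name using the "$HOST.mgmt.$DC.wmnet" pattern from the log."""
    return f"{host}.mgmt.{dc}.wmnet"


def local_status_cmd() -> List[str]:
    """Step 1, run on the host itself: does IPMI work locally?"""
    return ["sudo", "ipmi-chassis", "--get-chassis-status"]


def remote_power_status_cmd(host: str, dc: str) -> List[str]:
    """Step 2, run from a remote (cumin) host: does IPMI work remotely?"""
    return [
        "sudo", "ipmitool", "-I", "lanplus",
        "-H", mgmt_fqdn(host, dc),
        # -E makes ipmitool read the password from the IPMI_PASSWORD environment variable
        "-U", "root", "-E",
        "chassis", "power", "status",
    ]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Return the shell-quoted command on a dry run; otherwise execute it and return stdout."""
    if dry_run:
        return shlex.join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical walk-through for the host from the conversation.
    print(run(local_status_cmd()))
    print(run(remote_power_status_cmd("cp2042", "codfw")))
```

Calling run(..., dry_run=False) would actually execute the command, so that only makes sense on a host where the tool is installed and, for the remote check, where IPMI_PASSWORD is set. If the local check passes but the remote one fails, the conversation above suggests a DRAC reset as the next step.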
[20:00:14] o/ present [20:00:18] o/ I will be your brennen today [20:02:01] Jdlrobson: so since there's no train this week, wmf.24 isn't live, wmf.23 is (and then we'll deploy wmf.25, confusingly) so I'm going to tweak your backport to point to wmf.23 [20:02:23] (03CR) 10Dzahn: [C: 03+2] "noop confirmed on a couple other scap::target hosts in prod (gerrit,webperf,mwdebug,..)" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:02:28] (03PS1) 10Thcipriani: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952) [20:02:50] (03CR) 10Thcipriani: [C: 03+2] Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952) (owner: 10Thcipriani) [20:02:51] oh right it should be wmf23 [20:02:57] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) The above patch uses the new puppet facts to define vlan sub-interface and bridge relations as described in... [20:03:09] thcipriani: thanks for noticing that :) [20:03:14] cool, just got to wait for jenkins :) [20:03:24] (03CR) 10Dzahn: [C: 03+2] "well, it's NOT actually noop everywhere. this is what I had in mind with my previous concern:" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:03:40] koi: ping for backport if you're around [20:09:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) @BCornwall thanks again. 
I anticipate needing to SSH into stat machines in order to access Jupyter Lab and run spark queries. I'll update the... [20:10:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) [20:11:22] thcipriani: do i need to backport to wmf24 too? [20:11:37] (03PS1) 10Cwhite: logstash: replace legacy routing filters [puppet] - 10https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T314139) [20:11:39] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Schniggendiller) Also on deWP: https://d... [20:13:28] Jdlrobson: nah, it'll never get deployed [20:14:15] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [20:14:18] We cut the branch every week regardless of whether we cancel train because the automation is "simpler" [20:14:40] ack [20:20:20] (03CR) 10Dzahn: [C: 03+2] "checked on phab2001 next. 
this is all it does:" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:20:31] (03Merged) 10jenkins-bot: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952) (owner: 10Thcipriani) [20:21:17] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10herron) [20:21:47] Jdlrobson: live on mwdebug1002, check please [20:22:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:45] thcipriani: Should we change that? (unconditional branch cut) [20:23:22] I have no strong feelings about it [20:23:33] thcipriani: almost done [20:23:52] 👍🏾 [20:23:52] cool, thanks for testing :) [20:23:53] !log merging change on prod phabricator host to allow scap deployment, part 1 [20:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:39] (03CR) 10Dzahn: [C: 03+2] "deployed in prod. same as above on phab1001. 
puppet re-enabled" [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:25:11] LGTM thcipriani please sync [20:25:51] going live now [20:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:26] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:49] !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: [[gerrit:822396|Do not show incompatible skin warning when page is not editable (T314952)]] (duration: 03m 16s) [20:29:53] T314952: Misleading message shows in skins where VE is compatible but the page because of its state isn't - https://phabricator.wikimedia.org/T314952 [20:29:59] (03Abandoned) 10Dzahn: phabricator: add /etc/phabricator/config.yaml for scap [puppet] - 10https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950) (owner: 10Dzahn) [20:30:00] ^ Jdlrobson should be live [20:30:05] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Krinkle) [20:30:22] koi: last ping for UTC late backport [20:30:26] thanks thcipriani will monitor the logs. Hoping to see some results there. [20:30:32] I appreciate your help! 
[20:30:46] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:57] Jdlrobson: anytime, thanks for testing, and shepherding the patch! [20:31:12] (03PS6) 10Dzahn: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche) [20:31:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:56] 10SRE: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (10Dzahn) @jbond per request from IRC [20:36:18] 10SRE: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (10Dzahn) p:05Triage→03Low [20:36:46] sorry about that, here's me [20:37:33] (03PS4) 10Thcipriani: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:37:50] koi: o/ ready to backport some patches? [20:38:01] yeah, sure! [20:38:54] (03CR) 10Thcipriani: [C: 03+2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:39:57] (03Merged) 10jenkins-bot: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:40:36] ^ koi looks like a noop, correct? 
[20:41:10] thcipriani: yeah, the first patch is a noop [20:41:13] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Krinkle) Note that unlike most other packages, there is an especially sensitive dependency on the behaviour of librsvg which is the component of Thumbor responsible for converting SVGs... [20:41:28] k, syncing independently for completeness [20:42:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:53] (03PS12) 10Thcipriani: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:43:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:44:45] (03CR) 10Thcipriani: [C: 03+2] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:44:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:45:36] (03Merged) 10jenkins-bot: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [20:46:01] Actually this one is a noop too :) [20:46:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (10Dzahn) @Novem_Linguae While we are still thinking about a better fix, there is at least one work around using WSL on Wind... 
[20:46:18] looks like most of them *should be* noops :) [20:46:27] but I'll still pull down and let you verify [20:47:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (10ayounsi) I had a quick look and can't find any smoking gun so far. The issue seems to be related to... [20:47:24] php-fpm restart is taking a moment... [20:47:30] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806944|Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 07s) [20:47:35] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:47:35] T305692: Support language fallback for logo variants - https://phabricator.wikimedia.org/T305692 [20:48:21] koi: live on mwdebug1002 --- everything still looking good there? [20:49:58] thcipriani: visit zhwiki's main page with different variant and nothing wrong happened, I think we could move on [20:49:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:50:10] koi: ok, syncing, thanks for checking [20:50:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:51:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:51:06] (03PS2) 10Thcipriani: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:51:09] (03CR) 10Thcipriani: [C: 03+2] zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:51:42] (03Abandoned) 10Jdlrobson: Do not show incompatible skin warning when page is not editable 
[extensions/VisualEditor] (wmf/1.39.0-wmf.24) - 10https://gerrit.wikimedia.org/r/822396 (https://phabricator.wikimedia.org/T314952) (owner: 10Jdlrobson) [20:52:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:52:02] (03Merged) 10jenkins-bot: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:53:34] koi: oh, scap won't let me sync it: Notice: Undefined variable: wmgSiteLogoVariantFallback in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 1057 Notice: Undefined variable: wmgSiteLogoVariantFallback in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 1062 [20:54:35] I wonder if default null vs default false is the cause of ^ [20:54:50] checking [20:55:13] this is from: mwscript eval.php --wiki aawiki '' [20:55:27] scap runs that as a quick check pre-sync [20:56:04] so, If I like to modify the default value, which is inside the first patch, what should I do now [20:57:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:57:18] I can revert this and we can merge that one [20:57:35] if you've got a patch ready [20:57:40] note re: logspam-watch: ~/bin/brennen/logspam & ~/bin/brennen/logspam-watch are fixed [20:57:41] thcipriani: fyi beta scap failed too with same reason [20:57:46] otherwise we can revert and try again another day [20:57:54] to be clear, revert all merged patches [20:58:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:58:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:58:08] koi: right, if you want to try again another day [20:58:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:58:46] ok, I'll try to come up with a patch [20:58:55] urgh that page can't be good [20:58:59] thcipriani: ^ [20:59:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:59:22] hey [20:59:33] is there a known correlation to something deploy related above? [20:59:34] We are mid deployment as an fyi [20:59:45] ok, looking as well [20:59:58] bblack: unclear, just deployed something, reverting (doubtful it's related, but reverting anyway) [21:00:03] I can access meta here [21:00:07] thcipriani: ack, thank you! [21:00:20] * Krinkle said something about re-ordering the patches at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799415/12#message-31cd0f8206f0fa7af1b692523d362bfc85c4bc06 [21:00:31] glad we caught it before flooding logstash [21:01:18] thcipriani: I guess now that we have atomic deploys through fpm restarts, maybe syncing both at once would work.. e.g. over wmf-config/ as a whole. [21:01:29] not tried before, at your risk :) [21:01:41] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (10Dzahn) mailman3 upstream docs at https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/templates.htm... [21:01:45] I believe the merged state is without this error notice right? 
[21:02:10] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (10Dzahn) [21:02:24] Krinkle: we only got the first patch in this series out, should be a noop [21:02:29] scap caught the rest [21:02:40] first being: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806944 [21:03:09] (although 2 more are currently merged) [21:03:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [21:04:16] bblack: ^ the graph looks like a very temp drop. Is it safe for things to carry on as normal? [21:04:35] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: revert [[gerrit:806944|Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 15s) [21:04:39] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [21:04:40] T305692: Support language fallback for logo variants - https://phabricator.wikimedia.org/T305692 [21:05:25] koi: we've run over the window. I'm going to merge my reverts and let's try this again another day. [21:06:33] (03PS1) 10Thcipriani: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822403 [21:06:48] (03CR) 10CI reject: [V: 04-1] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822403 (owner: 10Thcipriani) [21:07:16] thcipriani: got it, will have another patch some other day [21:07:18] (03PS1) 10Thcipriani: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822404 [21:07:39] koi: thanks and sorry. 
[21:07:52] (03PS1) 10Thcipriani: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822405 [21:08:29] (03CR) 10Thcipriani: [C: 03+2] Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822404 (owner: 10Thcipriani) [21:09:05] re: the paging alert, I don't *think* the deploy was related. Can't be certain, though. [21:09:11] (03Merged) 10jenkins-bot: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822404 (owner: 10Thcipriani) [21:09:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:09:20] i doubt it was tbh [21:09:36] unless it was only unavailable from the canaries [21:09:53] Why does that certAlmostExpired go here [21:10:04] bblack: thanks for that update. Reverting because we were squeezing things in at the end of the window, and there were different minor errors. 
[21:14:18] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:14:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:15:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:15:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:16:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:17:42] (03CR) 10Thcipriani: [C: 03+2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822405 (owner: 10Thcipriani) [21:18:30] (03Merged) 10jenkins-bot: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822405 (owner: 10Thcipriani) [21:19:01] thcipriani: beta CI has passed now after failing with the same issue as the prod sync [21:19:38] (03PS2) 10Thcipriani: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822403 [21:20:33] RhinosF1: proof of a production-like beta :) [21:20:56] (03CR) 10Thcipriani: [C: 03+2] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822403 (owner: 10Thcipriani) [21:21:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:21:37] thcipriani: yep! [21:21:41] (03Merged) 10jenkins-bot: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822403 (owner: 10Thcipriani) [21:21:46] Beta has had a good run at being broken today though! 
[21:22:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:22:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:23:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:24:43] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) > My goal is to proceed and update the automation to set switch interface access/trunk and allowed vlans onc... [21:28:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:29:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:29:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:30:24] ok, merged state matches deployed state once again [21:30:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:39:10] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:50:10] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36714/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche) [21:50:59] (03PS2) 10Cwhite: logstash: replace legacy routing filters [puppet] - 10https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T305175) [21:51:01] (03PS1) 10Cwhite: logstash: use logstash routing for w3creportingapi stream [puppet] - 10https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175) [21:51:03] (03PS1) 10Cwhite: logstash: add target index validation to tests [puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) [21:51:05] (03PS1) 10Cwhite: logstash: update production w3creportingapi guard 
condition [puppet] - 10https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) [21:52:05] (03CR) 10Dzahn: [C: 03+2] phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche) [21:52:06] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [21:54:40] (03CR) 10Dzahn: [C: 03+2] "this added a line for the "www" user in /etc/phabricator/config.yaml and otherwise was noop on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: 10Jaime Nuche) [21:55:20] (03PS1) 10Brennen Bearnes: logspam: handle higher-resolution timestamps [puppet] - 10https://gerrit.wikimedia.org/r/822453 [21:56:05] (03CR) 10CI reject: [V: 04-1] logstash: replace legacy routing filters [puppet] - 10https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [21:57:01] (03CR) 10CI reject: [V: 04-1] logstash: add target index validation to tests [puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [21:58:53] (03CR) 10Dzahn: [C: 03+2] phabricator: Change local.json group to www-data / world readable [puppet] - 10https://gerrit.wikimedia.org/r/820779 (https://phabricator.wikimedia.org/T313950) (owner: 10Brennen Bearnes) [22:02:34] (03PS1) 10Dzahn: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - 10https://gerrit.wikimedia.org/r/822466 [22:04:39] (03CR) 10Dzahn: "should this change be made on the local puppetmaster instead? but then don't we always have cherry-picks?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/822466 (owner: 10Dzahn) [22:15:43] (03PS1) 10BBlack: Add wikifunctions to MW canonical redirects [puppet] - 10https://gerrit.wikimedia.org/r/822455 (https://phabricator.wikimedia.org/T275904) [22:17:58] (03PS1) 10Cwhite: logstash: do not overwrite partition in routing [puppet] - 10https://gerrit.wikimedia.org/r/822456 (https://phabricator.wikimedia.org/T314139) [22:23:48] (03CR) 10Cwhite: [C: 03+2] logstash: do not overwrite partition in routing [puppet] - 10https://gerrit.wikimedia.org/r/822456 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [22:27:11] 10SRE-Access-Requests: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (10Dzahn) [22:27:27] (03PS2) 10Dzahn: admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) [22:28:18] (03PS3) 10Dzahn: admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) [22:28:44] (03PS4) 10Dzahn: admin: add demon to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048) [22:29:37] mutante: can i get a quick +2 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453 ? [22:29:49] minor regex change for deployer log monitoring [22:30:23] (worst case it's already broken) [22:31:08] ok, yes, I recall this file [22:31:55] (03CR) 10Dzahn: [C: 03+2] logspam: handle higher-resolution timestamps [puppet] - 10https://gerrit.wikimedia.org/r/822453 (owner: 10Brennen Bearnes) [22:31:59] thanks! [22:33:24] it affects mwlog1002/mwlog2002. change has been applied.. now. (ran puppet) [22:35:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (10Dzahn) @thcipriani You are group approver for this shell group. 
[22:37:39] confirmed working; thx again. [22:41:12] :) laters then [22:49:32] (03PS2) 10Cwhite: logstash: update production w3creportingapi guard condition [puppet] - 10https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175) [22:53:00] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:00:33] 10SRE: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10Legoktm) [23:00:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (10Legoktm) [23:01:27] 10SRE: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (10Legoktm) Sorry, this slipped off my radar to work on. The proper fix I had planned is to deploy the mailman-templates Debian package (https://gerrit.wikimedia.org/g/ope... [23:02:32] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:04] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [23:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:36:50] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (10Glrx) I do not know PHP or Python, but here are the changes needed to wiki configuration, SVGHandler.php, and Thumbor's svg.py. * https://commons.wikimedia.org/wiki/User:Glrx... [23:39:27] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Glrx) >>! In T265549#8147272, @Glrx wrote: > I do not know PHP or Python, but here are the changes needed to wiki configuration, SVGHandler.php, and Thumbor's svg.py. > > * https://com...