[00:02:01] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:23] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:30:23] <wikibugs>	 (PS3) Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692)
[00:31:35] <wikibugs>	 (PS11) Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692)
[00:32:27] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:08] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul)
[00:33:31] <wikibugs>	 SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[00:34:33] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Papaul) Open→Resolved Row D maintenance complete
[00:35:19] <wikibugs>	 SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[00:36:25] <wikibugs>	 SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (Papaul)
[00:37:02] <wikibugs>	 (PS1) Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620)
[00:39:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-drop-webrequest-sequence-stats-partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:15] <wikibugs>	 SRE-OnFire, SRE Observability (FY2022/2023-Q1): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (lmata) >In T309033#8140576, @herron wrote: > Please see https://phabricator.wikimedia.org/T313229#8130640  Thank you!
[00:41:47] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:35] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:05] <wikibugs>	 (PS1) Andrea Denisse: netmon: Create the OpenSSH directory inside the rancid home directory [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936)
[00:57:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
[00:57:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2042.codfw.wmnet with reason: host down; depooled and will debug tomorrow
[00:58:01] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-tls
[00:58:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=ats-be
[00:58:09] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2042.codfw.wmnet,service=varnish-fe
[01:02:51] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:50] <wikibugs>	 (PS1) Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/822197 (https://phabricator.wikimedia.org/T308620)
[01:08:26] <wikibugs>	 (Abandoned) Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/800793 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[01:12:03] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:15:51] <wikibugs>	 (CR) Andrea Denisse: "Hello team, here are the PCC results for this patch: https://puppet-compiler.wmflabs.org/pcc-worker1001/36691/" [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[01:19:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Hello team, after further testing it the least disruptive and simplest approach is to create the `.ssh` directory using Puppet.  It nee...
[01:19:23] <logmsgbot>	 !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore,name=codfw
[01:19:58] <logmsgbot>	 !log tstarling@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=(appservers|api)-ro,name=codfw
[01:32:57] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:01] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:34:10] <wikibugs>	 (CR) Tim Starling: [C: +2] Microsecond timestamp resolution in UDP logs [mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling)
[01:34:54] <wikibugs>	 (Merged) jenkins-bot: Microsecond timestamp resolution in UDP logs [mediawiki-config] - https://gerrit.wikimedia.org/r/820904 (owner: Tim Starling)
[01:37:25] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:48] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/logging.php: (no justification provided) (duration: 03m 25s)
[01:38:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:39:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:39:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:40:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:42:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:07] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:29] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:05] <wikibugs>	 (PS13) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:53] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:25] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:29] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:51] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:15:58] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:11] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:26:07] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:41] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:01] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:49] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:35] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:34:17] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:40:57] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:12] <wikibugs>	 (PS12) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[02:46:23] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:57:00] <Tamzin>	 hello :)
[02:57:02] <Tamzin>	 upstream connect error or disconnect/reset before headers. reset reason: overflow
[02:57:10] <Tks4Fish>	 I was wondering if it was just me
[02:57:19] <Tks4Fish>	 went away now though
[02:57:25] <Tamzin>	 same
[02:57:35] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:57:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[02:58:12] <rzl>	 hey, looking
[02:58:18] <jinxer-wm>	 (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:48] <jhathaway>	 online as well
[03:00:33] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:09] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:02:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:03:18] <jinxer-wm>	 (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:33] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:17] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:55] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:31] <wikibugs>	 SRE, MediaWiki-General, Traffic-Icebox, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (ori) I implemented option 3, and created T314868 for tracking the roll-out.
[03:26:27] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:45] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:21] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:04] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I did a pywikibot edit on testwiki from my Dallas test instance. The time between the completion of the last codfw sessionstore write and the eqia...
[03:34:32] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[03:35:42] <wikibugs>	 (CR) Mary Yang: Use proxy for wikifunctions beta blackbox probe. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[03:41:49] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:44:53] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:11] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:27] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (ayounsi) Looks like permission issues: `name=netmon1003 ayounsi@...
[03:51:03] <cwhite>	 !log chown librenms /srv/librenms/rrd/* on netmon1003 T314972
[03:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:07] <stashbot>	 T314972: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972
[03:52:31] <denisse|m>	 Thanks cwhite, I think we should run the 'chown' recursively. Looking at the puppet repo to send a patch that fixes this.
[03:53:27] <cwhite>	 denisse|m: good catch
[03:53:35] <cwhite>	 reran with -R
[03:55:17] <denisse|m>	 !log chown -R librenms /srv/librenms/rrd/ on netmon1003 T314972
[03:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
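A note on the two `chown` runs above: with `/srv/librenms/rrd/*` the shell expands the glob, so only the top-level entries change owner, while `-R` also descends into subdirectories. A minimal sketch for checking that nothing under a tree was missed, assuming a POSIX host; the function name `find_wrong_owner` is made up for illustration and is not part of LibreNMS or the puppet repo:

```python
import os


def find_wrong_owner(root, expected_uid):
    """Return every path at or below root whose owner uid is not expected_uid."""
    mismatched = []
    # Check the top-level directory itself first.
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself rather than a symlink target.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

For example, `find_wrong_owner("/srv/librenms/rrd", pwd.getpwnam("librenms").pw_uid)` (with `import pwd`) would list whatever a non-recursive run left behind; an empty list means the whole tree has the expected owner.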
[04:01:18] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/821698 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[04:02:58] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:24] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Add the netmon1003 host to the alertmanager API rw [puppet] - https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[04:07:57] <wikibugs>	 (CR) Cwhite: [C: +1] netmon: Use netmon1003's IP address for the librenms endpoint [puppet] - https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[04:11:29] <denisse|m>	 Looking at the code, that folder belongs to the 'www-data' user. https://github.com/wikimedia/puppet/blob/production/modules/librenms/manifests/init.pp#L107
[04:11:29] <denisse|m>	 So I guess it's overridden to 'deploy-librenms' during the 'rsync::quickdatacopy' process...
[04:12:19] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:01] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:57] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:21] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:17] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:58] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM on the overall logic (and PCC)." [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[04:18:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (ayounsi) > @ayounsi do you anticipate any fallout from this?  I agree that it's better to check host keys, so +1 as long as: * there is some kind of al...
[04:26:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:33:33] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:15] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:59] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:59] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:02] <wikibugs>	 (PS1) Andrea Denisse: netmon: Set correct owner for the LibreNMS rrd directory. [puppet] - https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972)
[04:53:48] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse) I think that the owner is overridden to '`deploy-librenms`' during the [[ ht...
[04:55:25] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:25] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36692/" [puppet] - https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: Andrea Denisse)
[05:00:05] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:11] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:15] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:05] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:15:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:18:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
[05:18:49] <stashbot>	 T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[05:19:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T314368
[05:19:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1122 with weight 0 T314368', diff saved to https://phabricator.wikimedia.org/P32349 and previous config saved to /var/cache/conftool/dbconfig/20220811-051913-ladsgroup.json
[05:22:05] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:32:27] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:39:14] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1122 to s2 master [puppet] - https://gerrit.wikimedia.org/r/819525 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[05:39:19] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1122 to s2 master [puppet] - https://gerrit.wikimedia.org/r/819525 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[05:39:22] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (tstarling) >>! In T40010#8144396, @Arthur2e5 wrote: > I am… getting impatient enough to ask: how hard is it to, really, just make our own statically-compiled rsvg-convert bina...
[05:41:57] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:57] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:17] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:13] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:33] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T0600). Please do the needful.
[06:00:14] <Amir1>	 o/
[06:00:15] <Amir1>	 let's go
[06:00:19] <Amir1>	 !log Starting s2 eqiad failover from db1162 to db1122 - T314368
[06:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:23] <stashbot>	 T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[06:00:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T314368', diff saved to https://phabricator.wikimedia.org/P32350 and previous config saved to /var/cache/conftool/dbconfig/20220811-060042-ladsgroup.json
[06:01:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1122 to s2 primary and set section read-write T314368', diff saved to https://phabricator.wikimedia.org/P32351 and previous config saved to /var/cache/conftool/dbconfig/20220811-060113-ladsgroup.json
[06:01:15] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:02:19] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:03] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:32] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/819546 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[06:04:05] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/819546 (https://phabricator.wikimedia.org/T314368) (owner: Gerrit maintenance bot)
[06:06:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1162 (T314368 T298555 T312863 T310011 T309311 T60674 T298560 T303603 T310485)', diff saved to https://phabricator.wikimedia.org/P32352 and previous config saved to /var/cache/conftool/dbconfig/20220811-060625-ladsgroup.json
[06:06:39] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[06:06:39] <stashbot>	 T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674
[06:06:40] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[06:06:40] <stashbot>	 T314368: Switchover s2 master (db1162 -> db1122) - https://phabricator.wikimedia.org/T314368
[06:06:40] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[06:06:41] <stashbot>	 T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[06:06:41] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:06:41] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:12:31] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:11] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:31] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
[06:17:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maint
[06:17:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32353 and previous config saved to /var/cache/conftool/dbconfig/20220811-061734-ladsgroup.json
[06:17:37] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:22:29] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[06:28:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[06:32:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32354 and previous config saved to /var/cache/conftool/dbconfig/20220811-063240-ladsgroup.json
[06:47:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w
[06:47:27] <icinga-wm>	 wikimedia.org/wiki/Services/Monitoring/restbase
[06:47:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32355 and previous config saved to /var/cache/conftool/dbconfig/20220811-064746-ladsgroup.json
[06:49:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:53:03] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:25] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:11] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:39] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] <jouncebot>	 Amir1, apergos, jnuche, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T0700).
[07:00:18] <apergos>	 morning!
[07:00:28] <apergos>	 there are no trainees signed up and no patches in the window.
[07:01:03] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:01:41] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:21] <Amir1>	 apergos: today or early next week I'm planning to the patches for reload db config on the fly, that will impact WikiExporter
[07:02:43] <apergos>	 planning to what the patches, sorry? a verb missing there
[07:02:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312863)', diff saved to https://phabricator.wikimedia.org/P32356 and previous config saved to /var/cache/conftool/dbconfig/20220811-070252-ladsgroup.json
[07:02:53] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[07:02:56] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:03:02] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154
[07:03:05] <Amir1>	 apergos: deploy, sorry
[07:03:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[07:03:09] <wikibugs>	 (03PS2) 10David Caro: p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870)
[07:03:11] <wikibugs>	 (03PS2) 10David Caro: p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870)
[07:03:13] <wikibugs>	 (03PS2) 10David Caro: p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870)
[07:03:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T312863)', diff saved to https://phabricator.wikimedia.org/P32357 and previous config saved to /var/cache/conftool/dbconfig/20220811-070312-ladsgroup.json
[07:03:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154 (owner: 10Giuseppe Lavagetto)
[07:03:16] <Amir1>	 context: T298485
[07:03:18] <stashbot>	 T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485
[07:03:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154
[07:03:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2] Revert "scap: temporarily remove proxy for ongoing maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/822154 (owner: 10Giuseppe Lavagetto)
[07:03:35] <apergos>	 mind waiting until early next week? I plan on following some wikimania sessions today and tomorrow
[07:03:48] <apergos>	 just in case there's an issue
[07:04:13] <Amir1>	 sure
[07:05:06] <apergos>	 sounds good
[07:05:21] <apergos>	 I'm subscribed on that task and have been following the patch of course
[07:08:03] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:03] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2135 [puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628)
[07:10:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628) (owner: 10Jcrespo)
[07:11:03] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:47] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:11] <_joe_>	 !log pooling all services in codfw
[07:19:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:53] <wikibugs>	 (03PS1) 10David Caro: pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312
[07:23:48] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=inference
[07:24:07] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=shellbox-timeline
[07:24:56] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
[07:25:55] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:26:03] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:26:37] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:26:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (10ayounsi) Thanks for the explanations! I think it would be nice to have them on Wikitech to find them more easily in the future.  Based on...
[07:27:21] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:31] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:33] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:33] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:55] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:55] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:59] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:59] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:27:59] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:28:03] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:28:11] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[07:35:53] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:39] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, logstash2003, mc2024, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe200
[07:41:39] <icinga-wm>	 s-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:42:59] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:18] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "lvs: move ulsfo, eqsin off of conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/822155
[07:47:34] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[07:50:19] <vgutierrez>	 _joe_: is that you? ^
[07:50:34] <vgutierrez>	 the gdnsd/confd noise
[07:50:37] <_joe_>	 vgutierrez: yes sigh
[07:50:41] <_joe_>	 will fix it
[07:50:45] <vgutierrez>	 thx <3
[07:50:53] <_joe_>	 nothing bad happened btw
[07:51:05] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: move ulsfo, eqsin off of conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/822155 (owner: 10Vgutierrez)
[07:51:27] <vgutierrez>	 !log rolling restart of pybal in eqsin and ulsfo
[07:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:25] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[07:52:43] <wikibugs>	 (03CR) 10FNegri: p:ceph::osd: bring the cluster interface up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[07:55:33] <wikibugs>	 (03PS1) 10David Caro: icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317
[07:55:40] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw
[07:56:23] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (10jcrespo) After 38 hours of checking and 7 million rows compared to eqiad's es1021, I can confidently say that data was in a good state after the crash.
[07:56:56] <wikibugs>	 (03PS1) 10David Caro: wmcs.neutron: Add alert to open a task when agent down [alerts] - 10https://gerrit.wikimedia.org/r/822319
[07:57:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
[07:58:00] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
[07:58:43] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:59:28] <_joe_>	 vgutierrez: we should get recoveries soon
[07:59:28] <wikibugs>	 (03PS3) 10David Caro: p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870)
[07:59:30] <wikibugs>	 (03PS3) 10David Caro: p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870)
[07:59:32] <wikibugs>	 (03CR) 10David Caro: p:ceph::osd: bring the cluster interface up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[07:59:36] <wikibugs>	 (03PS3) 10David Caro: p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870)
[07:59:43] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:59:46] <vgutierrez>	 _joe_: nice
[08:00:59] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (28) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, logstash2003, mc2024, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, stat1004, stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe200
[08:00:59] <icinga-wm>	 s-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:01:00] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[08:03:47] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:04:09] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:15] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:06:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr3-ulsfo:xe-0/1/1
[08:06:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr3-ulsfo:xe-0/1/1
[08:09:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] netmon: Use netmon1003's IP address for the librenms endpoint [puppet] - 10https://gerrit.wikimedia.org/r/822124 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[08:10:35] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5001 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[08:11:09] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications for db2135 [puppet] - 10https://gerrit.wikimedia.org/r/822310 (https://phabricator.wikimedia.org/T314628) (owner: 10Jcrespo)
[08:11:15] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] netmon: Add the netmon1003 host to the alertmanager API rw [puppet] - 10https://gerrit.wikimedia.org/r/822126 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[08:11:56] <godog>	 jynus: merged your patch too!
[08:12:01] <jynus>	 thanks, I was about to ask
[08:12:41] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:13:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:15:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: logstash route k8s logs from proxy,httpd containers to webrequest partition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite)
[08:16:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you! LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro)
[08:16:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 (owner: 10David Caro)
[08:21:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:23:54] <wikibugs>	 (03PS1) 10Elukey: Add the experimental k8s ml-serve configuration [labs/private] - 10https://gerrit.wikimedia.org/r/822321
[08:24:29] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add the experimental k8s ml-serve configuration [labs/private] - 10https://gerrit.wikimedia.org/r/822321 (owner: 10Elukey)
[08:25:30] <wikibugs>	 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi)
[08:26:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:31:10] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] p:ceph::osd: add the routes only after the interface [puppet] - 10https://gerrit.wikimedia.org/r/822115 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[08:31:13] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] p:ceph::osd: bring the cluster interface up [puppet] - 10https://gerrit.wikimedia.org/r/822116 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[08:31:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (10jbond)
[08:31:19] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] p:ceph::osd: also install the ceph-osd package [puppet] - 10https://gerrit.wikimedia.org/r/822117 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[08:31:35] <wikibugs>	 (03PS1) 10Elukey: Add 'experimental' user/ns config for k8s ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/822324
[08:32:15] <wikibugs>	 (03CR) 10Vgutierrez: Enable query sorting for all testwiki requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori)
[08:32:31] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10fgiunchedi) My apologies! I ran the quickdatacopy the other day ahead of the failover and...
[08:34:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add 'experimental' user/ns config for k8s ml-serve clusters [puppet] - 10https://gerrit.wikimedia.org/r/822324 (owner: 10Elukey)
[08:37:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) Well, I tried our usual procedure https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings and the first two commands work OK, but attempting t...
[08:41:57] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:58] <wikibugs>	 (03PS1) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326
[08:47:05] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:47:34] <wikibugs>	 (03PS2) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326
[08:47:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you Andrea for looking into this!" [puppet] - 10https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse)
[08:55:05] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 112.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[08:59:01] <wikibugs>	 (03CR) 10Svantje Lilienthal: [C: 03+1] Enable editor line numbering on all namespaces, for twwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight)
[09:00:07] <jbond>	 !log update unzip
[09:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:27] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl1002 is CRITICAL: 113.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[09:03:21] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:19] <jbond>	 !log update gnutls28 on bullseye systems
[09:09:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:47] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:16] <wikibugs>	 (03PS3) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326
[09:12:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10fgiunchedi) I looked into why quickdatacopy didn't do the right thing: * the rsync server...
[09:14:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10fgiunchedi) Agreed on the short term fix to create the `.ssh` directory. However if we were not checking host keys to begin with I think we should keep...
[09:15:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:17:33] <jbond>	 XioNoX: topranks: fyi ^^
[09:19:43] <XioNoX>	 jbond: interesting, I don't know where the rancid passphrase is :)
[09:19:55] <XioNoX>	 but I guess it got armed on netmon1003 so someone should have it?
[09:20:08] <jbond>	 yes there was some chat in sre yesterday one sec
[09:20:48] <jbond>	 XioNoX: apparently ~/Wikimedia/pw/network-monitoring-keys-passphrase
[09:21:09] <wikibugs>	 (03PS1) 10FNegri: Use broader network for Ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870)
[09:22:04] <XioNoX>	 jbond: done
[09:22:08] <XioNoX>	 jbond: note that "access: @ops"
[09:22:28] <XioNoX>	 only the homer key is restricted to netops (as it has write access)
[09:22:40] <jbond>	 thanks and ack
[09:22:41] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] pytest: fix warning about marks not defined [alerts] - 10https://gerrit.wikimedia.org/r/822312 (owner: 10David Caro)
[09:22:52] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] icinga: fix test with duplicated sample [alerts] - 10https://gerrit.wikimedia.org/r/822317 (owner: 10David Caro)
[09:23:40] <wikibugs>	 (03CR) 10AikoChou: ml-services: add the experimental helmfile config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (owner: 10Elukey)
[09:25:16] <wikibugs>	 (03CR) 10FNegri: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36694/console" [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[09:25:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:27:25] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[09:27:31] <wikibugs>	 (03CR) 10FNegri: [V: 03+1 C: 03+2] Use broader network for Ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/822332 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[09:27:35] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (owner: 10Elukey)
[09:31:27] <wikibugs>	 (03PS4) 10Elukey: ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982)
[09:31:29] <wikibugs>	 (03CR) 10Elukey: ml-services: add the experimental helmfile config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey)
[09:32:18] <godog>	 !log arm keyholder on netmon2001
[09:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:45] <godog>	 thank you stashbot 
[09:33:16] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi)
[09:34:39] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:33] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:38:15] <icinga-wm>	 PROBLEM - Check systemd state on netboxdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@13-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:41:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (10ayounsi) 05Open→03Resolved a:03ayounsi
[09:43:57] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi)
[09:48:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820586 (owner: 10Gergő Tisza)
[09:48:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:48:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:49:59] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10fgiunchedi) re: the original failed disk I can confirm that `slot 0` (where the disk was) isn't currently listed:  ` root@ms-be2067:~# megacli -pdlist -aALL | grep 'Slot Number' Slot Number: 1 S...
[09:51:06] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey)
[09:51:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: add -1 bucket to mediawiki_access_log [puppet] - 10https://gerrit.wikimedia.org/r/822077 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi)
[09:52:28] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922)
[09:55:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi)
[09:55:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:55:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:56:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:base::firewall: use either etc or abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/822125 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[09:56:51] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot: UX improvements - https://phabricator.wikimedia.org/T314843 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe
[09:56:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:56:55] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe)
[09:56:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[09:57:24] <wikibugs>	 (03PS4) 10Filippo Giunchedi: mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922)
[10:00:05] <jouncebot>	 mvolz: Dear deployers, time to do the Services – Citoid / Zotero deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1000).
[10:00:36] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10fgiunchedi) Thank you for vopsbot, looks really good and useful!  A perhaps silly/minor thing: I think we should be using `-` instead of `_` as a delimiter for commands, a...
[10:00:40] <TheresNoTime>	 Beta cluster 503ing
[10:01:32] <TheresNoTime>	 bad scap runs x3
[10:02:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:02:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:02:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey)
[10:02:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[10:03:23] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 294, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:04:57] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:11] <icinga-wm>	 PROBLEM - confd service on sretest1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:07:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[10:12:10] <jbond>	 !log upload cond package to bullseye-wikimedia
[10:13:01] <jbond>	 !log (correction) upload *confd* package to bullseye-wikimedia
[10:14:13] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:58] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting: User management in vopsbot - https://phabricator.wikimedia.org/T314842 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe I'm resolving the task because I think the current changes are enough for the current goal. I'll come back to look at what @Rhi...
[10:16:03] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe)
[10:31:45] <icinga-wm>	 RECOVERY - Check systemd state on netboxdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:09] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:28] <wikibugs>	 (03PS1) 10Ayounsi: Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486)
[10:37:30] <wikibugs>	 (03PS1) 10Jbond: O:sretest: enable confd based abuse filter on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822340 (https://phabricator.wikimedia.org/T313825)
[10:38:43] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:42:09] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi)
[10:49:47] <wikibugs>	 (03PS1) 10Jbond: C:postgresql::slave: update recovery configueration [puppet] - 10https://gerrit.wikimedia.org/r/822342
[10:50:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:sretest: enable confd based abuse filter on sretest [puppet] - 10https://gerrit.wikimedia.org/r/822340 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond)
[10:58:41] <icinga-wm>	 RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:00:54] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:postgresql::slave: update recovery configueration [puppet] - 10https://gerrit.wikimedia.org/r/822342 (owner: 10Jbond)
[11:06:51] <icinga-wm>	 PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1002 is CRITICAL: File not found: /etc/ferm/conf.d/00_defs_requestctl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[11:09:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[11:10:01] <icinga-wm>	 RECOVERY - confd service on sretest1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:10:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:24] <wikibugs>	 (03PS2) 10Ayounsi: Enable pynetbox threading [software/spicerack] - 10https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486)
[11:14:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307  - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[11:20:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[11:20:39] <wikibugs>	 (03PS1) 10Jbond: C:ferm: allow useres to read config files, needed for nrpe [puppet] - 10https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825)
[11:20:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[11:26:36] <wikibugs>	 (03PS2) 10Jbond: C:ferm: allow useres to read config files, needed for nrpe [puppet] - 10https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825)
[11:32:57] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:48] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[11:39:57] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:51] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10phaultfinder)
[11:45:36] <jnuche>	 hi there! just an FYI, the WMCS cluster is currently having some issues. I have disabled the sync of WM code to beta for the time being until things stabilize a bit
[11:52:41] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey)
[11:55:53] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:50] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[11:59:43] <icinga-wm>	 PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:02:51] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:03:17] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add the experimental helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/822326 (https://phabricator.wikimedia.org/T314982) (owner: 10Elukey)
[12:05:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:09:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:10:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:11:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:12:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:12:29] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) @fgiunchedi yeah that may be an option.  I'm not sure how easy it is to change Rancid to add that to the command when running ssh, but I'm sur...
[12:13:55] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet
[12:14:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] mtail: test for histogram -1 bucket [puppet] - 10https://gerrit.wikimedia.org/r/822056 (https://phabricator.wikimedia.org/T314922) (owner: 10Filippo Giunchedi)
[12:14:23] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:16:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[12:17:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[12:17:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:17:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:17:38] <wikibugs>	 10SRE, 10Traffic, 10observability, 10Patch-For-Review, 10Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Me and @Vgutierrez have fixed the existing histograms and I've added a test for `buckets -1` so we d...
[12:23:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) Another oddity here with rancid from netmon1003.  The permission change has removed the problem for most of our estate (all the Juniper device...
[12:23:18] <wikibugs>	 (03PS1) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246)
[12:23:50] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[12:26:15] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase202[367].codfw.wmnet
[12:26:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2018.codfw.wmnet
[12:27:36] <jnuche>	 if someone saw my message above, I've re-enabled beta sync
[12:32:15] <wikibugs>	 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10fgiunchedi)
[12:32:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) Logs suggest a timeout:  ` scs-oe16-esams.mgmt.esams.wmnet oglogin error: Error: TIMEOUT reached scs-oe16-esams.mgmt.esams.wmnet: missed cmd(s...
[12:33:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "That's correct Mary, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[12:33:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Add proxy_url to prometheus::blackbox::check:http as a parameter. [puppet] - 10https://gerrit.wikimedia.org/r/822179 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang)
[12:34:55] <icinga-wm>	 PROBLEM - Host logstash2003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:35:40] <godog>	 known ^, host is a lemon
[12:36:34] <wikibugs>	 (03PS2) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246)
[12:37:11] <wikibugs>	 (03PS3) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246)
[12:39:53] <wikibugs>	 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) p:05Triage→03Medium a:03Papaul
[12:46:26] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:49:05] <wikibugs>	 (03PS4) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246)
[12:49:30] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:51:24] <wikibugs>	 (03PS5) 10Btullis: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246)
[12:55:37] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[12:55:41] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[12:56:47] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1060.eqiad.wmnet with OS bullseye
[12:56:56] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1060.eqiad.wmnet with OS bullseye
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1300).
[13:00:05] <jouncebot>	 awight, koi, MatmaRex, and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <MatmaRex>	 yup hi
[13:00:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:24] <phuedx>	 o/
[13:00:34] <awight>	 I can self-deploy, and happy to do anyone else's patches.
[13:01:03] <wikibugs>	 (03PS2) 10Awight: Enable editor line numbering on all namespaces, for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852)
[13:01:10] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight)
[13:01:19] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:01:36] <koi>	 o/
[13:01:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney) I believe the issue is that the expect script Rancid is running for these is not saying "yes" to accept the host key.  This did not happen in...
[13:02:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable editor line numbering on all namespaces, for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822073 (https://phabricator.wikimedia.org/T302852) (owner: 10Awight)
[13:02:10] <awight>	 hrm logspam-watch is broken
[13:02:33] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:03:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:43] <awight>	 looks okay on debug, deploying
[13:05:57] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:06:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36697/console" [puppet] - 10https://gerrit.wikimedia.org/r/822342 (owner: 10Jbond)
[13:06:48] <wikibugs>	 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10BTullis) Since our meeting, I have been reading the docs around benthos and I've got to say, I find it really compelling!  This looks to m...
[13:07:16] <wikibugs>	 (03PS3) 10Awight: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang)
[13:08:44] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822073|Enable editor line numbering on all namespaces, for twwiki (T302852)]] (duration: 03m 42s)
[13:08:48] <stashbot>	 T302852: Enable line numbering in all namespaces for more wikis (collection of requests) - https://phabricator.wikimedia.org/T302852
[13:08:52] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[13:09:00] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang)
[13:09:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "trwiki: Change old and new vector logos for 500k articles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/821330 (owner: 10Stang)
[13:10:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:10:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:26] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
[13:11:34] <awight>	 koi: the 500k logo change is ready on mwdebug1001 if you wish to test it
[13:11:40] <koi>	 looking
[13:11:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:12:17] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[13:12:19] <koi>	 awight: LGTM
[13:12:26] <awight>	 ack
[13:12:34] <wikibugs>	 (03PS1) 10Jgreen: Enable icinga monitoring for frlog1002. [puppet] - 10https://gerrit.wikimedia.org/r/822369 (https://phabricator.wikimedia.org/T312581)
[13:14:03] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1060.eqiad.wmnet with reason: host reimage
[13:14:24] <awight>	 I think I can deploy wmf-config first, but a bit worried that there might be a race condition with logos/
[13:15:33] <awight>	 Generally, patches should be split up so that each one is safe regardless of the order in which each file is synced.
[13:15:52] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Enable icinga monitoring for frlog1002. [puppet] - 10https://gerrit.wikimedia.org/r/822369 (https://phabricator.wikimedia.org/T312581) (owner: 10Jgreen)
[13:16:17] <koi>	 noticed, will think about that in the future
[13:16:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:16:59] <awight>	 Deployers: can I run createExtensionTables.php safely, or is that an Ops thing?
[13:17:12] <awight>	 urbanecm: ^
[13:17:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:17:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:18:01] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 1) (duration: 03m 13s)
[13:18:31] <koi>	 T313173#8101583
[13:18:31] <stashbot>	 T313173: add WikiLove extension in ptwikinews - https://phabricator.wikimedia.org/T313173
[13:18:39] <koi>	 awight: previous task might help ^
[13:18:44] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:50] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add additional network device info to puppet facts [puppet] - 10https://gerrit.wikimedia.org/r/821781 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[13:18:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:19:04] <urbanecm>	 awight: deployers can run that script
[13:19:20] <awight>	 koi: very helpful, thanks!
[13:19:25] <awight>	 urbanecm: great
[13:19:44] <topranks>	 !log merging CR821781 to expose additional network info in puppet facts
[13:19:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:20] <logmsgbot>	 !log awight@deploy1002 Synchronized logos/: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 2) (duration: 03m 09s)
[13:24:02] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[13:25:08] <wikibugs>	 (03PS2) 10Awight: trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang)
[13:25:14] <logmsgbot>	 !log awight@deploy1002 Synchronized static/images: Config: [[gerrit:821330|Revert "trwiki: Change old and new vector logos for 500k articles"]] (part 3) (duration: 03m 09s)
[13:26:12] <wikibugs>	 (03CR) 10Awight: [C: 03+2] trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang)
[13:27:11] <wikibugs>	 (03Merged) 10jenkins-bot: trwikiquote: Install WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822130 (https://phabricator.wikimedia.org/T314895) (owner: 10Stang)
[13:27:55] <awight>	 koi: wikilove can be tested on trwikiquote using mwdebug1001
[13:28:00] <koi>	 looking
[13:28:50] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[13:29:38] <wikibugs>	 (PS1) Vgutierrez: smokeping: Use asw1-b12-drmrs instead of lvs6001 [puppet] - https://gerrit.wikimedia.org/r/822373
[13:29:42] <koi>	 awight: LGTM
[13:30:04] <icinga-wm>	 PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100%
[13:30:54] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:32:06] <awight>	 koi: +1 ty
[13:33:58] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host logstash2003.codfw.wmnet
[13:34:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:35:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:35:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:36:00] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822130|trwikiquote: Install WikiLove extension (T314895)]] (duration: 03m 30s)
[13:36:03] <stashbot>	 T314895: Enable the WikiLove extension on trwikiquote - https://phabricator.wikimedia.org/T314895
[13:36:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:36:16] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 20.30 ms
[13:36:18] <awight>	 MatmaRex: Would you like to self-deploy, or shall I?
[13:36:35] <MatmaRex>	 awight: please do, i don't have access
[13:36:41] <awight>	 sure!
[13:36:58] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1060.eqiad.wmnet with OS bullseye
[13:37:04] <wikibugs>	 SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1060.eqiad.wmnet with OS bullseye completed: - elastic1060 (...
[13:38:07] <wikibugs>	 (CR) Awight: "Many of the .html files have conflict markers--is this a problem?" [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: Bartosz Dziewoński)
[13:38:15] <awight>	 MatmaRex: Can you check that? ^
[13:38:40] <awight>	 Seems like the tests should have failed...
[13:39:14] <wikibugs>	 (CR) Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (1 comment) [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: Bartosz Dziewoński)
[13:39:21] <MatmaRex>	 yeah
[13:39:36] <MatmaRex>	 the tests parse the HTML, maybe they look enough like HTML tags
[13:39:41] <MatmaRex>	 let me try to rebuild the tests
[13:40:14] <awight>	 kk, I'll move on to phuedx's patches in the meantime
[13:41:07] <wikibugs>	 (PS3) Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707)
[13:41:43] <wikibugs>	 (PS3) Awight: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - https://gerrit.wikimedia.org/r/820666 (owner: Phuedx)
[13:41:55] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/820666 (owner: Phuedx)
[13:42:06] <MatmaRex>	 oh wait, i see
[13:42:12] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:42:12] <MatmaRex>	 those files aren't used. oops
[13:42:47] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream"" [mediawiki-config] - https://gerrit.wikimedia.org/r/820666 (owner: Phuedx)
[13:44:01] <awight>	 phuedx: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820666 is on mwdebug1001
[13:44:12] <icinga-wm>	 PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:18] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:20] <wikibugs>	 (PS4) Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707)
[13:45:07] <wikibugs>	 (CR) Bartosz Dziewoński: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (1 comment) [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: Bartosz Dziewoński)
[13:45:35] <MatmaRex>	 awight: the test files with conflict markers weren't actually used. fixed now
[13:45:36] <awight>	 phuedx: I'm not sure how to test, all I can say is that I don't see js console errors and the site still works when I mouse around.
[13:45:40] <awight>	 MatmaRex: ty
[13:45:57] <wikibugs>	 (CR) MVernon: [C: +2] Hieradata: move restbase prod to 3.11.13 [puppet] - https://gerrit.wikimedia.org/r/819578 (https://phabricator.wikimedia.org/T309896) (owner: MVernon)
[13:46:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:46:22] <phuedx>	 awight: Testing now. I'm verifying that the stream config for the stream is only sent to the client on testwiki and not on, say, enwiki
[13:46:48] <awight>	 phuedx: good thing you're testing ;-), I was accidentally on enwiki
[13:46:54] <phuedx>	 awight: LGTM
[13:47:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:47:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:47:58] <awight>	 ack
[13:48:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:50:30] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[13:50:59] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:820666|Revert "Revert "testwiki: Add mediawiki.web_ui.interactions stream""]] (duration: 03m 10s)
[13:51:41] <wikibugs>	 (CR) Awight: [C: +2] "Deploying.  This historical block deserves a celebration of newfound emptiness!" [mediawiki-config] - https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: Phuedx)
[13:52:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[13:52:07] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[13:53:24] <wikibugs>	 (PS3) Ori: Enable query sorting for all testwiki requests [puppet] - https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868)
[13:54:13] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: Bartosz Dziewoński)
[13:55:00] <wikibugs>	 (PS2) Awight: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: Phuedx)
[13:55:15] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: Phuedx)
[13:55:17] <wikibugs>	 (CR) Ori: Enable query sorting for all testwiki requests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[13:55:40] <ori>	 vgutierrez: ^
[13:56:03] <wikibugs>	 (Merged) jenkins-bot: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: Phuedx)
[13:56:29] <wikibugs>	 (PS1) Ladsgroup: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865)
[13:56:40] <wikibugs>	 (CR) Ori: [C: +1] Use proxy for wikifunctions beta blackbox probe. [puppet] - https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[13:57:33] <awight>	 phuedx: the second event patch is ready on mwdebug1001
[13:57:46] <wikibugs>	 (PS2) Jbond: C:postgresql::slave: update recovery configuration [puppet] - https://gerrit.wikimedia.org/r/822342
[13:58:25] <awight>	 (pushing the deployment window a few minutes beyond 14:00)
[13:59:06] <icinga-wm>	 RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:17] <icinga-wm>	 RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:01:34] <wikibugs>	 (CR) Jbond: [C: +2] C:postgresql::slave: update recovery configuration [puppet] - https://gerrit.wikimedia.org/r/822342 (owner: Jbond)
[14:01:51] <wikibugs>	 (Merged) jenkins-bot: CommentFormatter: Set 'data-mw-comment' even when reply tool disabled [extensions/DiscussionTools] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822149 (https://phabricator.wikimedia.org/T314707) (owner: Bartosz Dziewoński)
[14:02:30] <phuedx>	 awight: Something's up. Don't proceed with that patch. The default WikibaseTermboxInteraction set in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/814192/1/extension-repo.json#b607 isn't being propagated to the client
[14:02:40] <awight>	 phuedx: okay, reverting!
[14:03:05] <wikibugs>	 (PS1) Awight: Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry" [mediawiki-config] - https://gerrit.wikimedia.org/r/822394
[14:03:13] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:15] <wikibugs>	 (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/822394 (owner: Awight)
[14:03:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:03:25] <phuedx>	 We'll try again next week!
[14:04:02] <wikibugs>	 (Merged) jenkins-bot: Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry" [mediawiki-config] - https://gerrit.wikimedia.org/r/822394 (owner: Awight)
[14:04:16] <awight>	 MatmaRex: DiscussionTools patch is ready to test on mwdebug1001
[14:04:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:04:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:04:49] <Amir1>	 jouncebot: nowandnext
[14:04:49] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 55 minute(s)
[14:04:49] <jouncebot>	 In 1 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1600)
[14:04:58] <MatmaRex>	 looking
[14:05:06] <awight>	 phuedx: If you don't mind, can you confirm that the event is back to normal?
[14:05:10] <awight>	 (mwdebug1001)
[14:05:16] <awight>	 Amir1: I should be done in < 10 min
[14:05:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:05:22] <Amir1>	 awight: let me know once you're done, I have some evil patches to push
[14:05:26] <Amir1>	 Thanks <3
[14:05:30] <awight>	 :-D I expect no less
[14:05:37] <awight>	 (no less than evil ;-)
[14:06:04] <MatmaRex>	 awight: looks good
[14:06:07] <awight>	 ty!
[14:06:15] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott)
[14:06:25] <wikibugs>	 (PS5) Andrew Bogott: Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268)
[14:06:33] <phuedx>	 awight: I was testing on the wrong wiki. Of course... *facepalm*
[14:06:50] <phuedx>	 Anyway, everything's in a good state
[14:07:10] <awight>	 phuedx: hehe okay +1 since this is just cleanup, AFAICT, I won't de-revert.
[14:07:27] <awight>	 MatmaRex: deploying...
[14:08:13] <wikibugs>	 (CR) Jbond: [C: +2] C:ferm: allow useres to read config files, needed for nrpe [puppet] - https://gerrit.wikimedia.org/r/822361 (https://phabricator.wikimedia.org/T313825) (owner: Jbond)
[14:08:27] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[14:08:50] <wikibugs>	 (CR) CI reject: [V: -1] Remove hiera refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott)
[14:09:04] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1004.wikimedia.org
[14:09:58] <phuedx>	 awight: No worries. I'll queue up the de-revert for next week
[14:10:14] <logmsgbot>	 !log awight@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/DiscussionTools/includes/CommentFormatter.php: Backport: [[gerrit:822149|CommentFormatter: Set 'data-mw-comment' even when reply tool disabled (T314707)]] (duration: 03m 31s)
[14:10:19] <stashbot>	 T314707: New topic tool and topic subscriptions don't work when reply tool is disabled and the page would have reply links - https://phabricator.wikimedia.org/T314707
[14:10:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:10:36] <wikibugs>	 (PS1) Phuedx: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822395
[14:11:10] <awight>	 !log EU backport window complete
[14:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:12] <awight>	 Amir1: ^
[14:11:16] <wikibugs>	 (PS6) Andrew Bogott: Remove puppet refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268)
[14:11:20] <MatmaRex>	 thanks awight
[14:11:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:11:25] <Amir1>	 awesome
[14:11:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:11:27] <awight>	 My pleasure!
[14:11:31] <wikibugs>	 (PS2) Ladsgroup: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865)
[14:11:35] <wikibugs>	 (CR) Ladsgroup: [C: +2] Stop writing to the old templatelinks fields in s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[14:12:20] <wikibugs>	 (Merged) jenkins-bot: Stop writing to the old templatelinks fields in s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/822375 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[14:12:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:13:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32360 and previous config saved to /var/cache/conftool/dbconfig/20220811-141309-ladsgroup.json
[14:13:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:14:06] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Glrx) >>! In T265549#8144450, @tstarling wrote: >>>! In T40010#8144396, @Arthur2e5 wrote: >> I am… getting impatient enough to ask: how hard is it to, really, just make our ow...
[14:15:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:15:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove puppet refs to cloudcontrol1003 and cloudcontrol1004 [puppet] - https://gerrit.wikimedia.org/r/822142 (https://phabricator.wikimedia.org/T313268) (owner: Andrew Bogott)
[14:15:59] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:16:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:16:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:16:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:17:16] <wikibugs>	 (PS1) Ssingh: hiera: enable ATS9 on cp1089 [puppet] - https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651)
[14:17:17] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:822375|Stop writing to the old templatelinks fields in s2 (T312865)]] (duration: 03m 25s)
[14:17:21] <stashbot>	 T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[14:17:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:18:12] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Glrx) >>! In T40010#7996397, @TheDJ wrote: > I would like to note that this can all easily be implemented for non-wmf wikis. If someone just spent some time on adapting SVGHan...
[14:18:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:18:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:18:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:18:27] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1004.wikimedia.org
[14:19:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:19:21] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:19:30] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36698/console" [puppet] - https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:19:45] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1003.wikimedia.org
[14:20:23] <wikibugs>	 (PS1) Ssingh: hiera: enable ATS9 on cp1090 [puppet] - https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651)
[14:21:17] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36699/console" [puppet] - https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:22:19] <wikibugs>	 (PS1) Ssingh: hiera: enable ATS9 on cp3064 [puppet] - https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651)
[14:22:23] <wikibugs>	 (PS1) Andrew Bogott: cloudcontrols: adjust fernet key rotation times [puppet] - https://gerrit.wikimedia.org/r/822385
[14:22:51] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table ptwiki.templatelinks: Duplicate entry 6941876-0- for key tl_from, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1156-bin.001819, end_log_pos 231590299 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:23:05] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:06] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36700/console" [puppet] - https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:23:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:23:39] <wikibugs>	 (PS1) Ssingh: hiera: enable ATS9 on cp3065 [puppet] - https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651)
[14:23:42] <wikibugs>	 SRE, Data-Engineering, Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (Ottomata)  I agree benthos looks really fun!   I think there is a real need for easy to use stream processors.  We evaluated Knative Event...
[14:24:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[14:24:55] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36701/console" [puppet] - https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: Ssingh)
[14:25:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:26:57] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01266 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:27:47] * jbond looking
[14:28:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:28:05] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1003.wikimedia.org
[14:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32361 and previous config saved to /var/cache/conftool/dbconfig/20220811-142813-ladsgroup.json
[14:29:14] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission cloudcontrol100[34] - https://phabricator.wikimedia.org/T313268 (Andrew) a:Andrew→Cmjohnson
[14:30:02] <wikibugs>	 (PS2) Phuedx: Revert "Revert "Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822395 (https://phabricator.wikimedia.org/T290303)
[14:30:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:31:19] <wikibugs>	 (PS2) Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104)
[14:32:02] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 132 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:32:14] <Amir1>	 jbond: did you get a page for this? Because I didn't
[14:32:52] <RhinosF1>	 i don't see a page listed in klaxon
[14:33:06] <jbond>	 Amir1: no i have an irc highlight for that specific alert
[14:33:19] <Amir1>	 aha, amazing. Thanks
[14:33:26] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:33:27] <jbond>	 :) no probs
[14:33:30] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 949.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:33:59] <RhinosF1>	 the s2 lag is known ^ i already mentioned it to Amir1 
[14:34:08] <Amir1>	 I'm working on it
[14:34:39] <RhinosF1>	 wanted to make sure everyone else knew :)
[14:35:30] <Amir1>	 well, pinging me in the middle of debugging just slows me down
[14:36:53] <RhinosF1>	 i should have dropped the 1 or put a . somewhere
[14:37:22] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 689 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:38:38] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 6 probes of 775 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:38:43] <wikibugs>	 (PS1) David Caro: ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[14:39:25] <Amir1>	 it should be catching up now. it seems it had a schema drift: an extra unique index, and only on ptwiki. I checked some other s2 wikis and they were fine, but I think I need to check each one
[14:39:35] <wikibugs>	 (PS1) Jdlrobson: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.24) - https://gerrit.wikimedia.org/r/822396 (https://phabricator.wikimedia.org/T314952)
[14:39:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:39:44] <icinga-wm>	 RECOVERY - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:39:55] <wikibugs>	 (PS1) Mforns: analytics:refinery:job:data_purge: Improve drop-webrequest-sequence-stats [puppet] - https://gerrit.wikimedia.org/r/822408 (https://phabricator.wikimedia.org/T270433)
[14:40:01] <RhinosF1>	 Amir1: replag.toolforge.org looks caught up
[14:40:51] <Amir1>	 it's not in any other wiki of s2
[14:40:58] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl1002 is OK: (C)100 ge (W)50 ge 34.93 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[14:41:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Use proxy for wikifunctions beta blackbox probe. [puppet] - https://gerrit.wikimedia.org/r/822181 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang)
[14:41:46] <wikibugs>	 (CR) CI reject: [V: -1] ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[14:42:21] <wikibugs>	 (PS1) Jbond: C:ferm: add o+x permissions to ferm directory [puppet] - https://gerrit.wikimedia.org/r/822409
[14:42:24] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36703/console" [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[14:42:45] <wikibugs>	 (CR) Jbond: [C: +2] C:ferm: add o+x permissions to ferm directory [puppet] - https://gerrit.wikimedia.org/r/822409 (owner: Jbond)
[14:43:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32362 and previous config saved to /var/cache/conftool/dbconfig/20220811-144318-ladsgroup.json
[14:47:30] <wikibugs>	 (PS2) David Caro: ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[14:48:52] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage (bullseye upgrade) - ryankemper@cumin1001 - T289135
[14:48:57] <stashbot>	 T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[14:49:40] <icinga-wm>	 RECOVERY - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[14:49:53] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36704/console" [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[14:50:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: Btullis)
[14:50:30] <icinga-wm>	 PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:50:37] <wikibugs>	 (CR) CI reject: [V: -1] ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[14:50:50] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005842 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:51:10] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on db1155 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:52:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:53:44] <wikibugs>	 (03CR) 10FNegri: [C: 04-1] "The diff here doesn't look right https://puppet-compiler.wmflabs.org/pcc-worker1002/36704/cloudcephosd1025.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[14:53:50] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:18] <jinxer-wm>	 (CertAlmostExpired) firing: Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:54:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:55:38] <icinga-wm>	 RECOVERY - Check systemd state on poolcounter1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:45] <inflatador>	 !log bking@cumin1001 running puppet agent across eqiad elastic hosts
[14:55:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:58:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32364 and previous config saved to /var/cache/conftool/dbconfig/20220811-145823-ladsgroup.json
[15:01:22] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:03:20] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:08] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:05:48] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:05:56] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:06:06] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:06:18] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:06:24] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:06:30] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:06:30] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:06:50] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:20] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:22] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:30] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:46] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:48] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:07:56] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:08:00] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[15:09:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[15:09:57] <vgutierrez>	 <3 _joe_
[15:15:46] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:15:47] <wikibugs>	 (03PS3) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[15:16:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:18:16] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36705/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:20:34] <icinga-wm>	 RECOVERY - DNS on db1191.mgmt is OK: DNS OK: 0.012 seconds response time. db1191.mgmt.eqiad.wmnet returns 10.65.3.4 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:21:44] <wikibugs>	 (03PS4) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[15:22:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:22:48] <wikibugs>	 (03CR) 10David Caro: ceph: use many cluster and public networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:23:16] <wikibugs>	 (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:23:18] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36706/console" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:24:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:27:04] <icinga-wm>	 RECOVERY - DNS on db1193.mgmt is OK: DNS OK: 0.013 seconds response time. db1193.mgmt.eqiad.wmnet returns 10.65.3.9 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:28:54] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[15:29:37] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:29:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[15:30:14] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "The diffs look good now, only changing the two lines with the public and cloud networks configuration file (and params for those in the cl" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:31:16] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.55 ms
[15:32:29] <wikibugs>	 (03Merged) 10jenkins-bot: Update the VarnishKafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/822367 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[15:38:33] <wikibugs>	 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe)
[15:43:11] <wikibugs>	 (03CR) 10FNegri: ceph: use many cluster and public networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:43:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:43:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp3065 [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:43:56] <icinga-wm>	 RECOVERY - DNS on db1192.mgmt is OK: DNS OK: 0.011 seconds response time. db1192.mgmt.eqiad.wmnet returns 10.65.3.5 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:44:01] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp1090 [puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:44:07] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp1089 [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:45:08] <wikibugs>	 (03PS1) 10Ayounsi: Add names to flow collectors [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805)
[15:45:12] <icinga-wm>	 RECOVERY - DNS on db1187.mgmt is OK: DNS OK: 0.020 seconds response time. db1187.mgmt.eqiad.wmnet returns 10.65.3.0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:45:48] <icinga-wm>	 RECOVERY - DNS on db1185.mgmt is OK: DNS OK: 0.021 seconds response time. db1185.mgmt.eqiad.wmnet returns 10.65.2.254 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:48:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudcontrols: adjust fernet key rotation times [puppet] - 10https://gerrit.wikimedia.org/r/822385 (owner: 10Andrew Bogott)
[15:49:36] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:50:00] <wikibugs>	 (03CR) 10Ayounsi: "Local test returns:" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi)
[15:51:34] <wikibugs>	 (03CR) 10Ayounsi: "Passes junoser too: `junoser -c output/cr3-ulsfo.wikimedia.org.out`" [homer/public] - 10https://gerrit.wikimedia.org/r/822414 (https://phabricator.wikimedia.org/T313805) (owner: 10Ayounsi)
[15:53:54] <icinga-wm>	 RECOVERY - DNS on db1190.mgmt is OK: DNS OK: 0.012 seconds response time. db1190.mgmt.eqiad.wmnet returns 10.65.3.3 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:59:06] <icinga-wm>	 RECOVERY - DNS on db1195.mgmt is OK: DNS OK: 0.011 seconds response time. db1195.mgmt.eqiad.wmnet returns 10.65.3.12 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:00:05] <jouncebot>	 jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1600). nyaa~
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:03:53] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) T308677 shows an example where the installer destroys a filesystem.
[16:05:52] <wikibugs>	 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing swift/puppet problems - example reimage - https://phabricator.wikimedia.org/T308644 (10MatthewVernon) Also related is that following T309027, all the SSDs on ms-* reliably appear as non-rotational, so could in theory be...
[16:10:19] <RhinosF1>	 TheresNoTime: your new message got used ^
[16:10:40] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add stub data for profile::vopsbot [labs/private] - 10https://gerrit.wikimedia.org/r/822417 (https://phabricator.wikimedia.org/T314840)
[16:12:40] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: name=elastic1100
[16:12:42] <icinga-wm>	 RECOVERY - DNS on db1189.mgmt is OK: DNS OK: 0.012 seconds response time. db1189.mgmt.eqiad.wmnet returns 10.65.3.2 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:13:49] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
[16:14:04] <icinga-wm>	 RECOVERY - DNS on db1186.mgmt is OK: DNS OK: 0.013 seconds response time. db1186.mgmt.eqiad.wmnet returns 10.65.2.255 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:14:50] <icinga-wm>	 RECOVERY - Host ps1-c8-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[16:15:04] <icinga-wm>	 RECOVERY - DNS on db1194.mgmt is OK: DNS OK: 0.016 seconds response time. db1194.mgmt.eqiad.wmnet returns 10.65.3.10 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:16:34] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[16:17:39] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite)
[16:22:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (10cmooney)
[16:22:58] <wikibugs>	 (03Abandoned) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858 (owner: 10Thiemo Kreuz (WMDE))
[16:26:15] <inflatador>	 !log bking@elastic1054 attempting to ban elastic1100-1102 from cluster due to firewall issues
[16:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:47] <wikibugs>	 (03PS1) 10Cwhite: tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861)
[16:27:49] <wikibugs>	 (03PS1) 10Cwhite: tcpircbot: send !log events to log stream [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861)
[16:28:52] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[16:29:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
[16:29:41] <stashbot>	 T309810: Service implementation for elastic1[084-102].eqiad.wmnet - https://phabricator.wikimedia.org/T309810
[16:29:52] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810
[16:30:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[16:30:24] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[16:33:01] <wikibugs>	 (03PS1) 10Cwhite: tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861)
[16:35:29] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[16:35:33] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[16:38:36] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:44:02] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[16:45:18] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:45:23] <wikibugs>	 (03PS5) 10David Caro: ceph: use many cluster and public networks [puppet] - 10https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[16:50:50] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:55:01] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] analytics:refinery:job:data_purge: Improve drop-webrequest-sequence-stats [puppet] - 10https://gerrit.wikimedia.org/r/822408 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns)
[17:00:05] <jouncebot>	 bd808: My dear minions, it's time we take the moon! Just kidding. Time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1700).
[17:00:35] <wikibugs>	 (03PS3) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882)
[17:01:54] * bd808 checks for things to deploy
[17:02:48] <wikibugs>	 (03PS3) 10Andrew Bogott: openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268)
[17:04:21] <bd808>	 meh. not worth a deploy for the amount of new translations for dev portal.
[17:08:52] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder)
[17:13:20] <wikibugs>	 10ops-codfw, 10SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (10Papaul) 05Open→03Resolved @fgiunchedi it was a cable issue. Now fixed
[17:14:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:15:34] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=search-omega-https,name=elastic1100.eqiad.wmnet
[17:18:25] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=yes; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
[17:19:00] <logmsgbot>	 !log bking@cumin1001 conftool action : set/weight=10:pooled=no; selector: service=elasticsearch-omega-ssl,name=elastic1100.eqiad.wmnet
[17:19:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[17:21:47] <Amir1>	 !bash Krinkle: when in doubt, add another index
[17:21:47] <stashbot>	 Amir1: Stored quip at https://bash.toolforge.org/quip/8I3tjYIBa_6PSCT9Ln_v
[17:22:36] <dancy>	 hehe
[17:22:54] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp1089 [puppet] - 10https://gerrit.wikimedia.org/r/822381 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:23:20] * Krinkle now knows what it feels like to be quoted out of context
[17:26:56] <wikibugs>	 (03PS2) 10Cwhite: tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861)
[17:27:55] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (10Papaul)
[17:28:00] <sukhe>	 !log testing ATS 9.1.3-1wm1 on cp1089: T309651
[17:28:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:04] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[17:28:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova: use TLS on rabbitmq connections [puppet] - 10https://gerrit.wikimedia.org/r/821298 (https://phabricator.wikimedia.org/T297268) (owner: 10Andrew Bogott)
[17:31:03] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp3065 [puppet] - 10https://gerrit.wikimedia.org/r/822406 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:33:08] <sukhe>	 !log testing ATS 9.1.3-1wm1 on cp3065: T309651
[17:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:12] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[17:34:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host netmon2002
[17:35:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host netmon2002
[17:36:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:36:37] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp1090 [puppet] - 10https://gerrit.wikimedia.org/r/822382 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:38:44] <sukhe>	 !log testing ATS 9.1.3-1wm1 on cp1090: T309651
[17:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:48] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[17:40:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:41:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
[17:41:53] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp3064 [puppet] - 10https://gerrit.wikimedia.org/r/822384 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[17:44:38] <wikibugs>	 (03PS1) 10Majavah: Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795)
[17:45:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10BCornwall) 05In progress→03Stalled @MRaishWMF as asked, are you just wanting us to add your SSH key to your account? Seeing as you're already part o...
[17:46:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10BCornwall) 05In progress→03Stalled Hi, @soworu, are you still wanting this access? If so, it'd be useful to answer the questions posed by @Ottomata and @Vgutierrez
[17:46:59] <sukhe>	 !log testing ATS 9.1.3-1wm1 on cp3064: T3096515
[17:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host netmon2002.mgmt.codfw.wmnet with reboot policy FORCED
[17:51:50] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah)
[17:52:05] <taavi>	 jouncebot: nowandnext
[17:52:05] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T1700)
[17:52:05] <jouncebot>	 In 2 hour(s) and 7 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T2000)
[17:52:26] <sukhe>	 !log testing ATS 9.1.3-1wm1 on cp3064: T309651
[17:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:30] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[17:52:46] <taavi>	 i'm quickly deploying a mw patch
[17:52:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah)
[17:53:40] <wikibugs>	 (03Merged) 10jenkins-bot: Fix labtestwiki database name servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822428 (https://phabricator.wikimedia.org/T310795) (owner: 10Majavah)
[17:55:14] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:57:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:57:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:57:54] <wikibugs>	 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10Ottomata) Are there actionables on this task?  I'm considering re...
[17:58:37] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:822428|Fix labtestwiki database name servers (T310795)]] (duration: 03m 39s)
[17:58:42] <stashbot>	 T310795: Revive Labtestwikitech  (formerly: Abolish labtestwikitech) - https://phabricator.wikimedia.org/T310795
[17:58:43] * taavi done
[17:58:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:59:07] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking)
[18:00:21] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking)
[18:01:03] <wikibugs>	 10SRE, 10Discovery-Search (Current work): Possible problem communicating between racks F2 and F3 in EQIAD - https://phabricator.wikimedia.org/T315038 (10bking)
[18:02:18] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:03:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) p:05Triage→03Medium
[18:04:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) 05Open→03Resolved a:03BCornwall Hi, @mfossati, you've been given access so I'm going to close this ticket. Feel free to reopen if the issue isn't solved!
[18:04:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Search Airflow instance for mfossati - https://phabricator.wikimedia.org/T314853 (10BCornwall) a:05BCornwall→03Gehel
[18:06:40] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Requesting access to production / the sreadmins group for Clément Goubert - https://phabricator.wikimedia.org/T313902 (10BCornwall) 05Open→03Resolved Thanks for handling this! Since access has been granted, I'm going to close this ticket. Feel free to re-open if there's more...
[18:07:24] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (herron) Resolved→Open Hey @Papaul unfortunately I'm still seeing timeouts when connecting to this host:  ` --- logstash2003.mgmt.codfw.wmnet ping statistics --- 3 packet...
[18:15:53] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (Papaul) @herron  the first issue was "The host hasn't come back and I can't reach its mgmt " for the timeout issue i will check the firmware version if it is old i will upgrade...
[18:16:42] <wikibugs>	 SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (RKemper)
[18:17:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, SRE Observability, observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (Papaul)
[18:19:19] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform Value Stream, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (jcrespo) @Ottomata: The actionables of the task pending is to und...
[18:20:26] <wikibugs>	 ops-codfw, Gerrit, decommission-hardware, serviceops-radar, Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (Dzahn)
[18:20:50] <wikibugs>	 ops-codfw, Gerrit, decommission-hardware, serviceops-radar, Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (Dzahn)
[18:25:26] <wikibugs>	 ops-codfw, Gerrit, decommission-hardware, serviceops-radar, Release-Engineering-Team (The Decommission Mission 💀): decommission gerrit2001.wikimedia.org (dcops) - https://phabricator.wikimedia.org/T315040 (Dzahn) a:Papaul @Papaul This is WMF6408 in rack D5 U11.   The decom cookbook has fi...
[18:26:01] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q1): logstash2003 down, mgmt unreachable - https://phabricator.wikimedia.org/T315000 (herron) Thank you, although re: the first issue I still cannot reach the mgmt, or the host interface of logstash2003.  Ssh and ping both time out, and the host is flagged as dow...
[18:30:08] <wikibugs>	 SRE, Gerrit, serviceops, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn) This is now handed over to dcops for physical decom steps and continues still at T315040.
[18:32:34] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm, compiler output and overall" [puppet] - https://gerrit.wikimedia.org/r/822196 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[18:33:18] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:35:45] <wikibugs>	 (PS2) Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:36:19] <wikibugs>	 (CR) Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:36:29] <wikibugs>	 (PS3) Dzahn: scap: Provide a working SSH key pair for the scap keyholder agent [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:37:25] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] "NOT the prod key but a real key" [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:38:58] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] "@dduvall see how I adjusted the key comment, i think keyholder relies on the comment string" [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:44:09] <wikibugs>	 (PS6) Dzahn: phabricator: Support scap3 deployment of configuration [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[18:45:41] <wikibugs>	 (PS1) Andrew Bogott: clouddb2002-dev: make a db node [puppet] - https://gerrit.wikimedia.org/r/822432 (https://phabricator.wikimedia.org/T306854)
[18:46:59] <wikibugs>	 (CR) Majavah: "um, what's the use case of this?" [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:48:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] clouddb2002-dev: make a db node [puppet] - https://gerrit.wikimedia.org/r/822432 (https://phabricator.wikimedia.org/T306854) (owner: Andrew Bogott)
[18:48:37] <wikibugs>	 (CR) Ottomata: "We need the .deb to be installable first, in order to use this?" [puppet] - https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[18:50:04] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] "deploying within a cloud vps project from a local deployment server to the test instance" [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:50:19] <wikibugs>	 SRE, ops-codfw, Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh)
[18:50:49] <wikibugs>	 SRE, ops-codfw, Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh) p:Triage→Medium
[18:52:40] <wikibugs>	 (CR) Majavah: scap: Provide a working SSH key pair for the scap keyholder agent (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[18:53:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:53:54] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[18:57:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:58:50] <wikibugs>	 (CR) Dzahn: "The part I don't understand yet is why we remove the entire "phabricator::redirector" and "file {"${phabdir}/robots.txt". Is that really i" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[19:02:29] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (MRaishWMF) Hi @BCornwall, sorry for the delay and thanks for the ping. Yes, I had intended to add an SSH key to my account to facilitate some analytics...
[19:06:40] <wikibugs>	 SRE, Traffic, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (Gehel)
[19:06:51] <wikibugs>	 SRE, Traffic, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (Gehel) We check the ferm rules, which seem to open those ports as expected. I suspect there is something going on at a lower networ...
[19:11:07] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (BCornwall) Stalled→In progress @MRaishWMF Thanks for replying! Please note that we follow the [[ https://en.wikipedia.org/wiki/Principle_of_leas...
[19:11:28] <wikibugs>	 (CR) Dzahn: "I understand better now after looking at define phabricator::redirector. Because those all write into the phab conf dir. Compiling it.." [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[19:12:20] <wikibugs>	 (CR) Cwhite: [C: +2] logstash route k8s logs from proxy,httpd containers to webrequest partition [puppet] - https://gerrit.wikimedia.org/r/821323 (https://phabricator.wikimedia.org/T314139) (owner: Cwhite)
[19:12:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:13:33] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/36708/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[19:14:35] <wikibugs>	 (PS1) Papaul: Add new PDU model for ps1-c8 [puppet] - https://gerrit.wikimedia.org/r/822436 (https://phabricator.wikimedia.org/T310145)
[19:16:04] <wikibugs>	 (PS2) BCornwall: admin: Add SSH key to mraish user [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[19:16:25] <wikibugs>	 (CR) Papaul: [C: +2] Add new PDU model for ps1-c8 [puppet] - https://gerrit.wikimedia.org/r/822436 (https://phabricator.wikimedia.org/T310145) (owner: Papaul)
[19:17:50] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-c8-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[19:18:18] <wikibugs>	 (PS3) BCornwall: admin: Add SSH key to mraish user [puppet] - https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: Vgutierrez)
[19:19:50] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Papaul)
[19:20:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[19:20:55] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[19:22:10] <wikibugs>	 SRE, Infrastructure-Foundations, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (RKemper)
[19:22:50] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-c8-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[19:26:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:28:34] <icinga-wm>	 RECOVERY - Host cp2042 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms
[19:28:48] <sukhe>	 ^ mutante :D
[19:30:19] <mutante>	 sukhe: :) good old powercycle fixes it
[19:30:32] <mutante>	 but yea.. they are a bit mysterious then 
[19:30:54] <mutante>	 I recall those cases and then there was nothing in syslog.. just ..it was doing things..and then it got rebooted
[19:30:59] <sukhe>	 yeah... I am still curious why it happened at all
[19:31:01] <wikibugs>	 SRE, ops-codfw, Traffic: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh) Open→Resolved a:ssingh Thanks to recommendations by @Dzahn, I did the following:  ` racadm serveraction powercycle `  This...
[19:31:10] <sukhe>	 but this is around the time of the PDU upgrade, so I am guessing something because of that
[19:31:41] <mutante>	 so the issue is still that you could not use IPMI /ipmitool , right
[19:31:50] <mutante>	 this seems like it needs a DRAC reset
[19:32:01] <mutante>	 there are like 3 levels, soft, hard and factory reset afaik
[19:32:05] <sukhe>	 I could use it but I got the weird message I shared above
[19:32:12] <mutante>	 soft and hard you can do without resetting the password
[19:32:16] <sukhe>	 but yeah, it didn't work if that's what you meant but clearly it did connect (?)
[19:32:47] <mutante>	 now that the host is back you can do this  https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_locally?
[19:33:02] <mutante>	 it is "does IPMI work locally" and then "does it work remotely"
[19:33:22] <mutante>	 and then since you are directly on the DRAC via SSH, you can reset the DRAC and it might fix IPMI 
[19:33:49] <sukhe>	 IPMI seems fine
[19:33:49] <sukhe>	 sukhe@cp2042:~$ sudo ipmi-chassis --get-chassis-status
[19:33:50] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:33:53] <sukhe>	 System Power                        : on
[19:33:54] <sukhe>	 Power overload                      : false
[19:33:54] <sukhe>	 Interlock                           : inactive
[19:34:04] <mutante>	 ok, that's good
[19:35:00] <mutante>	 sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status
[19:35:03] <mutante>	 from a remote host
[19:35:07] <mutante>	 cumin host?
[19:35:12] <sukhe>	 yep
[19:35:47] <sukhe>	 following the recommendation just above, https://wikitech.wikimedia.org/wiki/Management_Interfaces#How_to_execute_remote_IPMI_commands
[19:35:54] <wikibugs>	 (PS1) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[19:36:08] <mutante>	 well, if it all works it's just one of those cases where we powercycle and it's back like nothing happened.. tag it "fluke" 
[19:36:27] <wikibugs>	 (PS1) Esanders: Enable DiscussionTools visual enhancements as beta everywhere except en/de/jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/822440 (https://phabricator.wikimedia.org/T312672)
[19:36:56] <wikibugs>	 (CR) CI reject: [V: -1] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney)
[19:36:58] <mutante>	 if it happens more than once.. it would go back to Dell and they might ask to upgrade firmware :p
[19:38:07] <mutante>	 there's probably a couple tickets for cp hosts doing this but not often enough 
[19:41:18] <wikibugs>	 (PS2) Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832)
[19:41:23] <mutante>	 !log disabling puppet on C:profile::phabricator::main
[19:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:53] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[19:44:42] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi)
[19:44:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi) p:Triage→High
[19:46:00] <sukhe>	 mutante: yeah I am going to ascribe it to a one-off or PDU upgrade for now and if it happens again, we will see :)
[19:46:22] <sukhe>	 for the cp hosts: mostly it's DIMM errors that racadm reports but this one is pretty new, at least for me
[19:48:11] <wikibugs>	 (PS1) Dzahn: phabricator: add /etc/phabricator/config.yaml for scap [puppet] - https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950)
[19:48:43] <mutante>	 sukhe: yea, agreed. we have had both types before afair
[19:49:08] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:49:09] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: add /etc/phabricator/config.yaml for scap [puppet] - https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950) (owner: Dzahn)
[19:51:39] <wikibugs>	 (CR) Dzahn: [C: +2] "disabled puppet on prod phab hosts and testing on phab2001" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[19:55:13] <wikibugs>	 (CR) Dzahn: [C: +2] "I am concerned about the change to scap::target that affects a lot more hosts than just phabricator hosts and it wasn't compiled on those." [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[20:00:05] <jouncebot>	 brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220811T2000).
[20:00:05] <jouncebot>	 koi and Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <Jdlrobson>	 o/ present
[20:00:18] <thcipriani>	 o/ I will be your brennen today
[20:02:01] <thcipriani>	 Jdlrobson: so since there's no train this week, wmf.24 isn't live, wmf.23 is (and then we'll deploy wmf.25, confusingly) so I'm going to tweak your backport to point to wmf.23
[20:02:23] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed on a couple other scap::target hosts in prod (gerrit,webperf,mwdebug,..)" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[20:02:28] <wikibugs>	 (PS1) Thcipriani: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952)
[20:02:50] <wikibugs>	 (CR) Thcipriani: [C: +2] Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952) (owner: Thcipriani)
[20:02:51] <Jdlrobson>	 oh right it should be wmf23
[20:02:57] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) The above patch uses the new puppet facts to define vlan sub-interface and bridge relations as described in...
[20:03:09] <Jdlrobson>	 thcipriani: thanks for noticing that :)
[20:03:14] <thcipriani>	 cool, just got to wait for jenkins :)
[20:03:24] <wikibugs>	 (CR) Dzahn: [C: +2] "well, it's NOT actually noop everywhere. this is what I had in mind with my previous concern:" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[20:03:40] <thcipriani>	 koi: ping for backport if you're around
[20:09:34] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (MRaishWMF) @BCornwall thanks again. I anticipate needing to SSH into stat machines in order to access Jupyter Lab and run spark queries. I'll update the...
[20:10:28] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (MRaishWMF)
[20:11:22] <Jdlrobson>	 thcipriani: do i need to backport to wmf24 too?
[20:11:37] <wikibugs>	 (PS1) Cwhite: logstash: replace legacy routing filters [puppet] - https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T314139)
[20:11:39] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Schniggendiller) Also on deWP: https://d...
[20:13:28] <thcipriani>	 Jdlrobson: nah, it'll never get deployed
[20:14:15] <wikibugs>	 ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[20:14:18] <thcipriani>	 We cut the branch every week regardless of whether we cancel train because the automation is "simpler"
[20:14:40] <Jdlrobson>	 ack
[20:20:20] <wikibugs>	 (CR) Dzahn: [C: +2] "checked on phab2001 next. this is all it does:" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[20:20:31] <wikibugs>	 (Merged) jenkins-bot: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/822402 (https://phabricator.wikimedia.org/T314952) (owner: Thcipriani)
[20:21:17] <wikibugs>	 SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (herron)
[20:21:47] <thcipriani>	 Jdlrobson: live on mwdebug1002, check please
[20:22:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:45] <dancy>	 thcipriani: Should we change that?  (unconditional branch cut)
[20:23:22] <thcipriani>	 I have no strong feelings about it
[20:23:33] <Jdlrobson>	 thcipriani: almost done
[20:23:52] <dancy>	 👍🏾
[20:23:52] <thcipriani>	 cool, thanks for testing :)
[20:23:53] <mutante>	 !log merging change on prod phabricator host to allow scap deployment, part 1
[20:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:39] <wikibugs>	 (CR) Dzahn: [C: +2] "deployed in prod. same as above on phab1001. puppet re-enabled" [puppet] - https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[20:25:11] <Jdlrobson>	 LGTM thcipriani please sync
[20:25:51] <thcipriani>	 going live now
[20:25:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:26:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:27:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:28:26] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:49] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.DesktopArticleTarget.init.js: Backport: [[gerrit:822396|Do not show incompatible skin warning when page is not editable (T314952)]] (duration: 03m 16s)
[20:29:53] <stashbot>	 T314952: Misleading message shows in skins where VE is compatible but the page because of its state isn't - https://phabricator.wikimedia.org/T314952
[20:29:59] <wikibugs>	 (Abandoned) Dzahn: phabricator: add /etc/phabricator/config.yaml for scap [puppet] - https://gerrit.wikimedia.org/r/822441 (https://phabricator.wikimedia.org/T313950) (owner: Dzahn)
[20:30:00] <thcipriani>	 ^ Jdlrobson should be live
[20:30:05] <wikibugs>	 SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (Krinkle)
[20:30:22] <thcipriani>	 koi: last ping for UTC late backport
[20:30:26] <Jdlrobson>	 thanks thcipriani will monitor the logs. Hoping to see some results there.
[20:30:32] <Jdlrobson>	 I appreciate your help!
[20:30:46] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:30:57] <thcipriani>	 Jdlrobson: anytime, thanks for testing, and shepherding the patch!
[20:31:12] <wikibugs>	 (PS6) Dzahn: phabricator: Support scap3 deployment of configuration [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[20:31:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:56] <wikibugs>	 SRE: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (Dzahn) @jbond per request from IRC
[20:36:18] <wikibugs>	 SRE: decom cookbook should ignore site.pp - https://phabricator.wikimedia.org/T314954 (Dzahn) p:Triage→Low
[20:36:46] <koi>	 sorry about that, here's me
[20:37:33] <wikibugs>	 (PS4) Thcipriani: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:37:50] <thcipriani>	 koi: o/ ready to backport some patches?
[20:38:01] <koi>	 yeah, sure!
[20:38:54] <wikibugs>	 (CR) Thcipriani: [C: +2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:39:57] <wikibugs>	 (Merged) jenkins-bot: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:40:36] <thcipriani>	 ^ koi looks like a noop, correct?
[20:41:10] <koi>	 thcipriani: yeah, the first patch is a noop
[20:41:13] <wikibugs>	 SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (Krinkle) Note that unlike most other packages, there is an especially sensitive dependency on the behaviour of librsvg which is the component of Thumbor responsible for converting SVGs...
[20:41:28] <thcipriani>	 k, syncing independently for completeness
[20:42:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:42:53] <wikibugs>	 (PS12) Thcipriani: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:43:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:43:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:44:45] <wikibugs>	 (CR) Thcipriani: [C: +2] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:44:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:45:36] <wikibugs>	 (Merged) jenkins-bot: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: Stang)
[20:46:01] <koi>	 Actually this one is a noop too :)
[20:46:12] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (Dzahn) @Novem_Linguae While we are still thinking about a better fix, there is at least one work around using WSL on Wind...
[20:46:18] <thcipriani>	 looks like most of them *should be* noops :)
[20:46:27] <thcipriani>	 but I'll still pull down and let you verify
[20:47:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi) I had a quick look and can't find any smoking gun so far.  The issue seems to be related to...
[20:47:24] <thcipriani>	 php-fpm restart is taking a moment...
[20:47:30] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806944|Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 07s)
[20:47:35] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[20:47:35] <stashbot>	 T305692: Support language fallback for logo variants - https://phabricator.wikimedia.org/T305692
[20:48:21] <thcipriani>	 koi: live on mwdebug1002 --- everything still looking good there?
[20:49:58] <koi>	 thcipriani: visited zhwiki's main page with different variants and nothing went wrong, I think we can move on
[20:49:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:50:10] <thcipriani>	 koi: ok, syncing, thanks for checking
[20:50:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:51:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:51:06] <wikibugs>	 (PS2) Thcipriani: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[20:51:09] <wikibugs>	 (CR) Thcipriani: [C: +2] zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[20:51:42] <wikibugs>	 (Abandoned) Jdlrobson: Do not show incompatible skin warning when page is not editable [extensions/VisualEditor] (wmf/1.39.0-wmf.24) - https://gerrit.wikimedia.org/r/822396 (https://phabricator.wikimedia.org/T314952) (owner: Jdlrobson)
[20:52:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:52:02] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - https://gerrit.wikimedia.org/r/822194 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[20:53:34] <thcipriani>	 koi: oh, scap won't let me sync it: Notice: Undefined variable: wmgSiteLogoVariantFallback in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 1057 Notice: Undefined variable: wmgSiteLogoVariantFallback in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 1062
[20:54:35] <thcipriani>	 I wonder if default null vs default false is the cause of ^
[20:54:50] <koi>	 checking
[20:55:13] <thcipriani>	 this is from: mwscript eval.php --wiki aawiki ''
[20:55:27] <thcipriani>	 scap runs that as a quick check pre-sync
[20:56:04] <koi>	 so, if I'd like to modify the default value, which is inside the first patch, what should I do now?
[20:57:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:57:18] <thcipriani>	 I can revert this and we can merge that one
[20:57:35] <thcipriani>	 if you've got a patch ready
[20:57:40] <brennen>	 note re: logspam-watch: ~/bin/brennen/logspam & ~/bin/brennen/logspam-watch are fixed
[20:57:41] <RhinosF1>	 thcipriani: fyi beta scap failed too with same reason
[20:57:46] <thcipriani>	 otherwise we can revert and try again another day
[20:57:54] <koi>	 to be clear, revert all merged patches
[20:58:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:58:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:58:08] <thcipriani>	 koi: right, if you want to try again another day
[20:58:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[20:58:46] <koi>	 ok, I'll try to come up with a patch
[20:58:55] <RhinosF1>	 urgh that page can't be good
[20:58:59] <RhinosF1>	 thcipriani: ^
[20:59:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:59:22] <bblack>	 hey
[20:59:33] <bblack>	 is there a known correlation to something deploy related above?
[20:59:34] <RhinosF1>	 We are mid deployment as an fyi
[20:59:45] <jhathaway>	 ok, looking as well
[20:59:58] <thcipriani>	 bblack: unclear, just deployed something, reverting (doubtful it's related, but reverting anyway)
[21:00:03] <RhinosF1>	 I can access meta here
[21:00:07] <bblack>	 thcipriani: ack, thank you!
[21:00:20] * Krinkle said something about re-ordering the patches at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799415/12#message-31cd0f8206f0fa7af1b692523d362bfc85c4bc06
[21:00:31] <Krinkle>	 glad we caught it before flooding logstash
[21:01:18] <Krinkle>	 thcipriani: I guess now that we have atomic deploys through fpm restarts, maybe syncing both at once would work.. e.g. over wmf-config/ as a whole.
[21:01:29] <Krinkle>	 not tried before, at your risk :)
[21:01:41] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (Dzahn) mailman3 upstream docs at https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/templates.htm...
[21:01:45] <Krinkle>	 I believe the merged state is without this error notice right?
[21:02:10] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (Dzahn)
[21:02:24] <thcipriani>	 Krinkle: we only got the first patch in this series out, should be a noop
[21:02:29] <thcipriani>	 scap caught the rest
[21:02:40] <thcipriani>	 first being: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/806944
[21:03:09] <thcipriani>	 (although 2 more are currently merged)
[21:03:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[21:04:16] <RhinosF1>	 bblack: ^ the graph looks like a very temp drop. Is it safe for things to carry on as normal?
[21:04:35] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: revert [[gerrit:806944|Define default value for "wmgSiteLogoVariants" (T305692 T308620)]] (duration: 03m 15s)
[21:04:39] <stashbot>	 T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620
[21:04:40] <stashbot>	 T305692: Support language fallback for logo variants - https://phabricator.wikimedia.org/T305692
[21:05:25] <thcipriani>	 koi: we've run over the window. I'm going to merge my reverts and let's try this again another day.
[21:06:33] <wikibugs>	 (PS1) Thcipriani: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822403
[21:06:48] <wikibugs>	 (CR) CI reject: [V: -1] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822403 (owner: Thcipriani)
[21:07:16] <koi>	 thcipriani: got it, will have another patch some other day
[21:07:18] <wikibugs>	 (PS1) Thcipriani: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/822404
[21:07:39] <thcipriani>	 koi: thanks and sorry.
[21:07:52] <wikibugs>	 (PS1) Thcipriani: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/822405
[21:08:29] <wikibugs>	 (CR) Thcipriani: [C: +2] Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/822404 (owner: Thcipriani)
[21:09:05] <bblack>	 re: the paging alert, I don't *think* the deploy was related.  Can't be certain, though.
[21:09:11] <wikibugs>	 (Merged) jenkins-bot: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/822404 (owner: Thcipriani)
[21:09:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:09:20] <RhinosF1>	 i doubt it was tbh
[21:09:36] <RhinosF1>	 unless it was only unavailable from the canaries
[21:09:53] <RhinosF1>	 Why does that certAlmostExpired go here
[21:10:04] <thcipriani>	 bblack: thanks for that update. Reverting because we were squeezing things in at the end of the window, and there were different minor errors.
[21:14:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443  - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:14:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:15:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:15:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:16:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:17:42] <wikibugs>	 (CR) Thcipriani: [C: +2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/822405 (owner: Thcipriani)
[21:18:30] <wikibugs>	 (Merged) jenkins-bot: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/822405 (owner: Thcipriani)
[21:19:01] <RhinosF1>	 thcipriani: beta CI has passed now after failing with the same issue as the prod sync
[21:19:38] <wikibugs>	 (PS2) Thcipriani: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822403
[21:20:33] <thcipriani>	 RhinosF1: proof of a production-like beta :)
[21:20:56] <wikibugs>	 (CR) Thcipriani: [C: +2] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822403 (owner: Thcipriani)
[21:21:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:21:37] <RhinosF1>	 thcipriani: yep!
[21:21:41] <wikibugs>	 (Merged) jenkins-bot: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/822403 (owner: Thcipriani)
[21:21:46] <RhinosF1>	 Beta has had a good run at being broken today though!
[21:22:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:22:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:23:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:24:43] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) > My goal is to proceed and update the automation to set switch interface access/trunk and allowed vlans onc...
[21:28:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:29:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:29:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:30:24] <thcipriani>	 ok, merged state matches deployed state once again
[21:30:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:39:10] <icinga-wm>	 PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[21:50:10] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36714/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[21:50:59] <wikibugs>	 (PS2) Cwhite: logstash: replace legacy routing filters [puppet] - https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T305175)
[21:51:01] <wikibugs>	 (PS1) Cwhite: logstash: use logstash routing for w3creportingapi stream [puppet] - https://gerrit.wikimedia.org/r/822450 (https://phabricator.wikimedia.org/T305175)
[21:51:03] <wikibugs>	 (PS1) Cwhite: logstash: add target index validation to tests [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090)
[21:51:05] <wikibugs>	 (PS1) Cwhite: logstash: update production w3creportingapi guard condition [puppet] - https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175)
[21:52:05] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: Support scap3 deployment of configuration [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[21:52:06] <icinga-wm>	 RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[21:54:40] <wikibugs>	 (CR) Dzahn: [C: +2] "this added a line for the "www" user in /etc/phabricator/config.yaml and otherwise was noop on phab2001" [puppet] - https://gerrit.wikimedia.org/r/818465 (https://phabricator.wikimedia.org/T313950) (owner: Jaime Nuche)
[21:55:20] <wikibugs>	 (PS1) Brennen Bearnes: logspam: handle higher-resolution timestamps [puppet] - https://gerrit.wikimedia.org/r/822453
[21:56:05] <wikibugs>	 (CR) CI reject: [V: -1] logstash: replace legacy routing filters [puppet] - https://gerrit.wikimedia.org/r/822444 (https://phabricator.wikimedia.org/T305175) (owner: Cwhite)
[21:57:01] <wikibugs>	 (CR) CI reject: [V: -1] logstash: add target index validation to tests [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite)
[21:58:53] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: Change local.json group to www-data / world readable [puppet] - https://gerrit.wikimedia.org/r/820779 (https://phabricator.wikimedia.org/T313950) (owner: Brennen Bearnes)
[22:02:34] <wikibugs>	 (PS1) Dzahn: Revert "scap: Provide a working SSH key pair for the scap keyholder agent" [labs/private] - https://gerrit.wikimedia.org/r/822466
[22:04:39] <wikibugs>	 (CR) Dzahn: "should this change be made on the local puppetmaster instead? but then don't we always have cherry-picks?" [labs/private] - https://gerrit.wikimedia.org/r/822466 (owner: Dzahn)
[22:15:43] <wikibugs>	 (PS1) BBlack: Add wikifunctions to MW canonical redirects [puppet] - https://gerrit.wikimedia.org/r/822455 (https://phabricator.wikimedia.org/T275904)
[22:17:58] <wikibugs>	 (PS1) Cwhite: logstash: do not overwrite partition in routing [puppet] - https://gerrit.wikimedia.org/r/822456 (https://phabricator.wikimedia.org/T314139)
[22:23:48] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: do not overwrite partition in routing [puppet] - https://gerrit.wikimedia.org/r/822456 (https://phabricator.wikimedia.org/T314139) (owner: Cwhite)
[22:27:11] <wikibugs>	 SRE-Access-Requests: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (Dzahn)
[22:27:27] <wikibugs>	 (PS2) Dzahn: admin: add demon to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048)
[22:28:18] <wikibugs>	 (PS3) Dzahn: admin: add demon to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048)
[22:28:44] <wikibugs>	 (PS4) Dzahn: admin: add demon to gerrit-root [puppet] - https://gerrit.wikimedia.org/r/817838 (https://phabricator.wikimedia.org/T315048)
[22:29:37] <brennen>	 mutante: can i get a quick +2 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/822453 ?
[22:29:49] <brennen>	 minor regex change for deployer log monitoring
[22:30:23] <brennen>	 (worst case it's already broken)
[22:31:08] <mutante>	 ok, yes, I recall this file
[22:31:55] <wikibugs>	 (CR) Dzahn: [C: +2] logspam: handle higher-resolution timestamps [puppet] - https://gerrit.wikimedia.org/r/822453 (owner: Brennen Bearnes)
[22:31:59] <brennen>	 thanks!
[22:33:24] <mutante>	 it affects mwlog1002/mwlog2002. change has been applied.. now. (ran puppet)
[22:35:33] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: access request: add user demon to shell group gerrit-roots - https://phabricator.wikimedia.org/T315048 (Dzahn) @thcipriani You are group approver for this shell group.
[22:37:39] <brennen>	 confirmed working; thx again.
[22:41:12] <mutante>	 :) laters then
[22:49:32] <wikibugs>	 (PS2) Cwhite: logstash: update production w3creportingapi guard condition [puppet] - https://gerrit.wikimedia.org/r/822452 (https://phabricator.wikimedia.org/T305175)
[22:53:00] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:00:33] <wikibugs>	 SRE: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (Legoktm)
[23:00:38] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows (mailman3 template names use colon in file names) - https://phabricator.wikimedia.org/T314698 (Legoktm)
[23:01:27] <wikibugs>	 SRE: Mailman3 templates with colons in filename made operations/puppet not cloneable on Windows - https://phabricator.wikimedia.org/T282308 (Legoktm) Sorry, this slipped off my radar to work on. The proper fix I had planned is to deploy the mailman-templates Debian package (https://gerrit.wikimedia.org/g/ope...
[23:02:32] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:04] <wikibugs>	 (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/822451 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite)
[23:27:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:36:50] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Glrx) I do not know PHP or Python, but here are the changes needed to wiki configuration, SVGHandler.php, and Thumbor's svg.py.  * https://commons.wikimedia.org/wiki/User:Glrx...
[23:39:27] <wikibugs>	 SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (Glrx) >>! In T265549#8147272, @Glrx wrote: > I do not know PHP or Python, but here are the changes needed to wiki configuration, SVGHandler.php, and Thumbor's svg.py. >  > * https://com...