[00:01:28] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)480 ge (W)60 ge 3 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[00:14:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:16:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:23:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:27:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:06] <wikibugs>	 (CR) Gergő Tisza: [C: +1] mediawiki/maintenance/growthexperiments.pp: Add --statsd to updateMenteeData.php [puppet] - https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: Urbanecm)
[00:58:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:03:41] <wikibugs>	 (PS1) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715808 (https://phabricator.wikimedia.org/T289941)
[01:03:51] <wikibugs>	 (PS1) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941)
[01:06:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:08:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:18:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:31:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:35:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:20] <wikibugs>	 (PS6) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:41:07] <wikibugs>	 (PS7) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:41:48] <wikibugs>	 (PS8) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:44:56] <wikibugs>	 (PS9) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[02:17:19] <wikibugs>	 (PS1) Krinkle: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013)
[02:45:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:00:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:10:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:11:11] <wikibugs>	 SRE, MediaWiki-Uploading, Traffic, serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (Krinkle)
[03:21:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:23:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:33:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:35:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:35:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P::toolforge::apt_pinning: bullseye support [puppet] - https://gerrit.wikimedia.org/r/715700 (owner: Majavah)
[03:45:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:46:50] <icinga-wm>	 PROBLEM - Disk space on dbprov2001 is CRITICAL: DISK CRITICAL - free space: /srv 286202 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2001&var-datasource=codfw+prometheus/ops
[03:47:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:22] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:41] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-webproxy.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[04:14:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:16:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:16:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:00] <marostegui>	 !log Optimize arwiki.flaggedtemplates T290057
[04:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:05] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[04:28:19] <wikibugs>	 (PS1) Marostegui: db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/715841 (https://phabricator.wikimedia.org/T288803)
[04:28:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:17] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:37:19] <wikibugs>	 (CR) Marostegui: [C: +2] db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/715841 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[04:41:11] <marostegui>	 !log Optimize idwiki.flaggedtemplates T290057
[04:41:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:16] <stashbot>	 T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[04:49:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:51:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:11:24] <wikibugs>	 (PS6) Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877)
[05:16:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:05] <wikibugs>	 (CR) Juan90264: [C: +1] "I add more experienced reviewers to review this change, which finds ONE MONTH in need of a simple review. Could any you could help me? Ple" [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[05:23:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:25:03] <effie>	 !log depool mw2251 mw2255 parse2001 for tests - T280497
[05:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:08] <stashbot>	 T280497: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497
[06:05:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:07:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:10:44] <icinga-wm>	 RECOVERY - Disk space on dbprov2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2001&var-datasource=codfw+prometheus/ops
[06:23:49] <wikibugs>	 (PS1) Elukey: sre.puppet.renew-cert: replace RemoteHosts with Nodeset for icinga [cookbooks] - https://gerrit.wikimedia.org/r/715912
[06:26:08] <wikibugs>	 (CR) Volans: [C: +1] "Good catch!" [cookbooks] - https://gerrit.wikimedia.org/r/715912 (owner: Elukey)
[06:27:25] <elukey>	 insta-review! 
[06:27:27] <elukey>	 :D
[06:27:36] <wikibugs>	 (CR) Elukey: [C: +2] sre.puppet.renew-cert: replace RemoteHosts with Nodeset for icinga [cookbooks] - https://gerrit.wikimedia.org/r/715912 (owner: Elukey)
[06:27:39] <volans>	 you got lucky
[06:27:52] <elukey>	 ahahhaha
[06:27:56] <elukey>	 thanks :)
[06:28:04] <elukey>	 going to run the cookbook for sodium in a bit
[06:28:17] <volans>	 great
[06:28:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for sodium.wikimedia.org: Renew puppet certificate - elukey@cumin1001
[06:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:33] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sodium.wikimedia.org: Renew puppet certificate - elukey@cumin1001
[06:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:32:05] <elukey>	 ran puppet on sodium, all good
[06:33:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:35:12] <wikibugs>	 (CR) Jcrespo: "I am ready to deploy, should I wait for +1 from Amir?" [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[06:35:51] <volans>	 elukey: thanks for testing,
[06:36:22] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[07:05:52] <XioNoX>	 !log pfw NAT and ACLs changes - T290077
[07:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (ayounsi) Next step is to open a ticket with the vendor if possible.
[07:15:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:16:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:17:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:23:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:56] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:58] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:26] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:29:46] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:36:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:45:13] <ema>	 !log deploy Varnish SLO dashboard with grr apply slo_dashboards.jsonnet T289036
[07:45:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:19] <stashbot>	 T289036: Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036
[07:46:42] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm and better than the generic "error loading config file" from kubectl" [puppet] - https://gerrit.wikimedia.org/r/715698 (owner: JMeybohm)
[07:51:47] <wikibugs>	 SRE, Traffic, SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (ema) >>! In T289036#7321876, @herron wrote: > Also, I updated the wikitech docs with this information as well as a hint to run 'grr preview' in these cases, whi...
[07:52:34] <wikibugs>	 (CR) JMeybohm: [C: +2] kube_env: Error out of user has no read permission to kubeconfig [puppet] - https://gerrit.wikimedia.org/r/715698 (owner: JMeybohm)
[07:57:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:59:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:03:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:06:36] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) Open→Resolved I see you've deployed all eventgates, thanks! Resolving this
[08:06:42] <wikibugs>	 SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:07:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:07:42] <wikibugs>	 (PS2) JMeybohm: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - https://gerrit.wikimedia.org/r/715454
[08:07:57] <wikibugs>	 (PS3) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:08:33] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (fgiunchedi) >>! In T262668#7322172, @jcrespo wrote: > I made a mistake by an order of magnitude, we have backed up approximatel...
[08:10:52] <icinga-wm>	 RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 137, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:14:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={rails,webperf_navtiming} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:15:43] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (fgiunchedi) cc @Muehlenhoff and @jbond for input on what the correct action is here, namely to either add the @wikimedia.org email or tweak `cross-validate-accounts` to account for this cond...
[08:15:47] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) > I think we should crank concurrency up and see how much read throughput we can get. Maintenance/rebalance is ongoing...
[08:16:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:40] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: JMeybohm)
[08:22:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:24:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:28:56] <wikibugs>	 (PS3) Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo)
[08:29:46] <wikibugs>	 (PS4) Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo)
[08:29:53] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: JMeybohm)
[08:30:11] <godog>	 jynus: a little update ^ I think it is good to merge
[08:30:46] <jynus>	 thank you very much for the update, I was going to send that, but got distracted
[08:30:58] <wikibugs>	 (CR) Jcrespo: [C: +1] admin: Update approver of analytics groups [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo)
[08:31:14] <godog>	 sure no worries, I'm processing access requests
[08:31:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: Update approver of analytics groups [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo)
[08:31:25] <wikibugs>	 (PS5) Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - https://gerrit.wikimedia.org/r/715259 (owner: Jcrespo)
[08:36:00] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: JMeybohm)
[08:39:32] <wikibugs>	 (CR) Ema: varnish: Containerize varnish test environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[08:43:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:46:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (fgiunchedi)
[08:46:29] <wikibugs>	 (PS1) Jcrespo: dbbackups: Migrate s4 generation from db2097 (stretch) to db2139 (buster) [puppet] - https://gerrit.wikimedia.org/r/715919 (https://phabricator.wikimedia.org/T288803)
[08:46:34] <wikibugs>	 (PS1) Filippo Giunchedi: admin: add jmando [puppet] - https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606)
[08:47:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:48:13] <wikibugs>	 (CR) Klausman: [C: +1] kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[08:51:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:52:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17200154440 and 51509 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:52:24] <godog>	 since access is already approved on task I guess I can just go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715920 ?
[08:53:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:54:52] <wikibugs>	 (CR) MMandere: varnish: Containerize varnish test environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[08:55:19] <wikibugs>	 (CR) Jcrespo: [C: +1] admin: add jmando [puppet] - https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: Filippo Giunchedi)
[08:55:33] <majavah>	 godog: the uid does not match, on wmcs the unix name for 33218 is `jm` instead of `jmando`, and afaik those should match for new accounts
[08:56:57] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[08:57:07] <godog>	 majavah: thank you I wasn't aware of this fact, do you know where it is documented?
[08:57:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:58:25] <majavah>	 godog: https://github.com/wikimedia/puppet/blob/production/modules/admin/README.md#adding-a-new-human-user kind of, here the uid number is the same but shell account name is different
[08:59:06] <majavah>	 (the wikitech account name is User:Jmando, but shell name is set to `jm`)
[08:59:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:00:14] <wikibugs>	 (CR) Marostegui: [C: +1] dbbackups: Migrate s4 generation from db2097 (stretch) to db2139 (buster) [puppet] - https://gerrit.wikimedia.org/r/715919 (https://phabricator.wikimedia.org/T288803) (owner: Jcrespo)
[09:00:40] <wikibugs>	 SRE, Performance-Team: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (fgiunchedi)
[09:02:16] <wikibugs>	 (CR) Ema: [C: -1] varnish: Allow SSR=2 on XCPS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:02:24] <godog>	 majavah: ah yes of course, I'll fix it
[09:03:43] <wikibugs>	 (PS1) MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488)
[09:03:47] <wikibugs>	 (PS2) Filippo Giunchedi: admin: add jm [puppet] - https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606)
[09:08:25] <wikibugs>	 (PS5) Vgutierrez: haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005)
[09:11:03] <wikibugs>	 (PS2) Vgutierrez: varnish: Allow SSR=2 on XCPS [puppet] - https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421)
[09:12:27] <wikibugs>	 (CR) Vgutierrez: varnish: Allow SSR=2 on XCPS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:12:31] <wikibugs>	 (03CR) 10MVernon: "[sorry, confused by gerrit UI, re-adding the two people Review-bot put on]" [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[09:13:19] <wikibugs>	 (03CR) 10Kormat: "Can you also make the equivalent change for mysql@.service? That will take care of multi-instance hosts." [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[09:14:11] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm)
[09:17:41] <wikibugs>	 (03CR) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (031 comment) [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[09:17:55] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] cxserver: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm)
[09:18:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) Thanks. I'd say you can close this one down, thanks for your and your team's support.
[09:21:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10ayounsi) 05Open→03Resolved a:03cmooney Great news!  Out of curiosity, is it possible to know the root cause?  Thanks
[09:21:31] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm)
[09:23:25] <marostegui>	 !log Drop flaggedrevs_stats and flaggedrevs_stats2 from dewiki T289050
[09:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:32] <stashbot>	 T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[09:23:33] <wikibugs>	 (03PS2) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488)
[09:24:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) We're still not aware of the root cause, but it certainly isn't yourselves given some recent testing we've conducted.
[09:24:23] <wikibugs>	 (03CR) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (031 comment) [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[09:26:44] <wikibugs>	 (03PS1) 10Filippo Giunchedi: admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259)
[09:30:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, is this going to bounce haproxy on deploy? also please attach a PCC run" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[09:33:42] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30951/console" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[09:37:22] <wikibugs>	 (03CR) 10Kormat: "This looks good :) I guess the next step before merging this is to make these exact changes manually to a pontoon host, and check that the" [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[09:37:27] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] haproxy: Use systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[09:37:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:41:20] <wikibugs>	 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10ayounsi)
[09:46:11] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[09:49:37] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jbond) @fgiunchedi as they don't have a wikimedia.org email we should move them out of the WMF group and add them to the NDA group.  As they are a contractor they should have an NDA (cc: @KF...
[09:51:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:51:39] <wikibugs>	 (03CR) 10Ema: [C: 03+1] varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[09:52:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: 10Filippo Giunchedi)
[09:53:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:55:48] <wikibugs>	 (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) (owner: 10Filippo Giunchedi)
[10:03:06] <wikibugs>	 (03PS1) 10Jbond: admin: update approval from String to Array[String] [puppet] - 10https://gerrit.wikimedia.org/r/715931
[10:08:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 91637656248 and 1538 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:09:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95257356328 and 1599 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:09:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95633061768 and 1604 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:17:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[10:20:58] <jbond>	 !log start filtering more puppet facts G:715461 - T263578
[10:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:05] <stashbot>	 T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578
[10:23:29] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[10:25:39] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[10:25:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:26:52] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02126 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:27:29] <jbond>	 ^^ this is me, will resolve shortly
[10:27:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:34:21] <wikibugs>	 (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715933
[10:35:07] <godog>	 the navtiming job failure is metrics spam, reported as https://phabricator.wikimedia.org/T290138
[10:35:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:35:50] <wikibugs>	 (03PS1) 10MVernon: mariadb::misc::db_inventory: use mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/715934
[10:36:30] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001149 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:36:41] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon)
[10:38:14] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 227105274464 and 57770 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:38:14] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 184745726840 and 3227 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:38:14] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 184420266392 and 3221 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:38:14] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 187150286920 and 3275 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:39:09] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm)
[10:40:12] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:38] <wikibugs>	 (03PS6) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488)
[10:45:10] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon)
[10:49:31] <wikibugs>	 (03PS1) 10Jbond: puppetdb: also add block_devices to blacklisted facts [puppet] - 10https://gerrit.wikimedia.org/r/715937
[10:50:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:52:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:53:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: also add block_devices to blacklisted facts [puppet] - 10https://gerrit.wikimedia.org/r/715937 (owner: 10Jbond)
[10:58:21] <wikibugs>	 (03PS8) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:04:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:06:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:09:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: update approval from String to Array[String] [puppet] - 10https://gerrit.wikimedia.org/r/715931 (owner: 10Jbond)
[11:13:46] <urbanecm>	 Hello, could someone puppet merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715723 for me please? It already has a +1 from another member of my team (Growth). Thanks!
[11:18:20] <wikibugs>	 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle)
[11:19:22] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[11:19:52] <Krinkle>	 !log effie restarted php-fpm on parse2007.codfw.wmnet, ref T290120.
[11:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:56] <stashbot>	 T290120: Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use in XWikimediaDebug.php - https://phabricator.wikimedia.org/T290120
[11:20:08] <effie>	 Krinkle:  she has not done that yet though :p
[11:20:18] <effie>	 I will ask her to do so on your behaldf
[11:20:22] <effie>	 behalf*
[11:20:24] <Krinkle>	 oh :P
[11:20:53] <Krinkle>	 The graph dropped off.
[11:20:56] <urbanecm>	 theoretically Krinkle would be able to do it themselves (via mwdeploy and https://wikitech.wikimedia.org/wiki/Keyholder). Not convenient, I know :).
[11:21:07] <Krinkle>	 but I've been fooled by incomplete data for the past 2 minutes
[11:21:14] <Krinkle>	 it's corrected itself now
[11:21:21] <effie>	 it is restarted now 
[11:21:28] <effie>	 so we will keep monitoring 
[11:22:10] <Krinkle>	 this was quite a high level of fatals
[11:22:13] * Krinkle looks at alerts
[11:23:21] <wikibugs>	 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Urbanecm) I support this. After all, any deployer already has sufficient access to SSH in via the `mwdeploy` syste...
[11:23:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:24:11] <wikibugs>	 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle)
[11:24:25] <wikibugs>	 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) Do we have deployment this week? {T281164} has been created as usual, covering the Train Deployment for the week of September 13th.
[11:27:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:29:05] <Krinkle>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=20&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200
[11:29:10] <Krinkle>	 10% of parsoid POSTs were failing
[11:29:32] <Krinkle>	 for 10 hours
[11:30:07] <Krinkle>	 not sure why those were more affected..
[11:30:08] <Krinkle>	 https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1
[11:30:29] <Krinkle>	 but in terms of overall 5xx, it wasn't a huge spike given the background noise of timeouts and OOMs on parsoid normally
[11:31:50] <Krinkle>	 I'm gonna call this an incident and write up a brief report.
[11:32:06] <wikibugs>	 (03PS1) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943
[11:33:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[11:33:17] <wikibugs>	 (03PS8) 10Jcrespo: backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[11:33:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:37:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:40:07] <wikibugs>	 (03CR) 10Physikerwelt: "Amazing. After reading https://wikitech.wikimedia.org/wiki/Mathoid#Deployment I understand that this is generated from I838686b494bcfd4b62" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715933 (owner: 10PipelineBot)
[11:41:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:43:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:44:28] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:55:10] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Dzahn) ACK, alright!
[11:57:42] <wikibugs>	 (03PS2) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943
[12:09:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez)
[12:09:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add jm [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: 10Filippo Giunchedi)
[12:09:34] <wikibugs>	 (03PS3) 10Filippo Giunchedi: admin: add jm [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606)
[12:18:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) (owner: 10Filippo Giunchedi)
[12:18:41] <wikibugs>	 (03PS2) 10Filippo Giunchedi: admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259)
[12:20:38] <wikibugs>	 (03PS1) 10Jbond: facter networking: filter k8s interfaces out of the networking fact [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904)
[12:21:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:23:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:23:58] <wikibugs>	 (03CR) 10Ladsgroup: "😄" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[12:28:01] <wikibugs>	 (03PS1) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950
[12:28:05] <wikibugs>	 (03CR) 10Dzahn: "@John should I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[12:28:16] <wikibugs>	 (03CR) 10Jbond: admin: create new sre-admins group to match the ldap group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[12:29:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond)
[12:35:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[12:38:14] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[12:38:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:41:39] <wikibugs>	 (03Abandoned) 10Phuedx: Disable Page Previews IRC alerts [puppet] - 10https://gerrit.wikimedia.org/r/648237 (owner: 10Phuedx)
[12:41:50] <mutante>	 !log planet1002 - rm /etc/rawdog/en/feeds/39a7970f.state (corrupt) T289984
[12:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:55] <stashbot>	 T289984: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984
[12:43:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:45:18] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:45:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle)
[12:46:11] <wikibugs>	 (03PS1) 10Dzahn: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951
[12:46:36] <wikibugs>	 (03PS2) 10Dzahn: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951
[12:46:52] <wikibugs>	 (03PS3) 10Ema: varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[12:46:54] <wikibugs>	 (03PS1) 10Ema: varnish: add tests for unknown XCPS session reuse [puppet] - 10https://gerrit.wikimedia.org/r/715952 (https://phabricator.wikimedia.org/T271421)
[12:46:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:47:53] <godog>	 !log bounce webperf on webperf2001 - T290138
[12:47:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:58] <stashbot>	 T290138: navtiming prometheus scrape timeout and metric spamming - https://phabricator.wikimedia.org/T290138
[12:48:11] <wikibugs>	 (03PS4) 10Ladsgroup: Drop wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080)
[12:48:17] <wikibugs>	 (03CR) 10Ladsgroup: Drop wikidata alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[12:48:34] <godog>	 !log s/webperf/navtiming/
[12:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:50:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 (owner: 10Dzahn)
[12:53:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle)
[12:53:36] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 (owner: 10Dzahn)
[12:57:04] <wikibugs>	 (03CR) 10Michael DiPietro: [C: 03+2] update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640 (owner: 10Michael DiPietro)
[12:58:22] <wikibugs>	 10SRE, 10Observability-Metrics, 10observability, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (10Aklapper)
[12:59:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30957/console" [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[12:59:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Drop wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[13:01:19] <wikibugs>	 (03PS1) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254)
[13:01:22] <wikibugs>	 (03PS1) 10Urbanecm: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254)
[13:01:41] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[13:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:25] <mutante>	 !log planet1002 - temp removing feed from ad.huikeshoven - seems to cause corrupt state file (T289984)
[13:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:29] <stashbot>	 T289984: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984
[13:05:42] <icinga-wm>	 PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:34] <icinga-wm>	 PROBLEM - Check systemd state on ores2006 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:29] <wikibugs>	 (03PS1) 10Urbanecm: dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420)
[13:07:38] <icinga-wm>	 PROBLEM - Check systemd state on ores2002 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:08] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle)
[13:10:12] <icinga-wm>	 PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:56] <icinga-wm>	 PROBLEM - Check systemd state on ores2009 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:58] <icinga-wm>	 PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:24] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Good, I am guessing it will be correct :)" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[13:13:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, optional addition inline" [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond)
[13:14:05] <wikibugs>	 (03PS1) 10Krinkle: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013)
[13:14:12] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle)
[13:14:54] <icinga-wm>	 PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:42] <icinga-wm>	 PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:42] <icinga-wm>	 PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:35] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[13:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:54] <icinga-wm>	 PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:08] <icinga-wm>	 PROBLEM - ores_workers_running on ores2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:17:41] <wikibugs>	 (03PS2) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779)
[13:17:43] <wikibugs>	 (03PS1) 10Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958
[13:18:18] <icinga-wm>	 PROBLEM - ores_workers_running on ores1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:18:42] <icinga-wm>	 PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:18:54] <icinga-wm>	 PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond)
[13:19:49] <elukey>	 mmmm weird, checking ores
[13:21:24] <icinga-wm>	 PROBLEM - ores_workers_running on ores1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:21:44] <elukey>	 so on ores2001 celery seems to have gone through a stop/start, and then celery doesn't start anymore due to a mismatch in parameters
[13:21:46] <icinga-wm>	 PROBLEM - ores_workers_running on ores2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:22:04] <icinga-wm>	 PROBLEM - Check systemd state on ores1005 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:22:23] <elukey>	 yeah the unit changed
[13:22:42] <icinga-wm>	 PROBLEM - Check systemd state on ores1009 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:01] <wikibugs>	 (03PS3) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779)
[13:24:40] <icinga-wm>	 PROBLEM - ores_workers_running on ores1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:24:42] <elukey>	 I have disabled puppet on ores-codfw, some workers are up, the aim is to save those
[13:25:02] <icinga-wm>	 PROBLEM - ores_workers_running on ores2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:25:11] <elukey>	 from the logs the unit changed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/715772, or better while applying it
[13:25:17] <elukey>	 but it seems completely unrelated
[13:25:42] <icinga-wm>	 PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:25:52] <icinga-wm>	 PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:26:07] <elukey>	 ah no https://gerrit.wikimedia.org/r/c/operations/puppet/+/714640/3/modules/celery/templates/initscripts/celery.systemd.erb is the issue
[13:26:26] <icinga-wm>	 PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:26:43] <elukey>	 mdipietro: o/
[13:26:50] <icinga-wm>	 PROBLEM - Check systemd state on ores1003 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:05] <elukey>	 are you around? https://gerrit.wikimedia.org/r/c/operations/puppet/+/714640 is causing an outage for ORES
[13:27:16] <elukey>	 the parameters seem not ok
[13:27:31] <elukey>	 in the logs I see Sep 01 13:12:29 ores2001 celery-ores-worker[33774]: usage: celery <command> [options]
[13:27:46] <icinga-wm>	 PROBLEM - ores_workers_running on ores1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:27:53] <mdipietro>	 What's an ores worker?
[13:28:04] <icinga-wm>	 PROBLEM - Check systemd state on ores2004 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:18] <elukey>	 it is our ML serving infrastructure, it runs uwsgi + celery
[13:28:19] <elukey>	 on stretch
[13:28:35] <wikibugs>	 (03PS1) 10Dzahn: miscweb: set a global ServerName to suppress log warnings [container/miscweb] - 10https://gerrit.wikimedia.org/r/715959
[13:28:41] <RhinosF1>	 elukey: that patch says it breaks stretch
[13:28:47] <mdipietro>	 Oh, I think I see: that's used by more than quarry
[13:28:49] <mdipietro>	 Let's revert
[13:28:56] <elukey>	 RhinosF1: yes :)
[13:29:00] <elukey>	 mdipietro: thanks :)
[13:29:20] <wikibugs>	 (03PS1) 10Michael DiPietro: Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819
[13:29:32] <icinga-wm>	 PROBLEM - ores_workers_running on ores1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:29:44] <RhinosF1>	 mdipietro: you might want to check in future that puppet code you're touching isn't used by other stuff
[13:29:59] <wikibugs>	 (03Merged) 10jenkins-bot: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle)
[13:30:04] <icinga-wm>	 PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:39] <mdipietro>	 first time I've run a revert, will it need the puppet-merge step? I'm not seeing the revert there
[13:31:08] <majavah>	 you need to merge the revert in gerrit first, like any other change
[13:31:14] <RhinosF1>	 You need to submit on gerrit first
[13:31:35] <elukey>	 mdipietro: yes please +2, puppet-merge
[13:31:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819 (owner: 10Michael DiPietro)
[13:32:04] <wikibugs>	 (03CR) 10Michael DiPietro: [C: 03+2] Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819 (owner: 10Michael DiPietro)
[13:33:23] <mdipietro>	 Ok, it's reverted and puppet-merge has run
[13:33:39] <elukey>	 ack perfect :)
[13:33:49] <elukey>	 running puppet on ores to see if it recovers
[13:33:50] * Krinkle tests on mwdebug2002
[13:34:17] <wikibugs>	 (03Merged) 10jenkins-bot: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle)
[13:35:22] <icinga-wm>	 RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:35:45] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, nit inline. Not sure how thoroughly we need to test it before merge." [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[13:35:56] <icinga-wm>	 RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:10] <icinga-wm>	 RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:36:12] <icinga-wm>	 RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:14] <icinga-wm>	 PROBLEM - ores_workers_running on ores2004 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:36:16] <RhinosF1>	 Where is gerritbot
[13:36:20] <icinga-wm>	 RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:20] <icinga-wm>	 RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:30] <icinga-wm>	 RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:36:42] <icinga-wm>	 RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:42] <icinga-wm>	 RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:42] <icinga-wm>	 RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:44] <RhinosF1>	 majavah: do you know how to kick gerrit bot?
[13:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:46] <icinga-wm>	 RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:46] <icinga-wm>	 PROBLEM - ores_workers_running on ores2002 is CRITICAL: PROCS CRITICAL: 6 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:37:00] <icinga-wm>	 RECOVERY - ores_workers_running on ores1004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:37:04] <majavah>	 RhinosF1: which part of it needs kicking?
[13:37:24] <icinga-wm>	 RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:31] <RhinosF1>	 majavah: it's not online
[13:37:36] <icinga-wm>	 RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:36] <icinga-wm>	 RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:36] <icinga-wm>	 RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:37] <icinga-wm>	 RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:37:45] <RhinosF1>	 #wikimedia-dev is silent and so is this channel
[13:37:52] <icinga-wm>	 PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:38:04] <icinga-wm>	 RECOVERY - ores_workers_running on ores2003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:38:12] <icinga-wm>	 RECOVERY - ores_workers_running on ores2002 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:38:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:26] <icinga-wm>	 RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:38:28] <majavah>	 what do you mean? wikibugs does both gerrit and phab and it seems to be online and sending things
[13:38:33] <wikibugs>	 (03PS2) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254)
[13:38:49] <elukey>	 mdipietro: ok I think we are good! 
[13:38:58] <icinga-wm>	 RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:38:58] <icinga-wm>	 RECOVERY - ores_workers_running on ores1002 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:39:01] <mdipietro>	 👍
[13:39:04] <icinga-wm>	 RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:14] <icinga-wm>	 RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:39:18] <icinga-wm>	 RECOVERY - ores_workers_running on ores2009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:39:50] <icinga-wm>	 RECOVERY - ores_workers_running on ores1009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:40:24] <RhinosF1>	 majavah: oh I remember
[13:40:24] <icinga-wm>	 RECOVERY - ores_workers_running on ores1008 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:40:44] <RhinosF1>	 majavah: I put it on ignore to make finding a message easier earlier
[13:40:52] <RhinosF1>	 I did not remove it
[13:40:52] <icinga-wm>	 RECOVERY - ores_workers_running on ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:42:24] <icinga-wm>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:42:47] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) @JMando access has been set up, please confirm the following:  * SSH access is working * the kerberos initial password (sent via email) has been changed  thank you!
[13:43:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "Another thing to consider is that we will also need to add new group to pws as without the management password the reimage script is not t" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[13:43:24] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10fgiunchedi) @NForrester  access has been set up, please confirm the following:  * SSH access is working * the kerberos initial...
[13:44:07] <wikibugs>	 (03PS1) 10Ladsgroup: Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080)
[13:44:51] <wikibugs>	 (03PS2) 10Ladsgroup: Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080)
[13:45:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:45:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet
[13:45:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:23] <logmsgbot>	 !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.21/includes/resourceloader: Id7c258841d7816 (duration: 01m 49s)
[13:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: rsync::quickdatacopy: Allow having multiple destination hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm)
[13:47:52] <icinga-wm>	 RECOVERY - ores_workers_running on ores2006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[13:48:27] <wikibugs>	 (03PS1) 10Dzahn: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963
[13:48:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: adapt alertmanager-webhook-logger to ECS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite)
[13:48:35] <logmsgbot>	 !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.20/includes/resourceloader: Id7c258841d7816 (duration: 01m 06s)
[13:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:12] <icinga-wm>	 RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:06] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164)
[13:51:11] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet
[13:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:34] <wikibugs>	 (03PS3) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943
[13:51:51] <wikibugs>	 (03CR) 10Jbond: "fixed" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[13:52:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm)
[13:52:18] <icinga-wm>	 RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron)
[13:53:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[13:53:40] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Add foundation.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/715966 (https://phabricator.wikimedia.org/T290164)
[13:54:57] <wikibugs>	 (03PS2) 10Urbanecm: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164)
[13:55:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] [beta] Add foundation.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/715966 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm)
[13:55:26] <urbanecm>	 mutante: that was quick, thanks! Was just going to ping you tbh :D
[13:55:34] <mutante>	 hehe, I could feel 
[13:55:37] <wikibugs>	 (03CR) 10Jbond: facter networking: filter k8s interfaces out of the networking fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond)
[13:55:51] <mutante>	 merged on master
[13:56:04] <wikibugs>	 (03PS1) 10Jgreen: add a/ptr records for payments-staging.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/715968 (https://phabricator.wikimedia.org/T289869)
[13:56:07] <urbanecm>	 thanks. I'll run puppet on the beta hosts.
[13:58:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30958/console" [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[13:58:36] <urbanecm>	 jouncebot: now
[13:58:37] <jouncebot>	 No deployments scheduled for the next 4 hour(s) and 1 minute(s)
[13:58:39] <urbanecm>	 jouncebot: next
[13:58:39] <jouncebot>	 In 4 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800)
[13:58:39] <jouncebot>	 In 4 hour(s) and 1 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800)
[13:59:02] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] add a/ptr records for payments-staging.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/715968 (https://phabricator.wikimedia.org/T289869) (owner: 10Jgreen)
[13:59:41] <wikibugs>	 (03CR) 10Volans: "some first comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond)
[14:00:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup)
[14:00:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm)
[14:01:17] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm)
[14:01:26] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[14:04:34] <godog>	 !log move simone-this-dot from wmf to nda ldap group - T289783
[14:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:38] <stashbot>	 T289783: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783
[14:04:52] <mutante>	 godog: :) was wondering about that
[14:05:04] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10fgiunchedi) >>! In T289783#7324124, @jbond wrote: > @fgiunchedi as they don't have a wikimedia.org email we should move them out of the WMF group and add them to the NDA group.  As they are...
[14:05:09] <godog>	 mutante: yeah should be fine now
[14:05:14] <mutante>	 cool, thanks
[14:07:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:07:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:08:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn)
[14:09:39] <wikibugs>	 (03PS2) 10Dzahn: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963
[14:12:00] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[14:14:09] <wikibugs>	 (03CR) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[14:14:36] <volans>	 godog: FYI icinga is not happy because Service notification command 'notify-service-by-irc-wikidata' specified for contact 'irc-wikidata'
[14:14:43] <volans>	 I guess related to the removal of the related stuff
[14:14:52] <volans>	 same for notify-host-by-irc-wikidata
[14:16:08] <godog>	 I'll take a look
[14:16:35] <mutante>	 yea, the contact uses that command but it's gone
[14:18:16] <wikibugs>	 (03PS1) 10Effie Mouzeli: mwdebug: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970
[14:19:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "I spoke with Joanna and this is approved" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[14:21:22] <wikibugs>	 (03PS4) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779)
[14:21:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) https://netbox.wikimedia.org/dcim/devices/1955/ was purchased on 2017-10-01, and has a 4 year warranty, expiring on 2021-10-01.  https://opengear.com/support/contact-tech-support  A support tick...
[14:21:58] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[14:22:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "Spoke with Joanna and this is now approved" [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[14:22:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond)
[14:23:03] <wikibugs>	 (03PS5) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779)
[14:23:09] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Hey, @Ottomata I believe you organized or helped organize the watch party for "Turning the database inside-out". This...
[14:26:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "+1 on premise, -1 for a couple of nits on commit message. Many thanks for this! Now... when can we expect puppetdb to clean all those old " [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond)
[14:28:02] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "LGTM, great work! I'm using this on my workstation already and it works perfectly." [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere)
[14:28:17] <wikibugs>	 (03PS2) 10Effie Mouzeli: mwdebug: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970
[14:28:21] <wikibugs>	 (03PS2) 10Jbond: facter networking: filter k8s interfaces out of the networking fact [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904)
[14:30:06] <wikibugs>	 (03PS2) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950
[14:30:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond)
[14:31:39] <wikibugs>	 (03CR) 10MMandere: [C: 03+2] varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere)
[14:31:48] <wikibugs>	 (03PS9) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729
[14:32:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn)
[14:33:49] <wikibugs>	 (03PS3) 10Effie Mouzeli: mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970
[14:34:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn)
[14:35:43] <wikibugs>	 (03Merged) 10jenkins-bot: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn)
[14:38:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 (owner: 10Effie Mouzeli)
[14:39:08] <wikibugs>	 (03PS1) 10Dzahn: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972
[14:39:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn)
[14:39:55] <wikibugs>	 (03PS2) 10Dzahn: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972
[14:40:56] <wikibugs>	 (03Merged) 10jenkins-bot: mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 (owner: 10Effie Mouzeli)
[14:41:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn)
[14:42:54] <icinga-wm>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:44:05] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn)
[14:46:48] <effie>	 jouncebot: now
[14:46:49] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 13 minute(s)
[14:47:48] <wikibugs>	 (03PS4) 10Hnowlan: postgres: increase number of WAL files retained by master [puppet] - 10https://gerrit.wikimedia.org/r/643717
[14:48:47] <wikibugs>	 (03PS1) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528)
[14:49:16] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30961/console" [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan)
[14:49:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro)
[14:52:36] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:53:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10Ottomata) +1 <3
[14:54:09] <wikibugs>	 (03PS1) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786)
[14:54:21] <wikibugs>	 (03CR) 10David Caro: update celery worker to allow for celery v5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro)
[14:54:30] <wikibugs>	 (03PS2) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528)
[14:55:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro)
[14:58:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:50] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:26] <wikibugs>	 (03CR) 10BryanDavis: [C: 04-1] toolhub: Add helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[15:08:21] <logmsgbot>	 !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[15:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Legoktm) +1 to granting permissions like normal appservers, this seems like an oversight once Parsoid moved to PHP and is now...
[15:13:52] <wikibugs>	 (03PS3) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528)
[15:13:54] <wikibugs>	 (03PS2) 10Dzahn: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538)
[15:14:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[15:14:53] <wikibugs>	 (03PS3) 10Dzahn: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538)
[15:18:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[15:19:38] <wikibugs>	 (03PS8) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[15:20:00] <wikibugs>	 (03PS1) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352)
[15:20:12] <wikibugs>	 (03Abandoned) 10Urbanecm: [Governance wiki] Create new 'editor' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472602 (https://phabricator.wikimedia.org/T205352) (owner: 10Jforrester)
[15:20:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm)
[15:20:30] <wikibugs>	 (03Abandoned) 10Urbanecm: [Governance wiki] Allow sysops to grant and remove 'editor' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472603 (owner: 10Jforrester)
[15:20:31] <wikibugs>	 (03Abandoned) 10Urbanecm: [Governance wiki] Move edit rights from users to 'editor' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472604 (https://phabricator.wikimedia.org/T205350) (owner: 10Jforrester)
[15:21:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[15:21:30] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn)
[15:21:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Parsoid, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10jijiki)
[15:22:56] <wikibugs>	 (03Merged) 10jenkins-bot: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm)
[15:26:31] <wikibugs>	 (03PS1) 10Filippo Giunchedi: clinic-duty: add ops-maintenance calendar link generator [software] - 10https://gerrit.wikimedia.org/r/715980
[15:26:38] <wikibugs>	 (03CR) 10Herron: profile: adapt alertmanager-webhook-logger to ECS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite)
[15:29:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] varnish: add tests for unknown XCPS session reuse [puppet] - 10https://gerrit.wikimedia.org/r/715952 (https://phabricator.wikimedia.org/T271421) (owner: 10Ema)
[15:29:31] <wikibugs>	 (03PS1) 10Cmjohnson: Adding dhcpd updates for ms-be1064-1066 [puppet] - 10https://gerrit.wikimedia.org/r/715981 (https://phabricator.wikimedia.org/T285808)
[15:30:42] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding dhcpd updates for ms-be1064-1066 [puppet] - 10https://gerrit.wikimedia.org/r/715981 (https://phabricator.wikimedia.org/T285808) (owner: 10Cmjohnson)
[15:32:47] <wikibugs>	 (03PS9) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[15:34:54] <wikibugs>	 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) {F34628005}
[15:34:54] <wikibugs>	 (03PS1) 10Cmjohnson: Adding ms-be1064-66 to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/715983 (https://phabricator.wikimedia.org/T285808)
[15:35:00] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:24] <wikibugs>	 (03PS4) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356)
[15:35:39] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding ms-be1064-66 to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/715983 (https://phabricator.wikimedia.org/T285808) (owner: 10Cmjohnson)
[15:35:49] <wikibugs>	 (03CR) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite)
[15:40:54] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10fgiunchedi) >>! In T262668#7323887, @jcrespo wrote: >> I think we should crank concurrency up and see how much read throughput...
[15:41:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1064.eqiad.wmnet ` The log can be found in `...
[15:42:53] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] update celery worker to allow for celery v5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro)
[15:42:59] <wikibugs>	 (03CR) 10BryanDavis: toolhub: Add helmfile.d (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[15:43:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1065.eqiad.wmnet ` The log can be found in `...
[15:44:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1066.eqiad.wmnet ` The log can be found in `...
[15:46:17] <wikibugs>	 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) >>! In T287539#7324387, @Trizek-WMF wrote: > Do we have deployment this week? {T281164} has been created as usual, covering the Train Deployment for the...
[15:46:21] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Thank you @godog, will do, slowly.  On the extreme, a 4x-8x the number of current threads would anyway move the bottle...
[15:46:38] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[15:47:36] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:51:02] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM overall, couple of minor comments" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond)
[15:51:05] <wikibugs>	 (03CR) 10Effie Mouzeli: "I do not speak the language, but +100 for the idea !" [software] - 10https://gerrit.wikimedia.org/r/715980 (owner: 10Filippo Giunchedi)
[15:55:56] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@ff15071]: Fix for cassandra3 loading [analytics/refinery@ff15071]
[15:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:10] <wikibugs>	 (03PS4) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921)
[16:00:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1064.eqiad.wmnet with reason: REIMAGE
[16:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:29] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1065.eqiad.wmnet with reason: REIMAGE
[16:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:45] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1066.eqiad.wmnet with reason: REIMAGE
[16:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:58] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1064.eqiad.wmnet with reason: REIMAGE
[16:03:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:55] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1065.eqiad.wmnet with reason: REIMAGE
[16:04:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:54] <wikibugs>	 (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715985
[16:06:36] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:06:45] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1066.eqiad.wmnet with reason: REIMAGE
[16:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:01] <wikibugs>	 (03PS4) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528)
[16:07:43] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Parsoid, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Dzahn) What Lego said, access should mimic what we do with regular appservers.
[16:08:06] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:08:34] <wikibugs>	 (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715986
[16:09:35] <wikibugs>	 (03CR) 10Ladsgroup: dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[16:09:38] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:10:34] <wikibugs>	 (03PS1) 10Dzahn: add deploment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144)
[16:10:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1064.eqiad.wmnet'] `  and were **ALL** successful.
[16:11:25] <RhinosF1>	 mutante: spelling on the commit title
[16:12:00] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:12:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1065.eqiad.wmnet'] `  and were **ALL** successful.
[16:12:29] <wikibugs>	 (03PS2) 10Dzahn: add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144)
[16:12:32] <mutante>	 RhinosF1: ty! fixed
[16:12:49] <RhinosF1>	 mutante: np
[16:13:01] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn)
[16:13:10] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:13:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1066.eqiad.wmnet'] `  and were **ALL** successful.
[16:14:06] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:14:12] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[16:16:10] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:17:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson)
[16:17:19] <wikibugs>	 (03PS3) 10Dzahn: add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144)
[16:19:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson) @fgiunchedi ms-be1064/65/66 are installed and are ready for you to take over, 1067 is not racked yet until we can space in row D.   We haven't had a response from traff...
[16:21:39] <wikibugs>	 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) Thank you @Legoktm, I updated our public messages accordingly.
[16:22:54] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@ff15071]: Fix for cassandra3 loading [analytics/refinery@ff15071] (duration: 26m 58s)
[16:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:46] <logmsgbot>	 !log mforns@deploy1002 Started deploy [analytics/refinery@ff15071] (thin): Fix for cassandra3 loading THIN [analytics/refinery@ff15071]
[16:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:52] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [analytics/refinery@ff15071] (thin): Fix for cassandra3 loading THIN [analytics/refinery@ff15071] (duration: 00m 06s)
[16:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:06] <wikibugs>	 (03CR) 10Zabe: systemd::timer::job: switch monitoring_enabled default to false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond)
[16:26:14] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn)
[16:26:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) A ticket has been submitted  Your request (#82025) has been received, and is being reviewed by our support staff.  For questions concerning Opengear's Console Server products, please submit...
[16:29:28] <wikibugs>	 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) I made two updates:  * the date * the  fact that the deployment train will run I informed the translators about these changes.
[16:32:10] <wikibugs>	 (03PS2) 10Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958
[16:32:51] <wikibugs>	 (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond)
[16:32:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond)
[16:43:20] <wikibugs>	 (03PS1) 10Legoktm: [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628)
[16:44:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[16:44:48] <wikibugs>	 (03PS2) 10Legoktm: [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628)
[16:46:00] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30966/console" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[16:47:05] <wikibugs>	 (03PS3) 10Legoktm: Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628)
[16:47:48] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30967/console" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[16:49:38] <legoktm>	 meh, not sure what I'm doing wrong
[16:51:32] <wikibugs>	 (03CR) 10Legoktm: "I'm not sure why PCC says the timer is being enabled in codfw, I only added to the eqiad hiera." [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[16:54:05] <dancy>	 legoktm:  the codfw is still affected...  the mwautopull timer is created with 'ensure => absent'.
[16:54:15] <rzl>	 legoktm: the codfw full diff just has the-- drat :)
[16:54:15] <dancy>	 affected, but affected in the desired way
[16:54:17] <rzl>	 what dancy said
[16:54:25] <legoktm>	 oh
[16:54:50] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[16:55:07] <legoktm>	 where do I see that when looking at https://puppet-compiler.wmflabs.org/compiler1002/30967/kubestage2001.codfw.wmnet/index.html
[16:55:25] <dancy>	 "Full Diff" under "Relevant files"
[16:55:37] <dancy>	 (bottom of the page)
[16:55:52] <legoktm>	 ahh, TIL, perfect :D
[16:56:34] <wikibugs>	 (03CR) 10Legoktm: Automatically pull latest MediaWiki image onto staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm)
[17:00:07] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01033 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[17:00:24] <jbond>	 looking
[17:04:13] <wikibugs>	 (03CR) 10Jdlrobson: "Survey is not active from the coverage if I'm reading correctly so don't think we need to backport this." [extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941) (owner: 10Jforrester)
[17:09:07] <wikibugs>	 (03PS1) 10Bstorm: quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568)
[17:13:29] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002869 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[17:22:42] <wikibugs>	 (03PS2) 10Bstorm: quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568)
[17:37:59] <wikibugs>	 (03PS3) 10Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[17:38:01] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[17:38:41] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[17:45:41] <wikibugs>	 (03CR) 10BryanDavis: [C: 04-1] "Missing .fixtures file for mcrouter enabled status which is in turn hiding errors." [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[17:49:15] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:36] <urbanecm>	 jouncebot: next
[17:51:36] <jouncebot>	 In 0 hour(s) and 8 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800)
[17:51:36] <jouncebot>	 In 0 hour(s) and 8 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800)
[17:51:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10KFrancis) @fgiunchedi @dr0ptp4kt I have not been able to find Simone Cuomo on our current contractors list or under their name in Coupa.  Is Simone working as a consultant under a business e...
[17:54:11] <wikibugs>	 (03PS1) 10AOkoth: admin: change to yubikey SSH key [puppet] - 10https://gerrit.wikimedia.org/r/716003 (https://phabricator.wikimedia.org/T288645)
[17:55:47] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "I'm live on a call with Arnold and can confirm this is his new key." [puppet] - 10https://gerrit.wikimedia.org/r/716003 (https://phabricator.wikimedia.org/T288645) (owner: 10AOkoth)
[18:00:05] <jouncebot>	 twentyafterfour and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800).
[18:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800).
[18:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:25] <urbanecm>	 i'll deploy something
[18:02:05] <wikibugs>	 (03PS3) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254)
[18:02:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) a:03Cmjohnson
[18:02:12] <wikibugs>	 10SRE, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson
[18:02:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm)
[18:03:07] <wikibugs>	 (03Merged) 10jenkins-bot: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm)
[18:04:42] <wikibugs>	 (03PS2) 10Urbanecm: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254)
[18:04:50] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm)
[18:05:35] <wikibugs>	 (03Merged) 10jenkins-bot: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm)
[18:05:43] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 94b1cca: Growth features: Enable for newcomers on two wikis (T285254, T287867) (duration: 01m 09s)
[18:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:50] <stashbot>	 T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254
[18:05:50] <stashbot>	 T287867: Deploy Growth features on Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T287867
[18:05:59] <wikibugs>	 (03PS3) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142)
[18:06:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron)
[18:07:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:36] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 27e85b1f228dccb584b4692f5b1b1354b19625b4: nlwiki: Enable link recommendations for all Growth users (T285254) (duration: 01m 06s)
[18:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:07] <wikibugs>	 (03PS4) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142)
[18:08:52] * urbanecm done
[18:08:54] <wikibugs>	 (03PS5) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881)
[18:08:56] <wikibugs>	 (03PS10) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[18:09:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:38] <wikibugs>	 (03PS2) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194
[18:10:03] <wikibugs>	 (03CR) 10Legoktm: Update configuration related to disabling Score functionality (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm)
[18:10:44] <wikibugs>	 (03PS3) 10Legoktm: Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193
[18:10:46] <wikibugs>	 (03PS3) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194
[18:11:16] <urbanecm>	 actually, one more patch
[18:11:45] <legoktm>	 ok, I'll go after you then :)
[18:11:56] <urbanecm>	 thanks
[18:13:06] <wikibugs>	 (03PS2) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786)
[18:13:53] <wikibugs>	 (03PS3) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786)
[18:14:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) (owner: 10Urbanecm)
[18:15:21] <wikibugs>	 (03PS5) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615)
[18:15:51] <wikibugs>	 (03Merged) 10jenkins-bot: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) (owner: 10Urbanecm)
[18:17:28] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fe1ae2e438841a069dc8dadc9a1850b91863c06a: Growth features: Deploy to 100% of newcomers on small wikis (T289786) (duration: 01m 06s)
[18:17:33] <urbanecm>	 done for real
[18:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:36] <urbanecm>	 legoktm: go ahead :)
[18:17:38] <stashbot>	 T289786: Deploy Growth features to 100% of newcomers on any wiki that has less than 500 monthly registrations - https://phabricator.wikimedia.org/T289786
[18:19:15] <wikibugs>	 (03PS6) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881)
[18:19:17] <wikibugs>	 (03PS11) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881)
[18:19:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:58] <legoktm>	 thanks!
[18:20:22] <wikibugs>	 (03CR) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis)
[18:20:27] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193 (owner: 10Legoktm)
[18:21:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:34] <wikibugs>	 (03Merged) 10jenkins-bot: Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193 (owner: 10Legoktm)
[18:23:50] <wikibugs>	 (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[18:25:07] <wikibugs>	 (03PS6) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615)
[18:26:26] <legoktm>	 eh, not working
[18:28:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:02] <wikibugs>	 (03PS1) 10Legoktm: Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821
[18:32:08] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821 (owner: 10Legoktm)
[18:32:55] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821 (owner: 10Legoktm)
[18:35:04] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH)
[18:35:16] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH)
[18:36:05] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH) a:03Papaul
[18:37:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:37:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[18:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:17] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH)
[18:40:30] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH)
[18:40:58] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH)
[18:41:41] <wikibugs>	 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH) a:03Papaul
[18:46:35] <wikibugs>	 (03PS1) 10Ayounsi: remove Damping from cr4-ulsfo:xe-0/1/2 [homer/public] - 10https://gerrit.wikimedia.org/r/716008 (https://phabricator.wikimedia.org/T290188)
[18:47:25] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] remove Damping from cr4-ulsfo:xe-0/1/2 [homer/public] - 10https://gerrit.wikimedia.org/r/716008 (https://phabricator.wikimedia.org/T290188) (owner: 10Ayounsi)
[18:54:29] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10jijiki) >>! In T289657#7309715, @wiki_willy wrote: > Hi @jijiki - hope all is well.  We were wondering if it would be possible to prioritize the decom o...
[18:56:33] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10wiki_willy) Awesome, thanks @jijiki!
[19:00:05] <jouncebot>	 twentyafterfour and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1900).
[19:00:35] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:47] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[19:02:03] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki)
[19:14:04] <wikibugs>	 (03PS8) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493)
[19:14:16] <wikibugs>	 (03PS3) 10Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215)
[19:14:19] <wikibugs>	 (03PS1) 10Ebernhardson: airflow: Compress scheduler logs [puppet] - 10https://gerrit.wikimedia.org/r/716018
[19:20:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1214.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:25:27] <wikibugs>	 (03PS6) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673)
[19:49:26] <twentyafterfour>	 I'm about to deploy wmf.21 to group1, should I be concerned with the replica lag alert? ^
[19:51:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10lmata) Much appreciated, thank you!
[19:51:58] <urbanecm>	 twentyafterfour: it's in eqiad (i.e. unused) and it has been happening for over a week, if not more.
[19:53:02] <urbanecm>	 (but I'm not an SRE, of course, just my 2c)
[19:53:08] <twentyafterfour>	 thanks urbanecm
[19:53:18] <wikibugs>	 (03PS1) 1020after4: group1 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024
[19:53:20] <wikibugs>	 (03CR) 1020after4: [C: 03+2] group1 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024 (owner: 1020after4)
[19:54:28] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.21  refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024 (owner: 1020after4)
[19:55:21] <zabe>	 Is it intended that the wmf.20 blocker task is mentioned? ^
[19:55:52] <wikibugs>	 10Puppet, 10GitLab, 10Infrastructure-Foundations, 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen)
[19:55:57] <twentyafterfour>	 no ... it should mention the wmf.21 blocker task
[19:56:26] <twentyafterfour>	 weird I wonder what went wrong with that 
[19:56:34] <logmsgbot>	 !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.21  refs T281161
[19:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:40] <stashbot>	 T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161
[19:57:41] <twentyafterfour>	 !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.21  refs T281162
[19:57:41] <logmsgbot>	 !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.21  refs T281161 (duration: 01m 06s)
[19:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:44] <stashbot>	 T281162: 1.37.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T281162
[19:57:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:59] <dancy>	 twentyafterfour: yak: automate selection of train blocker ticket in deploy-promote.
[19:59:07] <dancy>	 I always forget to supply it
[19:59:22] <twentyafterfour>	 dancy: I have a tool for that but apparently it's not reliable
[19:59:33] <dancy>	 Let's fix it next week!
[19:59:39] <wikibugs>	 10SRE, 10GitLab, 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10brennen)
[20:00:05] <jouncebot>	 twentyafterfour and dancy: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1900).
[20:00:05] <jouncebot>	 chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T2000)
[20:00:12] <twentyafterfour>	 I'm out next week but my tool is https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/608936
[20:00:21] <dancy>	 ok. I'll check it out.
[20:01:04] <twentyafterfour>	 it's fallible though because it relies on finding the oldest open train blocker task
[20:01:18] <twentyafterfour>	 so if the previous week's task is still open at the time it'll fail, and probably in other ways as well
[20:01:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:42] <dancy>	 Understood.  I have ideas.
[20:02:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:51] <dancy>	 twentyafterfour: For reference, what was the exact command you issued?
[20:04:11] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:10] <twentyafterfour>	 dancy: `export PHABTASK=$(current-deployment-blockers)`
[20:11:35] <icinga-wm>	 RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:14:40] <wikibugs>	 (03PS1) 10RobH: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032
[20:15:20] <wikibugs>	 (03PS2) 10RobH: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032
[20:15:31] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 (owner: 10RobH)
[20:16:26] <wikibugs>	 (03Merged) 10jenkins-bot: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 (owner: 10RobH)
[20:20:15] <wikibugs>	 (03PS2) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009)
[20:26:51] <twentyafterfour>	 group1 appears to be stable ... no new errors in the logs at all 
[20:28:29] <wikibugs>	 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10mewoph)
[20:28:33] <dancy>	 👍🏾
[20:37:03] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH)
[20:37:11] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH)
[20:37:36] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH) a:03Jclark-ctr
[20:42:56] <wikibugs>	 (03PS5) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664)
[20:59:13] <wikibugs>	 (03PS2) 10Legoktm: mediawiki::maintenance: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm)
[20:59:29] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] mediawiki::maintenance: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm)
[21:05:41] <wikibugs>	 (03PS1) 10Zabe: query_service: migrate query-service-gc-log-cleanup cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673)
[21:08:45] <wikibugs>	 (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:09:15] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: remove portgrabber [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah)
[21:13:35] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup)
[21:14:31] <wikibugs>	 (03CR) 10Zabe: "This doesn't seems to be working: https://puppet-compiler.wmflabs.org/compiler1002/892/wdqs2001.codfw.wmnet/change.wdqs2001.codfw.wmnet.er" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:17:39] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] query_service: migrate query-service-gc-log-cleanup cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:20:21] <wikibugs>	 (03PS1) 10Dave Pifke: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165)
[21:21:37] <wikibugs>	 (03PS2) 10Zabe: query_service: migrate query-service-gc-log-cleanup cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673)
[21:22:21] <wikibugs>	 (03PS1) 10RobH: removed sku 403-BCLL by mistake [software] - 10https://gerrit.wikimedia.org/r/716042
[21:22:30] <wikibugs>	 (03PS2) 10RobH: removed sku 403-BCLL by mistake [software] - 10https://gerrit.wikimedia.org/r/716042
[21:23:01] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:23:15] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "This looks like what is needed." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) (owner: 10Majavah)
[21:23:36] <wikibugs>	 (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:25:10] <wikibugs>	 (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/893/wdqs2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe)
[21:36:18] <wikibugs>	 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Service-deployment-requests, and 2 others: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (10Legoktm)
[21:41:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm)
[21:41:47] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm)
[21:59:01] <wikibugs>	 (03PS1) 10Legoktm: Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [labs/private] - 10https://gerrit.wikimedia.org/r/716048 (https://phabricator.wikimedia.org/T289227)
[21:59:04] <wikibugs>	 (03PS1) 10Legoktm: Add k8s users/tokens for apple-search [labs/private] - 10https://gerrit.wikimedia.org/r/716049 (https://phabricator.wikimedia.org/T289224)
[22:03:06] <wikibugs>	 (03PS1) 10Legoktm: Add k8s tokens/users for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 (https://phabricator.wikimedia.org/T289227)
[22:03:09] <wikibugs>	 (03PS1) 10Legoktm: Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224)
[22:04:07] <wikibugs>	 (03PS2) 10Legoktm: Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 (https://phabricator.wikimedia.org/T289227)
[22:04:08] <wikibugs>	 (03PS2) 10Legoktm: Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224)
[22:14:04] <wikibugs>	 (03PS1) 10Bstorm: quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568)
[22:16:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm)
[22:16:19] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s users/tokens for apple-search [labs/private] - 10https://gerrit.wikimedia.org/r/716049 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm)
[22:16:24] <wikibugs>	 (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [labs/private] - 10https://gerrit.wikimedia.org/r/716048 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm)
[22:16:45] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm)
[22:16:47] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm)
[22:20:58] <wikibugs>	 (03PS1) 10Legoktm: admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227)
[22:21:00] <wikibugs>	 (03PS1) 10Legoktm: admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226)
[22:21:02] <wikibugs>	 (03PS1) 10Legoktm: admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224)
[22:24:03] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm)
[22:24:06] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm)
[22:27:10] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm)
[22:27:14] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm)
[22:29:43] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[22:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:57] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[22:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:18] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[22:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:54] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[22:32:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:25] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[22:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:56] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[22:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:25] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:29] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[22:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:22] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm)
[22:39:01] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm)
[22:39:54] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[22:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:56] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[22:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:02] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[22:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:30] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[22:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:00] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[22:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:25] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[22:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:41] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[22:43:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:39] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[22:44:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:47] <wikibugs>	 (03PS1) 10Gergő Tisza: fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868)
[22:50:17] <wikibugs>	 (03PS1) 10Gergő Tisza: fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531)
[22:50:22] <wikibugs>	 (03PS1) 10Legoktm: Add helmfile.d for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226)
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T2300).
[23:00:04] <jouncebot>	 tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:46] <Jdlrobson>	 Here
[23:01:17] <urbanecm>	 Hi Jdlrobson 
[23:01:20] <urbanecm>	 I can deploy today
[23:02:14] <tgr>	 o/
[23:02:30] <tgr>	 mine are fire and forget
[23:02:49] <tgr>	 no testing needed, I mean
[23:03:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868) (owner: 10Gergő Tisza)
[23:03:05] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531) (owner: 10Gergő Tisza)
[23:03:12] <urbanecm>	 ack tgr 
[23:03:34] <urbanecm>	 Jdlrobson: would you mind amending https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/713653 to add it to the extension-list? :-)
[23:03:43] <Jdlrobson>	 oh shoot im sorry i thought it did that
[23:03:45] <Jdlrobson>	 doing that now
[23:04:13] <logmsgbot>	 !log dpifke@deploy1002 Started deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243
[23:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:17] <urbanecm>	 thanks
[23:04:18] <stashbot>	 T281243: Expose CPU benchmark data to Prometheus/Grafana - https://phabricator.wikimedia.org/T281243
[23:04:19] <logmsgbot>	 !log dpifke@deploy1002 Finished deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243 (duration: 00m 06s)
[23:04:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:25] <dpifke>	 ^ above only affects webperfX001
[23:04:40] <urbanecm>	 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704171 only adds a SVG, can't find the commit that uses it
[23:05:02] <wikibugs>	 (PS9) Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493)
[23:05:40] <Jdlrobson>	 urbanecm: you can skip the logo patch it's not ready.
[23:05:48] <urbanecm>	 okay
[23:05:51] <Jdlrobson>	 I was misinformed :)
[23:06:13] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:06:17] <wikibugs>	 (PS4) Urbanecm: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:06:23] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:06:29] <wikibugs>	 (CR) Jdlrobson: [C: -1] "This also needs an update in wmf-config/InitialiseSettings.php to set the width and height etc" [mediawiki-config] - https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: Juan90264)
[23:07:01] <wikibugs>	 SRE, Wikimedia-Site-requests, serviceops, Patch-For-Review, and 3 others: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (Legoktm)
[23:08:11] <wikibugs>	 (Merged) jenkins-bot: Enable WVUI search on Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: Jdlrobson)
[23:08:51] <urbanecm>	 Jdlrobson: please test at mwdebug201
[23:08:58] <urbanecm>	 (the WVUI search i mean)
[23:09:06] <Jdlrobson>	 on it
[23:09:24] <wikibugs>	 (Abandoned) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941) (owner: Jforrester)
[23:09:29] <wikibugs>	 (Abandoned) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715808 (https://phabricator.wikimedia.org/T289941) (owner: Jforrester)
[23:09:49] <Jdlrobson>	 urbanecm: LGTM
[23:09:59] <Jdlrobson>	 please sync
[23:09:59] <urbanecm>	 thanks, syncing
[23:10:23] <wikibugs>	 (CR) Legoktm: [C: +2] Add helmfile.d for shellbox-timeline [deployment-charts] - https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226) (owner: Legoktm)
[23:10:25] <wikibugs>	 (PS10) Urbanecm: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[23:11:51] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bb7d92c48edf48b94fd628e9e0b5fd6682460373: Enable WVUI search on Wikimedia Commons (T287215) (duration: 01m 07s)
[23:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:55] <stashbot>	 T287215: Enable WVUI search on commons  - https://phabricator.wikimedia.org/T287215
[23:11:56] <urbanecm>	 live
[23:12:09] <wikibugs>	 (CR) Urbanecm: [C: +2] "extension has 2 branches, secreviewed, no reason not to enable it in beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[23:12:36] <urbanecm>	 Jdlrobson: just wondering, when do you plan to do it in prod?
[23:12:49] <urbanecm>	 (enable the extension, that is)
[23:12:58] <wikibugs>	 (Merged) jenkins-bot: Enable NearbyPages on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson)
[23:12:58] <Jdlrobson>	 It depends on performance team at this point
[23:13:02] <Jdlrobson>	 but pretty flexible
[23:13:34] <urbanecm>	 so definitely not "later this week"
[23:13:36] <wikibugs>	 (Merged) jenkins-bot: Add helmfile.d for shellbox-timeline [deployment-charts] - https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226) (owner: Legoktm)
[23:14:18] <urbanecm>	 in that case all should be fine
[23:14:55] <Jdlrobson>	 urbanecm: definitely not later this week :)
[23:15:06] <urbanecm>	 good :)
[23:15:19] <logmsgbot>	 !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
[23:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:43] <urbanecm>	 i'm going to sync it to get the variables to prod, beta will self-update soon
[23:16:05] <wikibugs>	 (PS1) MSantos: maps: import script is overwritting log [puppet] - https://gerrit.wikimedia.org/r/716068
[23:16:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] maps: import script is overwritting log [puppet] - https://gerrit.wikimedia.org/r/716068 (owner: MSantos)
[23:16:36] <wikibugs>	 (PS2) MSantos: maps: import script is overwritting log [puppet] - https://gerrit.wikimedia.org/r/716068
[23:16:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:30] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 1/3) (duration: 01m 06s)
[23:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:34] <stashbot>	 T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493
[23:17:38] <Jdlrobson>	 thanks urbanecm 
[23:17:50] <urbanecm>	 np :)
[23:18:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:50] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 2/3) (duration: 01m 06s)
[23:18:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:13] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
[23:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:07] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 3/3) (duration: 01m 05s)
[23:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:20] <urbanecm>	 so, the prod part is done
[23:20:25] <urbanecm>	 Jdlrobson: anything else?
[23:22:00] <Jdlrobson>	 nope that's all.. thanks!
[23:22:06] <urbanecm>	 any time
[23:22:23] <wikibugs>	 (Merged) jenkins-bot: fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868) (owner: Gergő Tisza)
[23:22:38] <urbanecm>	 just in time
[23:24:15] <wikibugs>	 (Merged) jenkins-bot: fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531) (owner: Gergő Tisza)
[23:24:42] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 3c7d4ecc699b7c68467a372686f5514375d2b74f: fixLinkRecommendationData: Allow --db-table in dry-run mode (T283868) (duration: 01m 06s)
[23:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:47] <stashbot>	 T283868: Monitor "no suggestion" rate for Add Link tasks - https://phabricator.wikimedia.org/T283868
[23:25:22] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
[23:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:21] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 0bd65426494d4df981141650211e27e17c98ee0c: fixLinkRecommendationData: stay under 10K search limit (T284531) (duration: 01m 06s)
[23:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:25] <stashbot>	 T284531: Add Link: Work around 10K search result set limit in fixLinkRecommendationData.php - https://phabricator.wikimedia.org/T284531
[23:27:28] <urbanecm>	 tgr: both done
[23:27:32] <urbanecm>	 anything else i can help with?
[23:27:54] <tgr>	 thanks!
[23:28:01] <urbanecm>	 np
[23:30:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:39] <wikibugs>	 SRE, Traffic, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BBlack) #traffic-icebox now exists as a new tag with a process-informative description (click it and read!).  I've bulk (+silent) moved all open #traffic tickets which had no activity for >= 6 months over to...
[23:43:20] <wikibugs>	 (PS2) Bstorm: quarry backup: change the cleanup job to check number of backups [puppet] - https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568)
[23:45:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:57] <wikibugs>	 (CR) Ladsgroup: [C: +1] dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[23:50:01] <wikibugs>	 (CR) Thcipriani: [C: +1] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: Jdlrobson)
[23:50:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[23:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:30] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "Seems right to my not-very-trustworthy eyes" [puppet] - https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: Bstorm)
[23:50:33] <Amir1>	 !log mwscript createAndPromote.php --wiki=test2wiki --sysop --force Ladsgroup
[23:50:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:26] <wikibugs>	 (PS1) Jdlrobson: Fix Wikidata API url [mediawiki-config] - https://gerrit.wikimedia.org/r/716073
[23:52:22] <wikibugs>	 (CR) Bstorm: [C: +2] quarry backup: change the cleanup job to check number of backups [puppet] - https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: Bstorm)