[00:01:28] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)480 ge (W)60 ge 3 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[00:14:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:16:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:23:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:27:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:06] (CR) Gergő Tisza: [C: +1] mediawiki/maintenance/growthexperiments.pp: Add --statsd to updateMenteeData.php [puppet] - https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: Urbanecm)
[00:58:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:03:41] (PS1) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715808 (https://phabricator.wikimedia.org/T289941)
[01:03:51] (PS1) Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941)
[01:06:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:08:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:18:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:31:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:35:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:20] (PS6) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:41:07] (PS7) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:41:48] (PS8) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[01:44:56] (PS9) Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - https://gerrit.wikimedia.org/r/713946 (owner: Ebernhardson)
[02:17:19] (PS1) Krinkle: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013)
[02:45:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:00:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:10:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:11:11] SRE, MediaWiki-Uploading, Traffic, serviceops: Unexpected upload speed to commons - https://phabricator.wikimedia.org/T288481 (Krinkle)
[03:21:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:23:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:33:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:35:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:35:31] (CR) Andrew Bogott: [C: +2] P::toolforge::apt_pinning: bullseye support [puppet] - https://gerrit.wikimedia.org/r/715700 (owner: Majavah)
[03:45:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:46:50] PROBLEM - Disk space on dbprov2001 is CRITICAL: DISK CRITICAL - free space: /srv 286202 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2001&var-datasource=codfw+prometheus/ops
[03:47:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:22] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:05:41] (CR) Andrew Bogott: [C: +2] wmcs-webproxy.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[04:14:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:16:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:00] !log Optimize arwiki.flaggedtemplates T290057
[04:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:05] T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[04:28:19] (PS1) Marostegui: db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/715841 (https://phabricator.wikimedia.org/T288803)
[04:28:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:37:19] (CR) Marostegui: [C: +2] db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/715841 (https://phabricator.wikimedia.org/T288803) (owner: Marostegui)
[04:41:11] !log Optimize idwiki.flaggedtemplates T290057
[04:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:16] T290057: Optimize flaggedtemplates tables in production. - https://phabricator.wikimedia.org/T290057
[04:49:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:51:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:11:24] (PS6) Juan90264: Adding and use wordmark in azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877)
[05:16:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:20:05] (CR) Juan90264: [C: +1] "I add more experienced reviewers to review this change, which finds ONE MONTH in need of a simple review. Could any you could help me? Ple" [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[05:23:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:25:03] !log depool mw2251 mw2255 parse2001 for tests - T280497
[05:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:08] T280497: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497
[06:05:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:07:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:10:44] RECOVERY - Disk space on dbprov2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2001&var-datasource=codfw+prometheus/ops
[06:23:49] (PS1) Elukey: sre.puppet.renew-cert: replace RemoteHosts with Nodeset for icinga [cookbooks] - https://gerrit.wikimedia.org/r/715912
[06:26:08] (CR) Volans: [C: +1] "Good catch!" [cookbooks] - https://gerrit.wikimedia.org/r/715912 (owner: Elukey)
[06:27:25] insta-review!
[06:27:27] :D
[06:27:36] (CR) Elukey: [C: +2] sre.puppet.renew-cert: replace RemoteHosts with Nodeset for icinga [cookbooks] - https://gerrit.wikimedia.org/r/715912 (owner: Elukey)
[06:27:39] you got lucky
[06:27:52] ahahhaha
[06:27:56] thanks :)
[06:28:04] going to run the cookbook for sodium in a bit
[06:28:17] great
[06:28:39] !log elukey@cumin1001 START - Cookbook sre.puppet.renew-cert for sodium.wikimedia.org: Renew puppet certificate - elukey@cumin1001
[06:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sodium.wikimedia.org: Renew puppet certificate - elukey@cumin1001
[06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:32:05] ran puppet on sodium, all good
[06:33:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:35:12] (CR) Jcrespo: "I am ready to deploy, should I wait for +1 from Amir?" [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[06:35:51] elukey: thanks for testing,
[06:36:22] RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[07:05:52] !log pfw NAT and ACLs changes - T290077
[07:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:06] SRE, ops-eqiad, DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (ayounsi) Next step is to open a ticket with the vendor if possible.
[07:15:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:16:58] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:17:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:23:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:56] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:46] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:29:46] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:36:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:45:13] !log deploy Varnish SLO dashboard with grr apply slo_dashboards.jsonnet T289036
[07:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:19] T289036: Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036
[07:46:42] (CR) Jelto: [C: +1] "lgtm and better than the generic "error loading config file" from kubectl" [puppet] - https://gerrit.wikimedia.org/r/715698 (owner: JMeybohm)
[07:51:47] SRE, Traffic, SRE Observability (FY2021/2022-Q1): Use Grizzly for Varnish SLO Grafana dashboard - https://phabricator.wikimedia.org/T289036 (ema) >>! In T289036#7321876, @herron wrote: > Also, I updated the wikitech docs with this information as well as a hint to run 'grr preview' in these cases, whi...
[07:52:34] (CR) JMeybohm: [C: +2] kube_env: Error out of user has no read permission to kubeconfig [puppet] - https://gerrit.wikimedia.org/r/715698 (owner: JMeybohm)
[07:57:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:59:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:03:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:06:36] SRE, Analytics, Analytics-Kanban, Prod-Kubernetes, and 2 others: Move eventgate services to use TLS only - https://phabricator.wikimedia.org/T255871 (JMeybohm) Open→Resolved I see you've deployed all eventgates, thanks! Resolving this
[08:06:42] SRE, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:07:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:07:42] (PS2) JMeybohm: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - https://gerrit.wikimedia.org/r/715454
[08:07:57] (PS3) JMeybohm: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017)
[08:08:33] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (fgiunchedi) >>! In T262668#7322172, @jcrespo wrote: > I made a mistake by an order of magnitude, we have backed up approximatel...
[08:10:52] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 137, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:14:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={rails,webperf_navtiming} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:15:43] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10fgiunchedi) cc @Muehlenhoff and @jbond for input on what the correct action is here, namely to either add the @wikimedia.org email or tweak `cross-validate-accounts` to account for this cond... [08:15:47] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) > I think we should crank concurrency up and see how much read throughput we can get. Maintenance/rebalance is ongoing... [08:16:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:19:40] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [08:22:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:24:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:28:56] (03PS3) 10Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo) [08:29:46] (03PS4) 10Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo) [08:29:53] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [08:30:11] jynus: a little update ^ I think it is good to merge [08:30:46] thank you very much for the update, I was going to send that, but got distracted [08:30:58] (03CR) 10Jcrespo: [C: 03+1] admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo) [08:31:14] sure no worries, I'm processing access requests [08:31:20] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo) [08:31:25] (03PS5) 10Filippo Giunchedi: admin: Update approver of analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/715259 (owner: 10Jcrespo) [08:36:00] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [08:39:32] (03CR) 10Ema: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [08:43:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:46:06] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for 
jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) [08:46:29] (03PS1) 10Jcrespo: dbbackups: Migrate s4 generation from db2097 (stretch) to db2139 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/715919 (https://phabricator.wikimedia.org/T288803) [08:46:34] (03PS1) 10Filippo Giunchedi: admin: add jmando [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) [08:47:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:48:13] (03CR) 10Klausman: [C: 03+1] kubeflow-kfserving-inference: add Secret specs for Swift [deployment-charts] - 10https://gerrit.wikimedia.org/r/715747 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [08:51:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:52:16] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17200154440 and 51509 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:52:24] since access is already approved on task I guess I can just go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715920 ? [08:53:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:54:52] (03CR) 10MMandere: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [08:55:19] (03CR) 10Jcrespo: [C: 03+1] admin: add jmando [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: 10Filippo Giunchedi) [08:55:33] godog: the uid does not match, on wmcs the unix name for 33218 is `jm` instead of `jmando`, and afaik those should match for new accounts [08:56:57] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [08:57:07] majavah: thank you I wasn't aware of this fact, do you know where it is documented? [08:57:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:58:25] godog: https://github.com/wikimedia/puppet/blob/production/modules/admin/README.md#adding-a-new-human-user kind of, here the uid number is the same but shell account name is different [08:59:06] (the wikitech account name is User:Jmando, but shell name is set to `jm`) [08:59:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:00:14] (03CR) 10Marostegui: [C: 03+1] dbbackups: Migrate s4 generation from db2097 (stretch) to db2139 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/715919 (https://phabricator.wikimedia.org/T288803) (owner: 10Jcrespo) [09:00:40] 10SRE, 10Performance-Team: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (10fgiunchedi) [09:02:16] (03CR) 10Ema: [C: 04-1] varnish: Allow SSR=2 on XCPS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:02:24] majavah: ah yes of course, I'll fix it [09:03:43] (03PS1) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) [09:03:47] (03PS2) 10Filippo Giunchedi: admin: add jm [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) [09:08:25] (03PS5) 10Vgutierrez: haproxy: Use systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) [09:11:03] (03PS2) 10Vgutierrez: varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) [09:12:27] (03CR) 10Vgutierrez: varnish: Allow SSR=2 on XCPS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:12:31] (03CR) 10MVernon: "[sorry, confused by gerrit UI, re-adding the two people Review-bot put on]" [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [09:13:19] (03CR) 10Kormat: "Can you also make the equivalent change for mysql@.service? That will take care of multi-instance hosts." 
[software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [09:14:11] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:17:41] (03CR) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (031 comment) [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [09:17:55] (03CR) 10Jelto: [C: 04-1] cxserver: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [09:18:34] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) Thanks. I'd say you can close this one down, thanks for you and your teams support. [09:21:05] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10ayounsi) 05Open→03Resolved a:03cmooney Great news! Out of curiosity, is it possible to know the root cause? 
Thanks [09:21:31] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [09:23:25] !log Drop flaggedrevs_stats and flaggedrevs_stats2 from dewiki T289050 [09:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:32] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050 [09:23:33] (03PS2) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) [09:24:01] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Unable to load en.wikipedia.org from 84.19.61.192/26 - https://phabricator.wikimedia.org/T279503 (10A189605) We're still not aware of the root cause, but it certainly isn't yourselves given some recent testing we've conducted. [09:24:23] (03CR) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (031 comment) [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [09:26:44] (03PS1) 10Filippo Giunchedi: admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) [09:30:11] (03CR) 10Filippo Giunchedi: "LGTM overall, is this going to bounce haproxy on deploy? 
also please attach a PCC run" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:33:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30951/console" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:37:22] (03CR) 10Kormat: "This looks good :) I guess the next step before merging this is to make these exact changes manually to a pontoon host, and check that the" [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [09:37:27] (03CR) 10Vgutierrez: [V: 03+1] haproxy: Use systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [09:37:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:39:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:41:20] 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10ayounsi) [09:46:11] (03CR) 10Ladsgroup: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [09:49:37] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10jbond) @fgiunchedi as they don't have a wikimedia.org email we should move them out of the WMF group and add them to the NDA group. 
As they are a contractor they should have an NDA (cc: @KF... [09:51:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:51:39] (03CR) 10Ema: [C: 03+1] varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:52:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: 10Filippo Giunchedi) [09:53:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:55:48] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) (owner: 10Filippo Giunchedi) [10:03:06] (03PS1) 10Jbond: admin: update approval from String to Array[String] [puppet] - 10https://gerrit.wikimedia.org/r/715931 [10:08:28] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 91637656248 and 1538 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:09:28] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95257356328 and 1599 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:09:34] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95633061768 and 1604 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:17:15] (03CR) 10Jbond: [C: 03+2] puppetdb: block additional facts [puppet] - 10https://gerrit.wikimedia.org/r/715461
(https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:20:58] !log start filtering more puppet facts G:715461 - T263578 [10:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:05] T263578: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 [10:23:29] (03PS1) 10Vgutierrez: cache::haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) [10:25:39] 10SRE, 10Traffic, 10Patch-For-Review: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [10:25:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:52] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.02126 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:27:29] ^^ this is me, will resolve shortly [10:27:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:34:21] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715933 [10:35:07] the navtiming job failure is metrics spam, reported as https://phabricator.wikimedia.org/T290138 [10:35:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:50] (03PS1) 10MVernon: mariadb::misc::db_inventory: use mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/715934 [10:36:30] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001149 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:36:41] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon) [10:38:14] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 227105274464 and 57770 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:38:14] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 184745726840 and 3227 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:38:14] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB 
template1 (host:localhost) 184420266392 and 3221 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:38:14] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 187150286920 and 3275 seconds Hnowlan Hosts require resync https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:39:09] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm) [10:40:12] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:38] (03PS6) 10MVernon: prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) [10:45:10] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [10:49:31] (03PS1) 10Jbond: puppetdb: also add block_devices to blacklisted facts [puppet] - 10https://gerrit.wikimedia.org/r/715937 [10:50:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:52:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:53:46] (03CR) 10Jbond: [C: 03+2] puppetdb: also add block_devices to blacklisted facts [puppet] - 10https://gerrit.wikimedia.org/r/715937 (owner: 10Jbond) [10:58:21] (03PS8) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:04:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:06:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:09:37] (03CR) 10Jbond: [C: 03+2] admin: update approval from String to Array[String] [puppet] - 10https://gerrit.wikimedia.org/r/715931 (owner: 10Jbond) [11:13:46] Hello, could someone puppet merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715723 for me please? It already has a +1 from another member of my team (Growth). Thanks! 
[11:18:20] 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle) [11:19:22] (03PS2) 10Vgutierrez: cache::haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) [11:19:52] !log effie restarted php-fpm on parse2007.codfw.wmnet, ref T290120. [11:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:56] T290120: Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use in XWikimediaDebug.php - https://phabricator.wikimedia.org/T290120 [11:20:08] Krinkle: she has not done that yet though :p [11:20:18] I will ask her to do so on your behaldf [11:20:22] behalf* [11:20:24] oh :P [11:20:53] The graph dropped off. [11:20:56] theoretically Krinkle would be able to do it themselves (via mwdeploy and https://wikitech.wikimedia.org/wiki/Keyholder). Not convenient, I know :). [11:21:07] but I've been fooled by incomplete data for this past 2 minutes [11:21:14] it's corrected itself now [11:21:21] it is restarted now [11:21:28] so we will keep monitoring [11:22:10] this was quite a high level of fatals [11:22:13] * Krinkle looks at alerts [11:23:21] 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Urbanecm) I support this. After all, any deployer already has sufficient access to SSH in via the `mwdeploy` syste...
[11:23:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:24:11] 10SRE-Access-Requests, 10Parsoid, 10Release-Engineering-Team, 10serviceops, 10Performance-Team (Radar): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle) [11:24:25] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) Do we have deployment this week? {T281164} has been created as usual, covering the Train Deployment for the week of September 13th. [11:27:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:29:05] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=20&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200 [11:29:10] 10% of parsoid POSTs were failing [11:29:32] for 10 hours [11:30:07] not sure why those were more affected.. [11:30:08] https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1 [11:30:29] but in terms of overall 5xx, it wasn't a huge spike given the background noise of timeouts and OOMs on parsoid normally [11:31:50] I'm gonna call this an incident and write up a brief report. 
[11:32:06] (03PS1) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 [11:33:06] (03CR) 10jerkins-bot: [V: 04-1] facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [11:33:17] (03PS8) 10Jcrespo: backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [11:33:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:37:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:40:07] (03CR) 10Physikerwelt: "Amazing. After reading https://wikitech.wikimedia.org/wiki/Mathoid#Deployment I understand that this is generated from I838686b494bcfd4b62" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715933 (owner: 10PipelineBot) [11:41:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:43:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:44:28] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:55:10] 10SRE, 10serviceops, 10Datacenter-Switchover: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Dzahn) ACK, alright! [11:57:42] (03PS2) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 [12:09:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [12:09:27] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add jm [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) (owner: 10Filippo Giunchedi) [12:09:34] (03PS3) 10Filippo Giunchedi: admin: add jm [puppet] - 10https://gerrit.wikimedia.org/r/715920 (https://phabricator.wikimedia.org/T289606) [12:18:36] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) (owner: 10Filippo Giunchedi) [12:18:41] (03PS2) 10Filippo Giunchedi: admin: add nforrester [puppet] - 10https://gerrit.wikimedia.org/r/715928 (https://phabricator.wikimedia.org/T289259) [12:20:38] (03PS1) 10Jbond: facter networking: filter k8s interfaces out of the networking fact [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) [12:21:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: 
All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:58] (03CR) 10Ladsgroup: "😄" [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [12:28:01] (03PS1) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 [12:28:05] (03CR) 10Dzahn: "@John should I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [12:28:16] (03CR) 10Jbond: admin: create new sre-admins group to match the ldap group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715731 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [12:29:32] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond) [12:35:54] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup) [12:38:14] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[12:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:41:39] (03Abandoned) 10Phuedx: Disable Page Previews IRC alerts [puppet] - 10https://gerrit.wikimedia.org/r/648237 (owner: 10Phuedx) [12:41:50] !log planet1002 - rm /etc/rawdog/en/feeds/39a7970f.state (corrupt) T289984 [12:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:55] T289984: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 [12:43:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:45:18] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:45:42] 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle) [12:46:11] (03PS1) 10Dzahn: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 [12:46:36] (03PS2) 10Dzahn: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 [12:46:52] (03PS3) 10Ema: varnish: Allow SSR=2 on XCPS [puppet] - 10https://gerrit.wikimedia.org/r/715541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:46:54] (03PS1) 10Ema: varnish: add tests for unknown XCPS session reuse [puppet] - 10https://gerrit.wikimedia.org/r/715952 
(https://phabricator.wikimedia.org/T271421) [12:46:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=webperf_navtiming site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:47:53] !log bounce webperf on webperf2001 - T290138 [12:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:58] T290138: navtiming prometheus scrape timeout and metric spamming - https://phabricator.wikimedia.org/T290138 [12:48:11] (03PS4) 10Ladsgroup: Drop wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) [12:48:17] (03CR) 10Ladsgroup: Drop wikidata alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup) [12:48:34] !log s/webperf/navtiming/ [12:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:50:59] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 (owner: 10Dzahn) [12:53:19] 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Krinkle) [12:53:36] (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-08-31-125449-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715951 (owner: 10Dzahn) [12:57:04] (03CR) 10Michael DiPietro: [C: 03+2] update quarry systemd and branch [puppet] - 10https://gerrit.wikimedia.org/r/714640 (owner: 10Michael DiPietro) [12:58:22] 10SRE, 10Observability-Metrics, 10observability, 10Graphite: grafana access control - https://phabricator.wikimedia.org/T108546 (10Aklapper) [12:59:21] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30957/console" [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup) [12:59:56] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Drop wikidata alerts [puppet] - 10https://gerrit.wikimedia.org/r/715772 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup) [13:01:19] (03PS1) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) [13:01:22] (03PS1) 10Urbanecm: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) [13:01:41] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[13:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:25] !log planet1002 - temp removing feed from ad.huikeshoven - seems to cause corrupt state file (T289984) [13:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:29] T289984: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 [13:05:42] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:34] PROBLEM - Check systemd state on ores2006 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:29] (03PS1) 10Urbanecm: dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420) [13:07:38] PROBLEM - Check systemd state on ores2002 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:08] (03CR) 10Krinkle: [C: 03+2] resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle) [13:10:12] PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:56] PROBLEM - Check systemd state on ores2009 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:58] PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The following units failed: 
celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:24] (03CR) 10Hashar: [C: 03+1] "Good, I am guessing it will be correct :)" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:13:50] (03CR) 10Volans: [C: 03+1] "LGTM, optional addition inline" [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:14:05] (03PS1) 10Krinkle: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013) [13:14:12] (03CR) 10Krinkle: [C: 03+2] resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle) [13:14:54] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:42] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:42] PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:35] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[13:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:08] PROBLEM - ores_workers_running on ores2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:17:41] (03PS2) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) [13:17:43] (03PS1) 10Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 [13:18:18] PROBLEM - ores_workers_running on ores1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:18:42] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:54] PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:24] (03CR) 10jerkins-bot: [V: 04-1] admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond) [13:19:49] mmmm weird, checking ores [13:21:24] PROBLEM - ores_workers_running on ores1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:21:44] so on ores2001 celery seems to have gone through a stop/start, and then celery doesn't start anymore due to a mismatch in parameters [13:21:46] PROBLEM - ores_workers_running on ores2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery 
https://wikitech.wikimedia.org/wiki/ORES [13:22:04] PROBLEM - Check systemd state on ores1005 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:23] yeah the unit changed [13:22:42] PROBLEM - Check systemd state on ores1009 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:01] (03PS3) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) [13:24:40] PROBLEM - ores_workers_running on ores1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:24:42] I have disabled puppet on ores-codfw, some workers are up, the aim is to save those [13:25:02] PROBLEM - ores_workers_running on ores2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:25:11] from the logs the unit changed after https://gerrit.wikimedia.org/r/c/operations/puppet/+/715772, or better while applying it [13:25:17] but it seems completely unrelated [13:25:42] PROBLEM - ores_workers_running on ores2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:25:52] PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:26:07] ah no https://gerrit.wikimedia.org/r/c/operations/puppet/+/714640/3/modules/celery/templates/initscripts/celery.systemd.erb is the issue [13:26:26] PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:26:43] mdipietro: o/ [13:26:50] PROBLEM - Check systemd state on ores1003 is CRITICAL: CRITICAL - degraded: 
The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:05] are you around? https://gerrit.wikimedia.org/r/c/operations/puppet/+/714640 is causing an outage for ORES [13:27:16] the parameters seem not ok [13:27:31] in the logs I see Sep 01 13:12:29 ores2001 celery-ores-worker[33774]: usage: celery [options] [13:27:46] PROBLEM - ores_workers_running on ores1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:27:53] What's an ores worker? [13:28:04] PROBLEM - Check systemd state on ores2004 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:18] it is our ML serving infrastructure, it runs uwsgi + celery [13:28:19] on stretch [13:28:35] (03PS1) 10Dzahn: miscweb: set a global ServerName to suppress log warnings [container/miscweb] - 10https://gerrit.wikimedia.org/r/715959 [13:28:41] elukey: that patch says it breaks stretch [13:28:47] Oh I think I see that's used by more than quarry [13:28:49] Let's revert [13:28:56] RhinosF1: yes :) [13:29:00] mdipietro: thanks :) [13:29:20] (03PS1) 10Michael DiPietro: Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819 [13:29:32] PROBLEM - ores_workers_running on ores1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:29:44] mdipietro: you might want to check in future that puppet code you're touching isn't used by other stuff [13:29:59] (03Merged) 10jenkins-bot: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/715810 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle) [13:30:04] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:39] first time I've run a revert, will it need the puppet-merge step? I'm not seeing the revert there [13:31:08] you need to merge the revert in gerrit first, like any other change [13:31:14] You need to submit on gerrit first [13:31:35] mdipietro: yes please +2, puppet-merge [13:31:52] (03CR) 10Elukey: [C: 03+1] Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819 (owner: 10Michael DiPietro) [13:32:04] (03CR) 10Michael DiPietro: [C: 03+2] Revert "update quarry systemd and branch" [puppet] - 10https://gerrit.wikimedia.org/r/715819 (owner: 10Michael DiPietro) [13:33:23] Ok it's reverted puppet-merge run [13:33:39] ack perfect :) [13:33:49] running puppet on ores to see if it recovers [13:33:50] * Krinkle tests on mwdebug2002 [13:34:17] (03Merged) 10jenkins-bot: resourceloader: Fix prepending of OOUI theme skinStyles [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715818 (https://phabricator.wikimedia.org/T290013) (owner: 10Krinkle) [13:35:22] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:45] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline. Not sure how thoroughly we need to test it before merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [13:35:56] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:10] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:36:12] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:14] PROBLEM - ores_workers_running on ores2004 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:36:16] Where is gerritbot [13:36:20] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:20] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:30] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:36:42] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:42] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:42] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:44] majavah: do you know how to kick gerrit bot? 
[13:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:46] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:46] PROBLEM - ores_workers_running on ores2002 is CRITICAL: PROCS CRITICAL: 6 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:37:00] RECOVERY - ores_workers_running on ores1004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:37:04] RhinosF1: which part of it needs kicking? [13:37:24] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:31] majavah: it's not online [13:37:36] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:36] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:36] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:37] RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:37:45] #wikimedia-dev is silent and so is here [13:37:52] PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 2 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:38:04] RECOVERY - ores_workers_running on ores2003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:38:12] RECOVERY - ores_workers_running on ores2002 is OK: PROCS OK: 91 processes with command name celery 
https://wikitech.wikimedia.org/wiki/ORES [13:38:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:26] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:38:28] what do you mean? wikibugs does both gerrit and phab and it seems to be online and sending things [13:38:33] (03PS2) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) [13:38:49] mdipietro: ok I think we are good! [13:38:58] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:38:58] RECOVERY - ores_workers_running on ores1002 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:39:01] 👍 [13:39:04] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:14] RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:39:18] RECOVERY - ores_workers_running on ores2009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:39:50] RECOVERY - ores_workers_running on ores1009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:40:24] majavah: oh I remember [13:40:24] RECOVERY - ores_workers_running on ores1008 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:40:44] majavah: I put it on ignore to make finding a message easier earlier [13:40:52] I did not remove it [13:40:52] 
RECOVERY - ores_workers_running on ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:42:24] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:47] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10fgiunchedi) @JMando access has been set up, please confirm the following: * SSH access is working * the kerberos initial password (sent via email) has been changed thank you! [13:43:10] (03CR) 10Jbond: [V: 03+1 C: 03+1] "Another thing to consider is that we will also need to add new group to pws as without the management password the reimage script is not t" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [13:43:24] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10fgiunchedi) @NForrester access has been set up, please confirm the following: * SSH access is working * the kerberos initial... [13:44:07] (03PS1) 10Ladsgroup: Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) [13:44:51] (03PS2) 10Ladsgroup: Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) [13:45:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [13:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:23] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.21/includes/resourceloader: Id7c258841d7816 (duration: 01m 49s) [13:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:31] (03CR) 10Filippo Giunchedi: rsync::quickdatacopy: Allow having multiple destination hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [13:47:52] RECOVERY - ores_workers_running on ores2006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [13:48:27] (03PS1) 10Dzahn: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 [13:48:35] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: adapt alertmanager-webhook-logger to ECS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [13:48:35] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.20/includes/resourceloader: Id7c258841d7816 (duration: 01m 06s) [13:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:12] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:06] (03PS1) 10Urbanecm: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) [13:51:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [13:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] (03PS3) 10Jbond: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 [13:51:51] (03CR) 10Jbond: "fixed" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [13:52:14] (03CR) 10jerkins-bot: [V: 04-1] [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm) [13:52:18] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:20] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [13:53:32] (03CR) 10Filippo Giunchedi: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [13:53:40] (03PS1) 10Urbanecm: [beta] Add foundation.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/715966 (https://phabricator.wikimedia.org/T290164) [13:54:57] (03PS2) 10Urbanecm: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) [13:55:08] (03CR) 
10Dzahn: [C: 03+2] [beta] Add foundation.wikimedia.beta.wmflabs.org to beta sites [puppet] - 10https://gerrit.wikimedia.org/r/715966 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm) [13:55:26] mutante: that was quick, thanks! Was just going to ping you tbh :D [13:55:34] hehe, I could feel [13:55:37] (03CR) 10Jbond: facter networking: filter k8s interfaces out of the networking fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:55:51] merged on master [13:56:04] (03PS1) 10Jgreen: add a/ptr records for payments-staging.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/715968 (https://phabricator.wikimedia.org/T289869) [13:56:07] thanks. I'll run puppet on the beta hosts. [13:58:35] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30958/console" [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) (owner: 10Ladsgroup) [13:58:36] jouncebot: now [13:58:37] No deployments scheduled for the next 4 hour(s) and 1 minute(s) [13:58:39] jouncebot: next [13:58:39] In 4 hour(s) and 1 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800) [13:58:39] In 4 hour(s) and 1 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800) [13:59:02] (03CR) 10Jgreen: [C: 03+2] add a/ptr records for payments-staging.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/715968 (https://phabricator.wikimedia.org/T289869) (owner: 10Jgreen) [13:59:41] (03CR) 10Volans: "some first comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond) [14:00:05] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] Clean up absented files and unused configs [puppet] - 10https://gerrit.wikimedia.org/r/715961 (https://phabricator.wikimedia.org/T290080) 
(owner: 10Ladsgroup) [14:00:34] (03CR) 10Urbanecm: [C: 03+2] [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm) [14:01:17] (03Merged) 10jenkins-bot: [beta] Create foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715965 (https://phabricator.wikimedia.org/T290164) (owner: 10Urbanecm) [14:01:26] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [14:04:34] !log move simone-this-dot from wmf to nda ldap group - T289783 [14:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:38] T289783: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 [14:04:52] godog: :) was wondering about that [14:05:04] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10fgiunchedi) >>! In T289783#7324124, @jbond wrote: > @fgiunchedi as they don't have a wikimedia.org email we should move them out of the WMF group and add them to the NDA group. As the yare... [14:05:09] mutante: yeah should be fine now [14:05:14] cool, thanks [14:07:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:34] (03CR) 10Dzahn: [C: 03+2] comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn) [14:09:39] (03PS2) 10Dzahn: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 [14:12:00] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:14:09] (03CR) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [14:14:36] godog: FYI icinga is not happy because Service notification command 'notify-service-by-irc-wikidata' specified for contact 'irc-wikidata' [14:14:43] I guess related to the removal of the related stuff [14:14:52] same for notify-host-by-irc-wikidata [14:16:08] I'll take a look [14:16:35] yea, the contact uses that command but it's gone [14:18:16] (03PS1) 10Effie Mouzeli: mwdebug: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 [14:19:28] (03CR) 10Jbond: [V: 03+1 C: 03+1] "i with joanna and this is approved" [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [14:21:22] (03PS4) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) [14:21:43] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) https://netbox.wikimedia.org/dcim/devices/1955/ was purchased on 2017-10-01, and has a 4 year warranty, expiring on 2021-10-01. https://opengear.com/support/contact-tech-support A support tick... 
[14:21:58] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:22:27] (03CR) 10Jbond: [C: 03+1] "Spoke with Joanna and this is now approved" [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [14:22:48] (03CR) 10Jbond: [C: 03+2] admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [14:23:03] (03PS5) 10Jbond: admin: add sre-admins to the always group [puppet] - 10https://gerrit.wikimedia.org/r/715733 (https://phabricator.wikimedia.org/T289779) [14:23:09] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Hey, @Ottomata I believe you organized or helped organize the watch party for "Turning the database inside-out". This... [14:26:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "+1 on premise, -1 for a couple of nits on commit message. Many thanks for this! Now... when can we expect puppetdb to clean all those old " [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [14:28:02] (03CR) 10Ema: [C: 03+1] "LGTM, great work! I'm using this on my workstation already and it works perfectly." 
[puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [14:28:17] (03PS2) 10Effie Mouzeli: mwdebug: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 [14:28:21] (03PS2) 10Jbond: facter networking: filter k8s interfaces out of the networking fact [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) [14:30:06] (03PS2) 10Jbond: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 [14:30:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/715950 (owner: 10Jbond) [14:31:39] (03CR) 10MMandere: [C: 03+2] varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [14:31:48] (03PS9) 10Dzahn: admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 [14:32:30] (03CR) 10Dzahn: [C: 03+2] admin: create a group to run the wmf-auto-reimage commands [puppet] - 10https://gerrit.wikimedia.org/r/715729 (owner: 10Dzahn) [14:33:49] (03PS3) 10Effie Mouzeli: mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 [14:34:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn) [14:35:43] (03Merged) 10jenkins-bot: comment out proto redirect rewrite rules [container/miscweb] - 10https://gerrit.wikimedia.org/r/715963 (owner: 10Dzahn) [14:38:26] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 (owner: 10Effie Mouzeli) [14:39:08] (03PS1) 10Dzahn: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 
[14:39:19] (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn) [14:39:55] (03PS2) 10Dzahn: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 [14:40:56] (03Merged) 10jenkins-bot: mwdebug: increase number of replicas for benchmarking [deployment-charts] - 10https://gerrit.wikimedia.org/r/715970 (owner: 10Effie Mouzeli) [14:41:23] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn) [14:42:54] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:44:05] (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715972 (owner: 10Dzahn) [14:46:48] jouncebot: now [14:46:49] No deployments scheduled for the next 3 hour(s) and 13 minute(s) [14:47:48] (03PS4) 10Hnowlan: postgres: increase number of WAL files retained by master [puppet] - 10https://gerrit.wikimedia.org/r/643717 [14:48:47] (03PS1) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [14:49:16] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30961/console" [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan) [14:49:29] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [14:52:36] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command 
name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:53:04] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:53] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10Ottomata) +1 <3 [14:54:09] (03PS1) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) [14:54:21] (03CR) 10David Caro: update celery worker to allow for celery v5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [14:54:30] (03PS2) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [14:55:11] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [14:58:20] RECOVERY - Hadoop NodeManager on an-worker1096 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:50] RECOVERY - Check systemd state on an-worker1096 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:26] (03CR) 10BryanDavis: [C: 04-1] toolhub: Add helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 
(https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [15:08:21] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [15:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:56] 10SRE, 10SRE-Access-Requests, 10Parsoid, 10serviceops, 10Sustainability (Incident Followup): Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Legoktm) +1 to granting permissions like normal appservers, this seems like an oversight once Parsoid moved to PHP and is now... [15:13:52] (03PS3) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [15:13:54] (03PS2) 10Dzahn: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) [15:14:10] (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:14:53] (03PS3) 10Dzahn: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) [15:18:47] (03CR) 10Dzahn: [C: 03+2] miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:19:38] (03PS8) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [15:20:00] (03PS1) 10Urbanecm: foundationwiki: Create editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715979 (https://phabricator.wikimedia.org/T205352) [15:20:12] (03Abandoned) 10Urbanecm: 
[Governance wiki] Create new 'editor' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472602 (https://phabricator.wikimedia.org/T205352) (owner: 10Jforrester) [15:20:14] (03CR) 10JMeybohm: [C: 03+2] Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm) [15:20:30] (03Abandoned) 10Urbanecm: [Governance wiki] Allow sysops to grant and remove 'editor' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472603 (owner: 10Jforrester) [15:20:31] (03Abandoned) 10Urbanecm: [Governance wiki] Move edit rights from users to 'editor' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472604 (https://phabricator.wikimedia.org/T205350) (owner: 10Jforrester) [15:21:00] (03CR) 10jerkins-bot: [V: 04-1] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [15:21:30] (03Merged) 10jenkins-bot: miscweb: bump production version to 2021-09-01-143556-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/715237 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:21:48] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Parsoid, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10jijiki) [15:22:56] (03Merged) 10jenkins-bot: Rakefile: Fix parsing of envoy config with empty resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/715454 (owner: 10JMeybohm) [15:26:31] (03PS1) 10Filippo Giunchedi: clinic-duty: add ops-maintenance calendar link generator [software] - 10https://gerrit.wikimedia.org/r/715980 [15:26:38] (03CR) 10Herron: profile: adapt alertmanager-webhook-logger to ECS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [15:29:23] (03CR) 10Vgutierrez: [C: 03+1] varnish: add tests for unknown XCPS 
session reuse [puppet] - 10https://gerrit.wikimedia.org/r/715952 (https://phabricator.wikimedia.org/T271421) (owner: 10Ema) [15:29:31] (03PS1) 10Cmjohnson: Adding dhcpd updates for ms-be1064-1066 [puppet] - 10https://gerrit.wikimedia.org/r/715981 (https://phabricator.wikimedia.org/T285808) [15:30:42] (03CR) 10Cmjohnson: [C: 03+2] Adding dhcpd updates for ms-be1064-1066 [puppet] - 10https://gerrit.wikimedia.org/r/715981 (https://phabricator.wikimedia.org/T285808) (owner: 10Cmjohnson) [15:32:47] (03PS9) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [15:34:54] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) {F34628005} [15:34:54] (03PS1) 10Cmjohnson: Adding ms-be1064-66 to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/715983 (https://phabricator.wikimedia.org/T285808) [15:35:00] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:24] (03PS4) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) [15:35:39] (03CR) 10Cmjohnson: [C: 03+2] Adding ms-be1064-66 to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/715983 (https://phabricator.wikimedia.org/T285808) (owner: 10Cmjohnson) [15:35:49] (03CR) 10Cwhite: profile: adapt alertmanager-webhook-logger to ECS (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [15:40:54] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10fgiunchedi) >>! 
In T262668#7323887, @jcrespo wrote: >> I think we should crank concurrency up and see how much read throughput... [15:41:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1064.eqiad.wmnet ` The log can be found in `... [15:42:53] (03CR) 10David Caro: [C: 03+1] update celery worker to allow for celery v5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [15:42:59] (03CR) 10BryanDavis: toolhub: Add helmfile.d (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [15:43:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1065.eqiad.wmnet ` The log can be found in `... [15:44:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` ms-be1066.eqiad.wmnet ` The log can be found in `... [15:46:17] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Legoktm) >>! In T287539#7324387, @Trizek-WMF wrote: > Do we have deployment this week? {T281164} has been created as usual, covering the Train Deployment for the... 
[15:46:21] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Thank you @godog, will do, slowly. On the extreme, a 4x-8x the number of current threads would anyway move the bottle... [15:46:38] (03CR) 10Cwhite: [C: 03+1] facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [15:47:36] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:51:02] (03CR) 10Herron: [C: 03+1] "LGTM overall, couple of minor comments" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [15:51:05] (03CR) 10Effie Mouzeli: "I do not speak the language, but +100 for the idea !" [software] - 10https://gerrit.wikimedia.org/r/715980 (owner: 10Filippo Giunchedi) [15:55:56] !log mforns@deploy1002 Started deploy [analytics/refinery@ff15071]: Fix for cassandra3 loading [analytics/refinery@ff15071] [15:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:10] (03PS4) 10Ladsgroup: Set permission of creating short url to everyone everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715492 (https://phabricator.wikimedia.org/T267921) [16:00:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1064.eqiad.wmnet with reason: REIMAGE [16:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1065.eqiad.wmnet with reason: REIMAGE [16:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1066.eqiad.wmnet with reason: REIMAGE [16:01:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:58] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1064.eqiad.wmnet with reason: REIMAGE [16:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1065.eqiad.wmnet with reason: REIMAGE [16:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:54] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715985 [16:06:36] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:06:45] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1066.eqiad.wmnet with reason: REIMAGE [16:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:01] (03PS4) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [16:07:43] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Parsoid, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Dzahn) What Lego said, access should mimic what we do with regular appservers.
[16:08:06] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:08:34] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/715986 [16:09:35] (03CR) 10Ladsgroup: dumps: migrate cron of dumps-exception-checker to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:09:38] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:10:34] (03PS1) 10Dzahn: add deploment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) [16:10:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1064.eqiad.wmnet'] ` and were **ALL** successful. [16:11:25] mutante: spelling on the commit title [16:12:00] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:12:27] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1065.eqiad.wmnet'] ` and were **ALL** successful. 
[16:12:29] (03PS2) 10Dzahn: add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) [16:12:32] RhinosF1: ty! fixed [16:12:49] mutante: np [16:13:01] (03CR) 10RhinosF1: [C: 03+1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn) [16:13:10] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:13:52] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ms-be1066.eqiad.wmnet'] ` and were **ALL** successful. [16:14:06] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:14:12] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:16:10] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:17:04] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson) [16:17:19] (03PS3) 10Dzahn: add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) [16:19:13] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson) @fgiunchedi ms-be1064/65/66 are installed and are ready for you to take over, 1067 is not racked yet until we can space in row D. We haven't had a response from traff... [16:21:39] 10SRE, 10Datacenter-Switchover, 10User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Trizek-WMF) Thank you @Legoktm, I updated our public messages accordingly. 
[16:22:54] !log mforns@deploy1002 Finished deploy [analytics/refinery@ff15071]: Fix for cassandra3 loading [analytics/refinery@ff15071] (duration: 26m 58s) [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:46] !log mforns@deploy1002 Started deploy [analytics/refinery@ff15071] (thin): Fix for cassandra3 loading THIN [analytics/refinery@ff15071] [16:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:52] !log mforns@deploy1002 Finished deploy [analytics/refinery@ff15071] (thin): Fix for cassandra3 loading THIN [analytics/refinery@ff15071] (duration: 00m 06s) [16:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:06] (03CR) 10Zabe: systemd::timer::job: switch monitoring_enabled default to false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [16:26:14] (03CR) 10Legoktm: [C: 03+1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn) [16:26:23] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) A ticket has been submitted Your request (#82025) has been received, and is being reviewed by our support staff. For questions concerning Opengear's Console Server products, please submit... [16:29:28] 10SRE, 10CommRel-Specialists-Support (Jul-Sep-2021), 10Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (10Trizek-WMF) I made two updates: * the date * the fact that the deployment train will run I informed the translators about these changes. 
[16:32:10] (03PS2) 10Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 [16:32:51] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond) [16:32:59] (03CR) 10jerkins-bot: [V: 04-1] admin: utils add helper script for dealing with data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/715958 (owner: 10Jbond) [16:43:20] (03PS1) 10Legoktm: [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) [16:44:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:44:48] (03PS2) 10Legoktm: [WIP] Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) [16:46:00] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30966/console" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:47:05] (03PS3) 10Legoktm: Automatically pull latest MediaWiki image onto staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) [16:47:48] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30967/console" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:49:38] meh, not sure what I'm doing wrong [16:51:32] (03CR) 10Legoktm: "I'm not sure why PCC says the timer is being enabled in codfw, I only added to the eqiad hiera." 
[puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:54:05] legoktm: the codfw is still affected... the mwautopull timer is created with 'ensure => absent'. [16:54:15] legoktm: the codfw full diff just has the-- drat :) [16:54:15] affected, but affected in the desired way [16:54:17] what dancy said [16:54:25] oh [16:54:50] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:55:07] where do I see that when looking at https://puppet-compiler.wmflabs.org/compiler1002/30967/kubestage2001.codfw.wmnet/index.html [16:55:25] "Full Diff" under "Relevant files" [16:55:37] (bottom of the page) [16:55:52] ahh, TIL, perfect :D [16:56:34] (03CR) 10Legoktm: Automatically pull latest MediaWiki image onto staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [17:00:07] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01033 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:00:24] pplooking [17:04:13] (03CR) 10Jdlrobson: "Survey is not active from the coverage if I'm reading correctly so don't think we need to backport this." 
[extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941) (owner: 10Jforrester) [17:09:07] (03PS1) 10Bstorm: quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) [17:13:29] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002869 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:22:42] (03PS2) 10Bstorm: quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) [17:37:59] (03PS3) 10Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) [17:38:01] (03PS1) 10Vgutierrez: haproxy: Allow configuring TLS options [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) [17:38:41] 10SRE, 10Traffic, 10Patch-For-Review: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [17:45:41] (03CR) 10BryanDavis: [C: 04-1] "Missing .fixtures file for mcrouter enabled status which is in turn hiding errors." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [17:49:15] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:36] jouncebot: next [17:51:36] In 0 hour(s) and 8 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800) [17:51:36] In 0 hour(s) and 8 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800) [17:51:55] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10KFrancis) @fgiunchedi @dr0ptp4kt I have not been able to find Simone Cuomo on our current contractors list or under their name in Coupa. Is Simone working as a consultant under a business e... [17:54:11] (03PS1) 10AOkoth: admin: change to yubikey SSH key [puppet] - 10https://gerrit.wikimedia.org/r/716003 (https://phabricator.wikimedia.org/T288645) [17:55:47] (03CR) 10RLazarus: [C: 03+2] "I'm live on a call with Arnold and can confirm this is his new key." [puppet] - 10https://gerrit.wikimedia.org/r/716003 (https://phabricator.wikimedia.org/T288645) (owner: 10AOkoth) [18:00:05] twentyafterfour and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:25] i'll deploy something [18:02:05] (03PS3) 10Urbanecm: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) [18:02:10] 10SRE, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) a:03Cmjohnson [18:02:12] 10SRE, 10ops-eqiad: eqiad: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268750 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [18:02:18] (03CR) 10Urbanecm: [C: 03+2] Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm) [18:03:07] (03Merged) 10jenkins-bot: Growth features: Enable for newcomers on two wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715955 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm) [18:04:42] (03PS2) 10Urbanecm: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) [18:04:50] (03CR) 10Urbanecm: [C: 03+2] nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm) [18:05:35] (03Merged) 10jenkins-bot: nlwiki: Enable link recommendations for all Growth users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715956 (https://phabricator.wikimedia.org/T285254) (owner: 10Urbanecm) [18:05:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 94b1cca: Growth features: Enable for newcomers on two wikis (T285254, T287867) (duration: 01m 09s) [18:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:50] T285254: Deploy Growth features on Dutch Wikipedia - https://phabricator.wikimedia.org/T285254 [18:05:50] T287867: Deploy Growth features on 
Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T287867 [18:05:59] (03PS3) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [18:06:52] (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [18:07:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 27e85b1f228dccb584b4692f5b1b1354b19625b4: nlwiki: Enable link recommendations for all Growth users (T285254) (duration: 01m 06s) [18:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:07] (03PS4) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [18:08:52] * urbanecm done [18:08:54] (03PS5) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [18:08:56] (03PS10) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [18:09:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:38] (03PS2) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 [18:10:03] (03CR) 10Legoktm: Update configuration related to disabling Score functionality (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 (owner: 10Legoktm) [18:10:44] (03PS3) 10Legoktm: Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193 [18:10:46] (03PS3) 10Legoktm: Update configuration related to disabling Score functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715194 [18:11:16] actually, one more patch [18:11:45] ok, I'll go after you then :) [18:11:56] thanks [18:13:06] (03PS2) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) [18:13:53] (03PS3) 10Urbanecm: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) [18:14:09] (03CR) 10Urbanecm: [C: 03+2] Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) (owner: 10Urbanecm) [18:15:21] (03PS5) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [18:15:51] (03Merged) 10jenkins-bot: Growth features: Deploy to 100% of newcomers on small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715975 (https://phabricator.wikimedia.org/T289786) (owner: 10Urbanecm) [18:17:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fe1ae2e438841a069dc8dadc9a1850b91863c06a: Growth features: Deploy to 100% of newcomers on small 
wikis (T289786) (duration: 01m 06s) [18:17:33] done for real [18:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:36] legoktm: go ahead :) [18:17:38] T289786: Deploy Growth features to 100% of newcomers on any wiki that has less than 500 monthly registrations - https://phabricator.wikimedia.org/T289786 [18:19:15] (03PS6) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) [18:19:17] (03PS11) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [18:19:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:58] thanks! [18:20:22] (03CR) 10BryanDavis: toolhub: Add mcrouter sidecar for memcached access (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [18:20:27] (03CR) 10Legoktm: [C: 03+2] Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193 (owner: 10Legoktm) [18:21:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:34] (03Merged) 10jenkins-bot: Don't set default $wgShellboxUrls to Score [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715193 (owner: 10Legoktm) [18:23:50] (03CR) 10Herron: thanos: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [18:25:07] (03PS6) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [18:26:26] eh, not working [18:28:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:02] (03PS1) 10Legoktm: Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821 [18:32:08] (03CR) 10Legoktm: [C: 03+2] Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821 (owner: 10Legoktm) [18:32:55] (03Merged) 10jenkins-bot: Revert "Don't set default $wgShellboxUrls to Score" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715821 (owner: 10Legoktm) [18:35:04] 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH) [18:35:16] 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH) [18:36:05] 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install 
thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10RobH) a:03Papaul [18:37:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:17] 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH) [18:40:30] 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH) [18:40:58] 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH) [18:41:41] 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10RobH) a:03Papaul [18:46:35] (03PS1) 10Ayounsi: remove Damping from cr4-ulsfo:xe-0/1/2 [homer/public] - 10https://gerrit.wikimedia.org/r/716008 (https://phabricator.wikimedia.org/T290188) [18:47:25] (03CR) 10Ayounsi: [C: 03+2] remove Damping from cr4-ulsfo:xe-0/1/2 [homer/public] - 10https://gerrit.wikimedia.org/r/716008 (https://phabricator.wikimedia.org/T290188) (owner: 10Ayounsi) [18:54:29] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10jijiki) >>! In T289657#7309715, @wiki_willy wrote: > Hi @jijiki - hope all is well. We were wondering if it would be possible to prioritize the decom o... 
[18:56:33] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10wiki_willy) Awesome, thanks @jijiki! [19:00:05] twentyafterfour and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1900). [19:00:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:47] (03CR) 10RLazarus: [C: 03+1] thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [19:02:03] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [19:14:04] (03PS8) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [19:14:16] (03PS3) 10Jdlrobson: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) [19:14:19] (03PS1) 10Ebernhardson: airflow: Compress scheduler logs [puppet] - 10https://gerrit.wikimedia.org/r/716018 [19:20:31] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1214.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:25:27] (03PS6) 10Zabe: dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) [19:49:26] I'm about to deploy wmf.21 to group1, should I be concerned with the replica lag alert? 
^ [19:51:37] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10lmata) Much appreciated, thank you! [19:51:58] twentyafterfour: it's in eqiad (ie. unused) and it is happening for over a week if not more. [19:53:02] (but I'm not a SRE, of course, just my 2c) [19:53:08] thanks urbanecm [19:53:18] (03PS1) 1020after4: group1 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024 [19:53:20] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024 (owner: 1020after4) [19:54:28] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.21 refs T281161 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716024 (owner: 1020after4) [19:55:21] Is it wanted that wmf.20 blocker task is mentioned ^ [19:55:52] 10Puppet, 10GitLab, 10Infrastructure-Foundations, 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) [19:55:57] no ... 
it should mention the wmf.21 blocker task [19:56:26] weird I wonder what went wrong with that [19:56:34] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.21 refs T281161 [19:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:40] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [19:57:41] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.21 refs T281162 [19:57:41] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.21 refs T281161 (duration: 01m 06s) [19:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:44] T281162: 1.37.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T281162 [19:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:59] twentyafterfour: yak: automate selection of train blocker ticket in deploy-promote. [19:59:07] I always forget to supply it [19:59:22] dancy: I have a tool for that but apparently it's not reliable [19:59:33] Let's fix it next week! [19:59:39] 10SRE, 10GitLab, 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10brennen) [20:00:05] twentyafterfour and dancy: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T1900). [20:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T2000) [20:00:12] I'm out next week but my tool is https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/608936 [20:00:21] ok. I'll check it out. 
[20:01:04] it's fallible though because it relies on finding the oldest open train blocker task [20:01:18] so if the previous week is still open at the time it'll fail, and probably other ways as well [20:01:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:42] Understood. I have ideas. [20:02:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:51] twentyafterfour: For reference, what was the exact command you issued? [20:04:11] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:10] dancy: `export PHABTASK=$(current-deployment-blockers)` [20:11:35] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:40] (03PS1) 10RobH: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 [20:15:20] (03PS2) 10RobH: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 [20:15:31] (03CR) 10RobH: [C: 03+2] updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 (owner: 10RobH) [20:16:26] (03Merged) 10jenkins-bot: updating for config c-1g [software] - 10https://gerrit.wikimedia.org/r/716032 (owner: 10RobH) [20:20:15] (03PS2) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009) [20:26:51] group1 appears to be stable ... 
no new errors in the logs at all [20:28:29] 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10mewoph) [20:28:33] 👍🏾 [20:37:03] 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH) [20:37:11] 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH) [20:37:36] 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (10RobH) a:03Jclark-ctr [20:42:56] (03PS5) 10Jdlrobson: Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) [20:59:13] (03PS2) 10Legoktm: mediawiki::maintenance: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm) [20:59:29] (03CR) 10Legoktm: [C: 03+2] mediawiki::maintenance: Add --statsd to updateMenteeData.php [puppet] - 10https://gerrit.wikimedia.org/r/715723 (https://phabricator.wikimedia.org/T278971) (owner: 10Urbanecm) [21:05:41] (03PS1) 10Zabe: query_service: migrate query-service-gc-log-cleanup cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) [21:08:45] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:09:15] (03CR) 10Bstorm: [C: 03+2] toolforge: remove portgrabber [puppet] - 10https://gerrit.wikimedia.org/r/714187 (owner: 10Majavah) [21:13:35] (03CR) 10Legoktm: [C: 03+2] backup: Simplify Mailman backups [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [21:14:31] (03CR) 10Zabe: "This 
doesn't seems to be working: https://puppet-compiler.wmflabs.org/compiler1002/892/wdqs2001.codfw.wmnet/change.wdqs2001.codfw.wmnet.er" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:17:39] (03CR) 10Legoktm: [C: 04-1] query_service: migrate query-service-gc-log-cleanup cron to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:20:21] (03PS1) 10Dave Pifke: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) [21:21:37] (03PS2) 10Zabe: query_service: migrate query-service-gc-log-cleanup cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) [21:22:21] (03PS1) 10RobH: removed sku 403-BCLL by mistake [software] - 10https://gerrit.wikimedia.org/r/716042 [21:22:30] (03PS2) 10RobH: removed sku 403-BCLL by mistake [software] - 10https://gerrit.wikimedia.org/r/716042 [21:23:01] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:23:15] (03CR) 10Bstorm: [C: 03+1] "This looks like what is needed." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) (owner: 10Majavah) [21:23:36] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:25:10] (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/893/wdqs2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:36:18] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Service-deployment-requests, and 2 others: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (10Legoktm) [21:41:25] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm) [21:41:47] (03CR) 10Bstorm: [C: 03+2] quarry: add a simple backup server [puppet] - 10https://gerrit.wikimedia.org/r/715997 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm) [21:59:01] (03PS1) 10Legoktm: Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [labs/private] - 10https://gerrit.wikimedia.org/r/716048 (https://phabricator.wikimedia.org/T289227) [21:59:04] (03PS1) 10Legoktm: Add k8s users/tokens for apple-search [labs/private] - 10https://gerrit.wikimedia.org/r/716049 (https://phabricator.wikimedia.org/T289224) [22:03:06] (03PS1) 10Legoktm: Add k8s tokens/users for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 (https://phabricator.wikimedia.org/T289227) [22:03:09] (03PS1) 10Legoktm: Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224) [22:04:07] (03PS2) 10Legoktm: Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 
(https://phabricator.wikimedia.org/T289227) [22:04:08] (03PS2) 10Legoktm: Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224) [22:14:04] (03PS1) 10Bstorm: quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) [22:16:17] (03CR) 10jerkins-bot: [V: 04-1] quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm) [22:16:19] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s users/tokens for apple-search [labs/private] - 10https://gerrit.wikimedia.org/r/716049 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm) [22:16:24] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [labs/private] - 10https://gerrit.wikimedia.org/r/716048 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:16:45] (03CR) 10Legoktm: [C: 03+2] Add k8s users/tokens for shellbox-{syntaxhighlight,timeline} [puppet] - 10https://gerrit.wikimedia.org/r/716051 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:16:47] (03CR) 10Legoktm: [C: 03+2] Add k8s users/tokens for apple-search [puppet] - 10https://gerrit.wikimedia.org/r/716052 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm) [22:20:58] (03PS1) 10Legoktm: admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227) [22:21:00] (03PS1) 10Legoktm: admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226) [22:21:02] (03PS1) 10Legoktm: admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224) 
[22:24:03] (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:24:06] (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [22:27:10] (03Merged) 10jenkins-bot: admin: Add namespace for shellbox-syntaxhighlight [deployment-charts] - 10https://gerrit.wikimedia.org/r/716054 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [22:27:14] (03Merged) 10jenkins-bot: admin: Add namespace for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716055 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [22:29:43] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [22:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:57] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [22:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:18] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [22:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:54] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [22:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:25] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [22:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:56] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [22:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:25] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[22:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:29] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [22:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:22] (03CR) 10Legoktm: [C: 03+2] admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm) [22:39:01] (03Merged) 10jenkins-bot: admin: Add namespace for apple-search [deployment-charts] - 10https://gerrit.wikimedia.org/r/716056 (https://phabricator.wikimedia.org/T289224) (owner: 10Legoktm) [22:39:54] !log legoktm@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [22:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:56] !log legoktm@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [22:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:02] !log legoktm@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [22:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:30] !log legoktm@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [22:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:00] !log legoktm@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [22:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:25] !log legoktm@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [22:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:41] !log legoktm@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [22:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:39] !log legoktm@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[22:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:47] (03PS1) 10Gergő Tisza: fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868) [22:50:17] (03PS1) 10Gergő Tisza: fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531) [22:50:22] (03PS1) 10Legoktm: Add helmfile.d for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226) [23:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210901T2300). [23:00:04] tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:46] Here [23:01:17] Hi Jdlrobson [23:01:20] I can deploy today [23:02:14] o/ [23:02:30] mine are fire and forget [23:02:49] no testing needed, I mean [23:03:01] (03CR) 10Urbanecm: [C: 03+2] fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868) (owner: 10Gergő Tisza) [23:03:05] (03CR) 10Urbanecm: [C: 03+2] fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531) (owner: 10Gergő Tisza) [23:03:12] ack tgr [23:03:34] Jdlrobson: would you mind amending https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/713653 to add it to the extension-list? 
:-) [23:03:43] oh shoot im sorry i thought it did that [23:03:45] doing that now [23:04:13] !log dpifke@deploy1002 Started deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243 [23:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:17] thanks [23:04:18] T281243: Expose CPU benchmark data to Prometheus/Grafana - https://phabricator.wikimedia.org/T281243 [23:04:19] !log dpifke@deploy1002 Finished deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243 (duration: 00m 06s) [23:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:25] ^ above only affects webperfX001 [23:04:40] and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704171 only adds a SVG, can't find the commit that uses it [23:05:02] (03PS9) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [23:05:40] urbanecm: you can skip the logo patch it's not ready. 
[23:05:48] okay [23:05:51] I was misinformed :) [23:06:13] (03CR) 10Urbanecm: [C: 03+2] Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: 10Jdlrobson) [23:06:17] (03PS4) 10Urbanecm: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: 10Jdlrobson) [23:06:23] (03CR) 10Urbanecm: [C: 03+2] Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: 10Jdlrobson) [23:06:29] (03CR) 10Jdlrobson: [C: 04-1] "This also needs an update in wmf-config/InitialiseSettings.php to set the width and height etc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [23:07:01] 10SRE, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review, and 3 others: Split search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (10Legoktm) [23:08:11] (03Merged) 10jenkins-bot: Enable WVUI search on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715586 (https://phabricator.wikimedia.org/T287215) (owner: 10Jdlrobson) [23:08:51] Jdlrobson: please test at mwdebug201 [23:08:58] (the WVUI search i mean) [23:09:06] on it [23:09:24] (03Abandoned) 10Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715809 (https://phabricator.wikimedia.org/T289941) (owner: 10Jforrester) [23:09:29] (03Abandoned) 10Jforrester: Use privacyPolicy configuration [extensions/QuickSurveys] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/715808 (https://phabricator.wikimedia.org/T289941) (owner: 10Jforrester) [23:09:49] urbanecm: LGMT [23:09:59] please sync [23:09:59] thanks, syncing [23:10:23] (03CR) 
10Legoktm: [C: 03+2] Add helmfile.d for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [23:10:25] (03PS10) 10Urbanecm: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [23:11:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bb7d92c48edf48b94fd628e9e0b5fd6682460373: Enable WVUI search on Wikimedia Commons (T287215) (duration: 01m 07s) [23:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:55] T287215: Enable WVUI search on commons - https://phabricator.wikimedia.org/T287215 [23:11:56] live [23:12:09] (03CR) 10Urbanecm: [C: 03+2] "extension has 2 branches, secreviewed, no reason not to enable it in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [23:12:36] Jdlrobson: just wondering, when do you plan to do it in prod? [23:12:49] (enable the extension, that is) [23:12:58] (03Merged) 10jenkins-bot: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [23:12:58] It depends on performance team at this point [23:13:02] but pretty flexible [23:13:34] so definitely not "later this week" [23:13:36] (03Merged) 10jenkins-bot: Add helmfile.d for shellbox-timeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/716063 (https://phabricator.wikimedia.org/T289226) (owner: 10Legoktm) [23:14:18] in that case all should be fine [23:14:55] urbanecm: definitely not later this week :) [23:15:06] good :) [23:15:19] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[23:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:43] i'm going to sync it to get the variables to prod, beta will self-update soon [23:16:05] (03PS1) 10MSantos: maps: import script is overwritting log [puppet] - 10https://gerrit.wikimedia.org/r/716068 [23:16:22] (03CR) 10jerkins-bot: [V: 04-1] maps: import script is overwritting log [puppet] - 10https://gerrit.wikimedia.org/r/716068 (owner: 10MSantos) [23:16:36] (03PS2) 10MSantos: maps: import script is overwritting log [puppet] - 10https://gerrit.wikimedia.org/r/716068 [23:16:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 1/3) (duration: 01m 06s) [23:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:34] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [23:17:38] thanks urbanecm [23:17:50] np :) [23:18:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:50] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 2/3) (duration: 01m 06s) [23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:13] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[23:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:07] !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: 91ff9273fd9f80b571771a7454d34d63f43405b8: Enable NearbyPages on beta cluster (T246493; 3/3) (duration: 01m 05s) [23:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:20] so, the prod part is done [23:20:25] Jdlrobson: anything else? [23:22:00] nope that's all.. thanks! [23:22:06] any time [23:22:23] (03Merged) 10jenkins-bot: fixLinkRecommendationData: Allow --db-table in dry-run mode [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715824 (https://phabricator.wikimedia.org/T283868) (owner: 10Gergő Tisza) [23:22:38] just in time [23:24:15] (03Merged) 10jenkins-bot: fixLinkRecommendationData: stay under 10K search limit [extensions/GrowthExperiments] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/715825 (https://phabricator.wikimedia.org/T284531) (owner: 10Gergő Tisza) [23:24:42] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 3c7d4ecc699b7c68467a372686f5514375d2b74f: fixLinkRecommendationData: Allow --db-table in dry-run mode (T283868) (duration: 01m 06s) [23:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:47] T283868: Monitor "no suggestion" rate for Add Link tasks - https://phabricator.wikimedia.org/T283868 [23:25:22] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . 
[23:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:21] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 0bd65426494d4df981141650211e27e17c98ee0c: fixLinkRecommendationData: stay under 10K search limit (T284531) (duration: 01m 06s) [23:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:25] T284531: Add Link: Work around 10K search result set limit in fixLinkRecommendationData.php - https://phabricator.wikimedia.org/T284531 [23:27:28] tgr: both done [23:27:32] anything else i can help with? [23:27:54] thanks! [23:28:01] np [23:30:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:39] 10SRE, 10Traffic, 10PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (10BBlack) #traffic-icebox now exists as a new tag with a process-informative description (click it and read!). I've bulk (+silent) moved all open #traffic tickets which had no activity for >= 6 months over to... [23:43:20] (03PS2) 10Bstorm: quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) [23:45:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:57] (03CR) 10Ladsgroup: [C: 03+1] dumps: migrate cron of dumps-exception-checker to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711011 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [23:50:01] (03CR) 10Thcipriani: [C: 03+1] Italian Wikipedia is now a group 1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [23:50:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:30] (03CR) 10Andrew Bogott: [C: 03+1] "Seems right to my not-very-trustworthy eyes" [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm) [23:50:33] !log mwscript createAndPromote.php --wiki=test2wiki --sysop --force Ladsgroup [23:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:26] (03PS1) 10Jdlrobson: Fix Wikidata API url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716073 [23:52:22] (03CR) 10Bstorm: [C: 03+2] quarry backup: change the cleanup job to check number of backups [puppet] - 10https://gerrit.wikimedia.org/r/716053 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm)