[00:00:39] (03Abandoned) 10Dzahn: peopleweb: add monitor for people.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/900741 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [00:02:10] (03CR) 10Dzahn: [C: 03+2] "both probes working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*etherpad.*%22%7D&g0.tab=1&g0.stacked=0" [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [00:03:26] (WidespreadPuppetFailure) firing: Puppet has failed on wdqs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:03:34] (SystemdUnitFailed) firing: (15) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:34] (SystemdUnitFailed) firing: (15) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:14] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10KFrancis) I am confirming the NDA has been signed. Please proceed with the access request. Thanks! [00:26:36] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10wiki_willy) @Cmjohnson & @Papaul - can you guys provide an ETR on this one? 
Thanks, Willy [00:33:26] (WidespreadPuppetFailure) resolved: Puppet has failed on wdqs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:47:26] (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:51:26] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:51:59] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Htriedman) @Dzahn Thank you so much for the help explaining this! Makes a ton of sense, and I'll create that ticket soon. @Ottomata Unfortunately I'm trying to get something from hdfs and publish it to `/s... 
[01:02:39] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:19] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0d9ddda280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [01:03:19] org/wiki/Search%23Administration [01:03:54] (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:07:33] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10phaultfinder) [01:12:05] RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:12:43] RECOVERY - OpenSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 607, active_shards: 1330, relocating_shards: 1, initializing_shards: 3, unassigned_shards: 69, delayed_unassigned_sh [01:12:43] number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.8644793152639 
https://wikitech.wikimedia.org/wiki/Search%23Administration [01:13:49] (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:21:26] (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:22:26] (WidespreadPuppetFailure) resolved: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:24:15] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:43:54] (03PS2) 10Krinkle: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) [01:45:00] * Krinkle testing on mwdebug2001 [01:45:05] (03CR) 10Krinkle: [C: 03+2] mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [01:45:48] (03Merged) 10jenkins-bot: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 
(https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [01:50:20] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Quiddity) [01:52:26] (03CR) 10Krinkle: [C: 03+2] "Test plan: Make an edit on test2.wikipedia via WikimediaDebug with verbose logging enabled. Then, confirm in Logstash that for the given w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) (owner: 10Krinkle) [01:58:34] (SystemdUnitFailed) firing: (16) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:16] !log krinkle@deploy2002 Synchronized wmf-config/mc.php: I44edcd46da45b827d (duration: 06m 33s) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200) [02:03:39] (SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:03:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) [02:07:49] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [02:08:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:18:39] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 386.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [02:22:19] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [02:25:16] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Patch-For-Review: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (10Krinkle) [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:21] (03PS1) 10Andrew Bogott: Trove: adjust timeouts yet again [puppet] - 10https://gerrit.wikimedia.org/r/903361 
[02:30:26] (03CR) 10Andrew Bogott: [C: 03+2] Trove: adjust timeouts yet again [puppet] - 10https://gerrit.wikimedia.org/r/903361 (owner: 10Andrew Bogott) [02:47:21] RECOVERY - confd service on an-worker1132 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:48:52] 10SRE, 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Machine Learning team - k8s resources access - https://phabricator.wikimedia.org/T333174 (10Ladsgroup) [02:49:11] 10SRE, 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team, 10Patch-For-Review: Machine Learning team - k8s resources access - https://phabricator.wikimedia.org/T333174 (10Ladsgroup) [02:53:05] PROBLEM - confd service on an-worker1132 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0300) [03:03:34] (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:05:25] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:21] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:59:23] PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2023-03-28 03:47:53 is 108 KiB, but the previous one was 92 KiB, a change of +17.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:13:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:14:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:19:19] PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1115) taken on 2023-03-28 03:52:10 is 107 KiB, but the previous one was 91 KiB, a change of +17.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:20:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.400 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:21:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:11:54] (03PS1) 10Marostegui: db1179: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/903364 (https://phabricator.wikimedia.org/T332292) [05:13:53] (03CR) 10Marostegui: [C: 03+2] db1179: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/903364 (https://phabricator.wikimedia.org/T332292) (owner: 10Marostegui) [05:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P45955 and previous config saved to /var/cache/conftool/dbconfig/20230328-051539-root.json [05:28:34] (SystemdUnitFailed) firing: (5) 
idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P45956 and previous config saved to /var/cache/conftool/dbconfig/20230328-053043-root.json [05:40:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [05:45:25] (03Merged) 10jenkins-bot: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [05:45:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P45957 and previous config saved to /var/cache/conftool/dbconfig/20230328-054548-root.json [05:53:20] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [05:53:57] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [05:55:28] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [05:55:55] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0600) [06:00:05] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0600). 
nyaa~ [06:00:35] <_joe_> jouncebot: next [06:00:35] In 0 hour(s) and 59 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0700) [06:00:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P45958 and previous config saved to /var/cache/conftool/dbconfig/20230328-060053-root.json [06:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 T329481', diff saved to https://phabricator.wikimedia.org/P45959 and previous config saved to /var/cache/conftool/dbconfig/20230328-061441-root.json [06:14:48] T329481: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 [06:15:45] (03PS1) 10Marostegui: db1104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/903537 (https://phabricator.wikimedia.org/T329481) [06:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45960 and previous config saved to /var/cache/conftool/dbconfig/20230328-061558-root.json [06:16:11] (03CR) 10Marostegui: [C: 03+2] db1104: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/903537 (https://phabricator.wikimedia.org/T329481) (owner: 10Marostegui) [06:18:39] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 310.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [06:28:26] (03CR) 10Abijeet Patro: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/903546 (owner: 10Abijeet Patro) [06:31:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45961 and previous config saved to /var/cache/conftool/dbconfig/20230328-063103-root.json [06:31:41] RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45962 and previous config saved to /var/cache/conftool/dbconfig/20230328-064607-root.json [06:51:14] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Joe) 05Ope... [06:51:24] (03PS15) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [06:56:29] (03PS16) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [06:57:32] (03CR) 10JMeybohm: [C: 03+2] k8s: Use storage-driver instead of storage_driver [puppet] - 10https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: 10Ahmon Dancy) [06:58:53] (03CR) 10Jforrester: "Filed the merge failure as T333291." [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [07:00:05] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:13] o/ [07:00:16] kart_: want to self-deploy? [07:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45963 and previous config saved to /var/cache/conftool/dbconfig/20230328-070112-root.json [07:02:45] * kart_ is here [07:02:55] taavi: I can self deploy this. [07:03:24] (03PS2) 10KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834) [07:03:34] (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:06:39] (03PS17) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [07:07:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834) (owner: 10KartikMistry) [07:08:21] (03Merged) 10jenkins-bot: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834) (owner: 10KartikMistry) [07:08:43] !log kartik@deploy2002 Started scap: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation 
remains in beta (T308834)]] [07:08:49] T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834 [07:10:51] !log kartik@deploy2002 kartik: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation remains in beta (T308834)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [07:14:08] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [07:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45964 and previous config saved to /var/cache/conftool/dbconfig/20230328-071617-root.json [07:20:26] (03PS3) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 [07:20:49] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation remains in beta (T308834)]] (duration: 12m 05s) [07:20:54] T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834 [07:21:58] taavi: I'm done. 
[07:22:10] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40356/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [07:23:34] (SystemdUnitFailed) firing: (64) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:27:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'clear' for AS: 17806 [07:28:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 17806 [07:28:28] (03PS4) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 [07:28:34] (SystemdUnitFailed) firing: (64) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45965 and previous config saved to /var/cache/conftool/dbconfig/20230328-073122-root.json [07:31:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40357/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [07:34:47] (03CR) 10Ayounsi: [C: 03+2] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [07:37:32] (03CR) 10Slyngshede: [V: 03+1] P:url_downloader send Squid access logs to Logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [07:38:02] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move reads to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [07:38:06] (03PS2) 10Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165) [07:38:34] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 206.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [07:40:09] !log move graphite reads to codfw - T330165 [07:40:19] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: check graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [07:40:21] (03Merged) 10jenkins-bot: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [07:47:56] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [07:50:57] jouncebot: next [07:50:57] In 2 hour(s) and 9 
minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000) [07:51:53] !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:51:54] !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:54:44] !log root@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:54:45] !log root@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:56:13] !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:56:15] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:56:20] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:56:36] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:00:06] !log move graphite reads to codfw - T330165 [08:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:17] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [08:00:20] !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:00:33] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move writes to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [08:00:38] (03PS2) 10Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - 10https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165) [08:00:41] (03CR) 10Filippo Giunchedi: [C: 03+2] statsd: move writes to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [08:01:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 254.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [08:01:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [08:02:25] (03Merged) 10jenkins-bot: Failover statsd to graphite2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi) [08:02:26] !log ayounsi@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:02:36] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] [08:03:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on es[1020-1022].eqiad.wmnet with reason: Switch maintenance [08:03:11] !log ayounsi@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[08:03:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on es[1020-1022].eqiad.wmnet with reason: Switch maintenance [08:04:11] !log oblivian@deploy2002 oblivian and filippo: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:04:22] 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) p:05Triage→03Medium [08:05:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 21 hosts with reason: Switch maintenance [08:05:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 21 hosts with reason: Switch maintenance [08:05:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 16 hosts with reason: Switch maintenance [08:06:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 16 hosts with reason: Switch maintenance [08:06:47] 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) [08:08:34] !log ayounsi@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [08:08:34] (SystemdUnitFailed) firing: (16) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @papaul looks good to me. 
I can do them any day this week except today (Tuesday), so whenever... [08:09:03] <_joe_> godog: php restarts happening [08:09:12] <_joe_> you should see the traffic shifting [08:09:16] _joe_: ok! thank you [08:09:30] I'm looking at this guy https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&refresh=1m&from=1679987363327&to=1679990963327&viewPanel=14 [08:11:25] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] (duration: 08m 48s) [08:11:30] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [08:12:50] (03CR) 10Clément Goubert: [C: 03+2] P:kubernetes::node: Use performance governor [puppet] - 10https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: 10Clément Goubert) [08:13:17] (03PS6) 10Clément Goubert: P:kubernetes::node: Use performance governor [puppet] - 10https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) [08:13:34] (SystemdUnitFailed) firing: (61) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:38] !log ayounsi@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [08:18:34] (SystemdUnitFailed) firing: (61) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:13] !log ayounsi@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[08:21:26] (03PS1) 10Stevemunene: Deprecate oozie services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) [08:24:02] 10SRE, 10SRE-Access-Requests: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10abi_) [08:24:39] (03CR) 10LSobanski: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [08:25:30] !log ayounsi@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:26:20] (03PS1) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 [08:27:33] (03PS1) 10Giuseppe Lavagetto: requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597 [08:27:35] (03PS1) 10Giuseppe Lavagetto: Add a ConftoolClient class to ease initialization by clients [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 [08:27:37] (03PS1) 10Giuseppe Lavagetto: Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 [08:27:39] (03PS1) 10Giuseppe Lavagetto: Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 [08:28:36] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [08:29:19] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40359/console" [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene) [08:29:25] !log ayounsi@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[08:30:27] (03CR) 10CI reject: [V: 04-1] Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 (owner: 10Giuseppe Lavagetto) [08:30:35] (03CR) 10CI reject: [V: 04-1] Add a ConftoolClient class to ease initialization by clients [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto) [08:30:37] (03CR) 10CI reject: [V: 04-1] Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 (owner: 10Giuseppe Lavagetto) [08:31:26] !log ayounsi@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [08:31:27] (03PS3) 10Abijeet Patro: Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298) [08:32:09] !log phedenskog@deploy2002 Started deploy [performance/navtiming@e757bdf]: (no justification provided) [08:32:11] (03PS2) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 [08:32:15] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@e757bdf]: (no justification provided) (duration: 00m 06s) [08:32:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597 (owner: 10Giuseppe Lavagetto) [08:32:25] !log ayounsi@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [08:32:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Nikerabbit) I approve. Though, this should be just a key update. [08:32:49] (03CR) 10Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [08:34:20] !log ayounsi@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[08:35:17] !log ayounsi@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:35:26] (03Merged) 10jenkins-bot: requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597 (owner: 10Giuseppe Lavagetto) [08:35:54] 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) p:05Medium→03Low Had a quick chat with @ayounsi on irc about this, seems it's related to some of the validation scripts, should be easy to fix. [08:36:09] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) The firmware update cookbook does offer a firmware update; I was going to apply it once the disks were swapped (as rebooting the system with drives in a funny state... [08:36:43] (03CR) 10Btullis: [C: 03+1] "Many thanks. Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [08:37:08] 10SRE, 10Data-Persistence, 10Traffic-Icebox, 10serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (10Marostegui) [08:37:37] !log ayounsi@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:37:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8732605, @cmooney wrote: > > @aborrero are we ok to proceed with theis second... [08:38:18] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [08:39:28] (03CR) 10Btullis: [C: 03+1] "Also looks good to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [08:39:34] !log ayounsi@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:39:57] 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Jelto) [08:40:16] (03PS2) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) [08:40:42] (03CR) 10Btullis: [C: 03+1] "Great stuff, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [08:41:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [08:41:06] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:42:48] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.wikimedia.org [08:43:43] !log ayounsi@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:44:17] (03PS1) 10Marostegui: orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596) [08:45:17] !log ayounsi@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[08:45:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [08:46:30] (03CR) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [08:48:01] (03CR) 10MVernon: [C: 03+1] orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [08:48:12] (03CR) 10Marostegui: [C: 03+2] orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [08:48:34] (03PS3) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) [08:49:12] !log ayounsi@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:50:56] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.wikimedia.org [08:52:09] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/903603 (owner: 10Clément Goubert) [08:55:33] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm no sure about all the details involved if w... 
[08:57:10] (03CR) 10Clément Goubert: [C: 03+2] P:docker::prune_old_images: Fix type [puppet] - 10https://gerrit.wikimedia.org/r/903603 (owner: 10Clément Goubert) [08:58:29] !log restart ipmiseld on cp2035 [08:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:50] (03CR) 10Jelto: [C: 03+2] "new disks arrived, merging the new partman config" [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [09:01:16] (03CR) 10Btullis: "Looks good. I have one query about another couple of alerts that we might be able to remove, but I couldn't find them." [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [09:03:17] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm no sure about all the details involved if w... [09:03:38] 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez) [09:03:49] (03PS4) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) [09:04:27] 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez) p:05Triage→03Medium [09:04:40] (03PS4) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) [09:05:11] (03CR) 10Elukey: [C: 03+1] "I like the idea thanks! 
Let's see what Janis thinks about it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris) [09:06:19] (03CR) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [09:06:37] (03CR) 10JMeybohm: [C: 03+1] "I think that's the right way of doing it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris) [09:09:44] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 17 NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40361/console" [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [09:10:59] !log jbond@cumin1001 START - Cookbook sre.idm.logout Logging Nicolas Fraison out of systemdlogoutd on: 2048 hosts [09:11:08] !log jbond@cumin1001 END (ERROR) - Cookbook sre.idm.logout (exit_code=97) Logging Nicolas Fraison out of systemdlogoutd on: 2048 hosts [09:11:44] !log jbond@cumin1001 START - Cookbook sre.idm.logout Logging Nicolas Fraison out of all services on: 2048 hosts [09:12:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nicolas Fraison out of all services on: 2048 hosts [09:13:42] (03CR) 10Jaime Nuche: Revert "deployment_server: ensure Docker is installed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn) [09:13:47] (03CR) 10Btullis: [C: 03+1] "Looks good. We know that the oozie profile is also added by the Hui UI role, but that is being deprecated with oozie anyway, so +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene) [09:15:09] (03PS1) 10Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) [09:20:16] (03CR) 10Volans: [C: 04-1] "Missing some bits" [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [09:20:39] (03PS1) 10Vgutierrez: admin: Remove shared SSH key with WMCS for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903606 [09:20:48] (03CR) 10David Caro: [C: 04-1] "Half-refactor and "having to get on a plane let's push to save progress" kinda patch" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [09:21:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903606 (owner: 10Vgutierrez) [09:21:41] (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/903605/40362/" [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [09:21:57] (03CR) 10Vgutierrez: [C: 03+2] admin: Remove shared SSH key with WMCS for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903606 (owner: 10Vgutierrez) [09:22:43] (03CR) 10Hashar: "Marking my comment about using ECS as solved after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903239" [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [09:26:01] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10jbond) 05Resolved→03Open @Trokhymovych We have noticed that you have stared to use your production key in WMCS. as a precaution [[ https://gerrit.wikimedia.org/r/c/operations... 
[09:26:37] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) p:05Triage→03Medium [09:26:51] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) [09:26:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [09:27:14] (03PS1) 10Filippo Giunchedi: alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709) [09:28:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1001.eqiad.wmnet with reason: stop kafka and dist-upgrade [09:28:31] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) Ah, thanks @Dzahn - I think the reason for these leftover processes is the kerberos automatic tickets renewal mechanism that I put in place in {T268985} It enables 'linger... 
[09:28:34] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:28:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1001.eqiad.wmnet with reason: stop kafka and dist-upgrade [09:28:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 10Dzahn) [09:30:24] (03CR) 10Jaime Nuche: deployment_server: ensure Docker is installed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [09:31:31] (03CR) 10Jbond: [C: 03+1] alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709) (owner: 10Filippo Giunchedi) [09:33:23] (03CR) 10Jaime Nuche: "Thanks for bearing with me on this Daniel." [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [09:34:11] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709) (owner: 10Filippo Giunchedi) [09:34:18] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) >>! In T333135#8732974, @BTullis wrote: > Ah, thanks @Dzahn - I think the reason for these leftover processes is the kerberos automatic tickets renewal mechanism that I put... 
[09:34:32] PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:02] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) Logs from switch at during operation: `lines=20 Mar 28 09:28:50 cloudsw1-b1-codfw sshd[11342]: WARNING: could not open /etc/ssh/moduli... [09:35:10] (03PS1) 10Btullis: Disable the gobblin timers temporarily for switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) [09:35:32] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) 05Resolved→03In progress [09:35:40] !log depool cp2035 - T333312 [09:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] T333312: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 [09:36:05] !log silence systemdunitfailed alerts for team=wmcs - T333315 [09:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:10] T333315: WMCS: hundred of phabricator tickets were created for some alerts - https://phabricator.wikimedia.org/T333315 [09:36:16] 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) [09:36:57] (03CR) 10Btullis: "This is to be deployed at around 12:50 UTC, in order to pause ingestion to HDFS." 
[puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [09:38:31] !log dist-upgrade kafka-main1001 to bullseye - T332013 [09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:36] T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 [09:41:04] !log resetting cp2035 management card - T333312 [09:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:10] T333312: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 [09:42:11] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10Trokhymovych) @jbond New Public SSH key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFCyl+eu4X9cI/XT6nCSvud+X6LJyVV7Rcr1g4MnP2xf trokhymovych.mykola@gmail.com [09:43:51] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Deprecate oozie services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene) [09:45:16] 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez) Unable to reset the management card: ` root@cp2035:~# bmc-device --cold-reset; echo $? 
ipmi_cmd_cold_reset: driver timeout 1 ` [09:45:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: HW issues [09:45:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: HW issues [09:45:48] RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:04] (03PS5) 10Jbond: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [09:46:12] 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=07b8190f-1479-43ea-ba98-63f852f30e9e) set by vgutierrez@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with r... [09:46:24] (03CR) 10Jbond: [C: 04-1] "lgtm just a minor change needed" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [09:46:36] (03PS6) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) [09:49:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:49:41] (03PS1) 10Jbond: admin: add ssh key for Trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903612 (https://phabricator.wikimedia.org/T315262) [09:49:48] the under replicated partitions is due to kafka-main1001 being upgraded [09:51:54] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad 
row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [09:54:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:54:53] (03CR) 10Jbond: [C: 03+2] admin: add ssh key for Trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903612 (https://phabricator.wikimedia.org/T315262) (owner: 10Jbond) [09:55:09] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:18] (SystemdUnitFailed) firing: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:41] (03CR) 10JMeybohm: [C: 03+2] k8s: Remove 1.16 related code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [09:56:15] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye [09:56:25] (SystemdUnitFailed) firing: (7) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:30] (SystemdUnitFailed) firing: (2) 
planet_sync_tile_generation-gis.service Failed on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:51] (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10jbond) 05Open→03Resolved >>! In T315262#8733314, @Trokhymovych wrote: > @jbond > New Public SSH key: > ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFCyl+eu4X9cI/... [09:57:11] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:59:15] (03PS3) 10EoghanGaffney: Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) [09:59:31] (SystemdUnitFailed) firing: kubelet.service Failed on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000) [10:01:09] (03PS1) 10Filippo 
Giunchedi: sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) [10:01:21] (03PS1) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 [10:02:58] (03CR) 10CI reject: [V: 04-1] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:03:19] (03CR) 10CI reject: [V: 04-1] Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 (owner: 10JMeybohm) [10:04:36] (03PS2) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 [10:04:47] (03PS3) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 [10:07:01] (03PS1) 10Alexandros Kosiaris: thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 [10:07:33] (03CR) 10JMeybohm: [C: 03+2] Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 (owner: 10JMeybohm) [10:10:51] (03PS2) 10Filippo Giunchedi: sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) [10:11:19] (03CR) 10Hnowlan: [C: 03+1] thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris) [10:11:45] (03PS1) 10Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 [10:11:56] (03PS2) 10Volans: remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 [10:12:08] (03CR) 10Volans: [C: 03+2] remote: add results to RemoteExecutionError [software/spicerack] - 
10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans) [10:12:48] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage [10:14:11] (03PS1) 10Elukey: admin_ng: lower the typha pods to 1 in ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/903616 (https://phabricator.wikimedia.org/T333302) [10:14:31] (SystemdUnitFailed) resolved: kubelet.service Failed on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:46] (03Merged) 10jenkins-bot: remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans) [10:15:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] admin_ng: increase namespace cpu quota for thumbor, increase replicas (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [10:16:01] (03PS2) 10Giuseppe Lavagetto: Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 [10:16:03] (03PS2) 10Giuseppe Lavagetto: Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 [10:16:06] (03PS2) 10Giuseppe Lavagetto: Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 [10:16:10] (03PS1) 10Giuseppe Lavagetto: Add black formatting and enforcement [software/conftool] - 10https://gerrit.wikimedia.org/r/903617 [10:16:21] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage [10:16:58] (03PS2) 10Ladsgroup: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 
10Dzahn) [10:17:03] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 10Dzahn) [10:19:48] (03CR) 10Elukey: [C: 03+2] admin_ng: lower the typha pods to 1 in ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/903616 (https://phabricator.wikimedia.org/T333302) (owner: 10Elukey) [10:21:49] (03PS1) 10Jbond: openldap: drop sre-admins from the list of ops members [puppet] - 10https://gerrit.wikimedia.org/r/903619 [10:22:31] (03CR) 10Jbond: [C: 03+2] openldap: drop sre-admins from the list of ops members [puppet] - 10https://gerrit.wikimedia.org/r/903619 (owner: 10Jbond) [10:23:15] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10phaultfinder) [10:23:29] (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:24] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:24:28] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:24:42] (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:26:08] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond) [10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:28:40] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) 05In progress→03Resolved I added Barakat to wmf ldap group. They should be able to access grafana and such. [10:28:53] nel page? 
[10:29:02] !ack [10:29:02] no value provided for parameter incident and no default available [10:29:02] Incident id must be an integer [10:29:09] !incidents [10:29:09] 3512 (UNACKED) NELHigh sre (tcp.timed_out) [10:29:13] !ack 3512 [10:29:14] 3512 (ACKED) NELHigh sre (tcp.timed_out) [10:29:27] tcp timeouts [10:29:31] (03PS6) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [10:30:19] (03CR) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [10:30:39] brief spike of timeouts from India [10:31:45] but nothing persistent it seems [10:31:52] almost all from a single IP? weird [10:32:07] (03PS1) 10Btullis: Failover hive services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/903621 (https://phabricator.wikimedia.org/T330165) [10:32:11] ah no sorry, misreading [10:32:14] target IP :D [10:32:16] makes sense [10:32:17] volans: could be GNAT restarting [10:32:26] upload-eqsin [10:32:27] CGNAT ? 
[10:32:27] oh ok yes that makes more sense :) [10:32:35] upload-eqsin, just 1 ISP [10:32:48] akosiaris: no that's Other's ISP [10:32:51] what jbond says is a pretty plausible explanation [10:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:02] volans: I see 1 specific one [10:33:06] not "Other" [10:33:17] ah, no scratch that [10:33:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [10:33:19] you are right [10:34:14] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [10:35:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 235.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [10:35:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Ladsgroup) a:03Ladsgroup Clinic duty this week, it doesn't need the full process. Let me double check something and get back to you. 
[10:36:31] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:39:14] (03CR) 10EoghanGaffney: [C: 03+2] Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [10:41:31] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:10] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @ayounsi thanks for the response. Overall I've no objection so let's proceed. I agree in terms of addin... 
[10:43:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:43:30] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:45:40] (03CR) 10Btullis: [C: 03+2] Failover hive services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/903621 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [10:46:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10cmooney) Thanks for the task, does indeed look like a useful tool that could simplify adding additional monitoring without having to modify the LibreNMS codeb... [10:46:29] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10cmooney) a:03cmooney [10:47:26] (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:48:29] (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:42] (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:21] (03PS1) 10Ladsgroup: api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942) [10:52:10] jouncebot: nowandnext [10:52:10] For the next 0 hour(s) and 7 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000) [10:52:11] In 2 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [10:52:11] In 2 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [10:52:18] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [10:53:41] (03PS4) 10Ladsgroup: Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298) (owner: 10Abijeet Patro) [10:53:46] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298) (owner: 10Abijeet Patro) [10:53:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris) [10:55:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Ladsgroup) 05Open→03Resolved you'll have access with the new keys in thirty minutes [10:57:40] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 
10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) [10:57:42] (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) [10:58:31] (03Merged) 10jenkins-bot: thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris) [10:59:22] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) [10:59:24] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) [11:00:22] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:00:27] (03Abandoned) 10Clément Goubert: cpufrequtils: Force reload init script on change [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert) [11:01:39] (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992) [11:03:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [11:03:49] (03PS2) 10Ladsgroup: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry) [11:04:05] (03PS6) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 [11:04:20] (03CR) 10Slyngshede: P:url_downloader send Squid access logs to Logstash (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [11:04:36] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry) [11:05:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Merged and deployed the patch, it should be doable in half an hour. [11:08:16] jouncebot: nowandnext [11:08:16] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [11:08:16] In 1 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [11:08:17] In 1 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [11:08:24] (03CR) 10Ladsgroup: [C: 03+2] api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942) (owner: 10Ladsgroup) [11:08:45] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40363/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [11:13:16] 10SRE, 10SRE-Access-Requests: Requesting access to production for kamila - 
https://phabricator.wikimedia.org/T332921 (10Ladsgroup) Does this need anything from SRE now? I assume Hugh already took care of the most. [11:14:32] 10SRE, 10SRE-Access-Requests: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10hnowlan) 05Open→03Resolved [11:14:50] (03CR) 10Hnowlan: [C: 03+2] admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [11:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:19:42] (SystemdUnitFailed) firing: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:53] (03Merged) 10jenkins-bot: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan) [11:20:23] (03PS1) 10Btullis: Disable job submission to YARN queues to faciliatate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) [11:21:44] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:22:00] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:22:06] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40364/console" [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [11:22:21] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:22:26] (WidespreadPuppetFailure) resolved: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:23:10] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:23:29] (SystemdUnitFailed) resolved: (2) planet_sync_tile_generation-gis.service Failed on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:29] (SystemdUnitFailed) resolved: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:29] (SystemdUnitFailed) resolved: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:39] (SystemdUnitFailed) resolved: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:23:49] (SystemdUnitFailed) resolved: (7) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:07] (03PS1) 10Jbond: O:cluster/management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 [11:24:42] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:24:58] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:02] (03Merged) 10jenkins-bot: api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942) (owner: 10Ladsgroup) [11:27:13] (03PS2) 10Jbond: O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 [11:28:10] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [11:28:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40365/console" [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond) [11:28:36] (03PS5) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) [11:28:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [11:29:21] 
10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Peter) 05Resolved→03Open Hmm maybe something needs all to be done on the Grafana side? When @BAbiola-WMF tries to login to Grafana she gets //407:Proxy Authentication Required// or UNEXPECTED_PROXY... [11:30:32] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fnegri) I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/Depool_... [11:32:57] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]] [11:33:03] T332942: Warning: SQLPlatform::isWriteQuery fallback to regex (from ApiQueryRevisions) - https://phabricator.wikimedia.org/T332942 [11:34:02] (03PS1) 10Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 [11:34:24] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [11:34:31] (03PS3) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 [11:34:51] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:36:39] (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [11:37:17] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:38:44] 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) @ayounsi I 
think this error I'm hitting is possibly similar: ` pynetbox.core.query.RequestError: The request failed with code 500 Internal Server Error: {'error': 'Cable object... [11:39:43] (03CR) 10Hashar: [C: 03+1] zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [11:40:44] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) They aren't showing up in https://ldap.toolforge.org/group/wmf maybe I messed up something in ldap change. Let me double check [11:40:51] (03CR) 10Hashar: "The parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/901576/ fixed up the Puppet manifests to ensure all three services " [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [11:44:13] (03CR) 10Elukey: [C: 03+1] Disable job submission to YARN queues to faciliatate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [11:45:38] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) They show up in ldap search: ` ladsgroup@mwmaint2002:~$ ldapsearch -x cn=wmf ... member: uid=babiola,ou=people,dc=wikimedia,dc=org ` My guess is that it needs to propagate but let me check... 
[11:47:12] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:47:26] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:47:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:48:06] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:48:18] (03PS4) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 [11:49:49] (03PS2) 10EoghanGaffney: Add doc host apache/php-fpm logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) [11:51:40] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]] (duration: 18m 42s) [11:51:45] T332942: Warning: SQLPlatform::isWriteQuery fallback to regex (from ApiQueryRevisions) - https://phabricator.wikimedia.org/T332942 [11:52:19] (03CR) 10EoghanGaffney: [C: 03+2] Add doc host apache/php-fpm logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [11:52:41] (03CR) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [11:56:22] !log dist-upgrade kafka-main1002 to debian bullseye - T332013 [11:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:27] T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 [11:57:01] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1002.eqiad.wmnet 
with reason: stop kafka and dist-upgrade [11:57:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1002.eqiad.wmnet with reason: stop kafka and dist-upgrade [11:58:43] (03CR) 10Stevemunene: [C: 03+1] Disable the gobblin timers temporarily for switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [11:59:09] (03CR) 10Stevemunene: [C: 03+1] Disable job submission to YARN queues to faciliatate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [12:04:51] (03PS2) 10Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 [12:05:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.3k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:08:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:08:58] (03PS8) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [12:09:31] !log eoghan@cumin1001 START - Cookbook sre.ganeti.reimage for host aphlict1002.eqiad.wmnet with OS bullseye [12:10:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:10:34] this is kafka-main1002 being upgraded --^ [12:13:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:14:06] (03CR) 10Effie Mouzeli: "PCC has the expected changes" [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli) [12:14:25] 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks for the dogfooding :) I removed the TAG check from the CR see diff: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extra... [12:14:43] (03PS3) 10Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 [12:15:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 45295 [12:16:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45295 [12:17:16] (03CR) 10Volans: [C: 03+1] "I've not tested it, if it's a noop in the generated results in the repo LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [12:17:48] (03CR) 10Elukey: [C: 03+2] admin: Grant kserve API group read access to deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris) [12:17:50] (03CR) 10Slyngshede: [C: 03+1] "Minor nit, see inline, looks good otherwise." [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [12:20:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:20:44] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:20:46] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:21:31] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage [12:22:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:24:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 250.3k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:24:52] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage [12:26:26] (WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:27:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:27:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:29:13] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) [12:29:35] (03PS1) 10JMeybohm: 
Revert "Revert "k8s: Remove 1.16 related code"" [puppet] - 10https://gerrit.wikimedia.org/r/903560 [12:30:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [12:31:26] (WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:31:39] (03CR) 10Slyngshede: [C: 03+2] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [12:31:57] XioNoX: o/ the Singtel transport link between uslfo and eqsin seems down (at least according to BFD), I don't find any scheduled maintenance though [12:32:28] (03CR) 10Hashar: doc: upgrade php from 7.3 to 7.4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [12:34:20] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112 [12:34:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112 [12:36:08] elukey: seems up right now but flapping regularly, looking [12:36:20] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:28] !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aphlict1002.eqiad.wmnet with OS bullseye [12:37:29] (03PS3) 10David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) [12:37:35] (03PS5) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [12:37:39] (03PS5) 10David Caro: 
maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [12:37:43] (03PS5) 10David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [12:37:48] (03PS7) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [12:38:10] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108 [12:38:17] however telxius seems down [12:38:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [12:38:26] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:38:52] (03CR) 10David Caro: maintain-dbusers: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [12:39:50] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:40:28] (03Merged) 10jenkins-bot: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede) [12:41:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10fgiunchedi) I took a quick look at the exporter and looks good to me too! Also +1 on the general testing/deployment plan re: SSH from a quick read through th... 
[12:42:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/903623/40369/" [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:43:20] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [12:43:28] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108 [12:43:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [12:44:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 205.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:44:39] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40370/console" [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto) [12:44:43] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108 [12:44:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108 [12:46:06] opened https://phabricator.wikimedia.org/T333342 about telxius [12:47:01] (03CR) 10Volans: [C: 03+1] "LGTM if the output works fine for NetOps" [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [12:48:15] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) They are now in the list in 
https://ldap.toolforge.org/group/wmf >User:Barakat Ajadi (more info) Is it fixed now? [12:50:43] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [12:50:55] (03CR) 10Btullis: [C: 03+2] Disable the gobblin timers temporarily for switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [12:52:52] PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:52:52] (03PS1) 10EoghanGaffney: Add aphlict role to new vm host [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) [12:53:12] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:17] (03PS1) 10Ayounsi: Depool eqiad frontends for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/903642 (https://phabricator.wikimedia.org/T330165) [12:53:52] (03PS1) 10Hashar: wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/903643 [12:54:33] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40371/console" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:56:27] (03CR) 10Hashar: "Unrelated to this change, Gerrit shows below the commit message "Error while fetching results for wm-checks-api: TypeError: compiled is nu" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [12:56:29] (03CR) 10Ssingh: [C: 03+1] Depool eqiad frontends for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/903642 
(https://phabricator.wikimedia.org/T330165) (owner: 10Ayounsi) [12:56:39] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [12:56:41] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [12:56:54] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:56:58] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:57:49] 10SRE, 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team: Machine Learning team - k8s resources access - https://phabricator.wikimedia.org/T333174 (10elukey) 05Open→03Resolved a:03elukey Took the liberty to merge Alexandro's proposal, since the isvc resources don't really contain anything... [12:58:05] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 [12:58:10] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [12:58:27] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgr... [12:58:39] (03PS8) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [12:58:45] (03CR) 10Ayounsi: [C: 03+2] Depool eqiad frontends for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/903642 (https://phabricator.wikimedia.org/T330165) (owner: 10Ayounsi) [12:59:00] (03CR) 10Volans: "Some minor nits inline, LGTM in general, but I didn't check at all the logic of the export that I leave to netops." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond) [12:59:47] (03CR) 10Vgutierrez: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [12:59:49] !log depool eqiad for network maintenance - T330165 [12:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300) [13:00:23] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto) [13:00:50] looks like nothing to deploy indeed [13:02:36] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:07:14] (03CR) 10Raymond Ndibe: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) (owner: 10David Caro) [13:07:58] (03CR) 10Clément Goubert: [C: 03+1] cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli) [13:10:33] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40373/console" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [13:16:23] (03PS2) 10JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) [13:16:51] Hi, anyone able to run a maint script for T332241? [13:16:52] T332241: fix Category namespace on gurwiki - https://phabricator.wikimedia.org/T332241 [13:16:57] (03PS3) 10Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/902449 [13:17:06] (03PS4) 10Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/902449 [13:17:37] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row B switches upgrade - T330165 [13:17:46] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [13:17:58] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgr... 
[13:18:44] (03PS29) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [13:20:01] (03CR) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:20:24] (03CR) 10Ssingh: [C: 03+2] hiera: temporarily removed dns1003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165) (owner: 10Ssingh) [13:21:14] (03PS1) 10Phuedx: MetricsPlatform: Fix ContextAttributesFactoryTest failing on prod branch [extensions/EventLogging] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291) [13:21:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:22] (03CR) 10Vgutierrez: "looking good, upload tests are happy," [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:21:42] koi: We're in the middle of a network maintenance in eqiad, can it wait until it's done? 
[13:22:02] definitely :) [13:22:13] !incidents [13:22:14] 3513 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:22:14] 3512 (RESOLVED) NELHigh sre (tcp.timed_out) [13:22:29] !ack 3513 [13:22:30] 3513 (ACKED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [13:22:38] hnowlan: is that you? [13:22:52] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ssingh) [13:24:07] I acked it btw [13:24:22] I see [13:24:31] you know whats going on akosiaris? [13:24:57] hnowlan is trying to increase capacity [13:24:57] jayme: not me afaik, looking [13:25:18] oh, my bad assumption then [13:25:33] note this is codfw, so nothing with the eqiad row B upgrade (which hasn't even started yet) [13:25:36] I did do a push earlier but at like 11:47 and it was rolled back [13:25:55] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10larissagaulia) >>! In T332868#8734469, @Ladsgroup wrote: > They are now in the list in https://ldap.toolforge.org/group/wmf >>User:Barakat Ajadi (more info) > > Is it fixed now? No, not yet. Let me w... [13:26:16] big spike in loads in codfw [13:26:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:28] qps up 4x [13:26:41] some upload? [13:26:48] matches slow probes [13:27:32] (03CR) 10Vgutierrez: [C: 04-1] Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:28:15] eqiad reads down to 0 also? 
[13:28:19] hnowlan: could it be just because eqiad got depooled and all traffic hits codfw now? [13:28:41] ah [13:28:43] ah yes! [13:28:49] that'd do it, lmao [13:28:51] I depooled eqiad for the row B upgrade [13:28:58] although it should be able to handle the traffic [13:28:59] fair enough :) [13:29:08] PROBLEM - Bird Internet Routing Daemon on dns1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:29:15] ^ expected [13:29:20] I was about to ask ;-) [13:29:32] same for durum1002 [13:30:11] errors are tapering off a bit but it's still high [13:30:16] guess it couldn't handle the surge [13:30:21] (03CR) 10Btullis: [V: 03+1 C: 03+2] Disable job submission to YARN queues to faciliatate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis) [13:30:30] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:30:39] hnowlan: ^ trying something [13:30:54] (03PS1) 10Slyngshede: C:idm::jobs absent permission sync. [puppet] - 10https://gerrit.wikimedia.org/r/903647 [13:31:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:31:35] We can maybe get away with depooling the two thumbor hosts that are in row B and repooling the service ? [13:31:48] experiment failed btw, reverting [13:32:51] claime: we could, but apparently it's fine now? 
[13:32:58] let's re-evaluate if it alerts again [13:33:02] yup [13:33:03] we're not really fine [13:33:12] partial repooling sgtm, will do [13:33:14] thisisfine.png [13:33:17] we're still high on 5xx [13:33:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet [13:33:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1002.eqiad.wmnet [13:34:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:34] yeah, probes are still pretty slow (and flaky) [13:34:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=thumbor,name=eqiad [13:35:45] btw. I really like the toggle switch in https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=now-1h&to=now :D [13:35:53] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:36:02] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:36:03] jayme: thumbor? yes. [13:36:11] yes, that one :D [13:36:13] i wish it worked for turning it off [13:36:20] x) [13:36:29] I literally just found two more problems with that dashboard during this, such a mess [13:36:47] I was looking at the error graph going "oh that's not too bad" [13:36:50] Then I saw it was log10 [13:36:56] +1 [13:37:11] did I do something wrong with conftool to repool there? not seeing anything coming in yet [13:37:26] hnowlan: the DC is depooled in discovery [13:37:29] oh [13:37:33] that'd do it [13:37:34] let me fix that [13:38:36] hnowlan: it's pooled [13:39:21] thanks! 
[13:39:22] (03PS1) 10Jbond: idp: failover to codfw for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/903648 [13:40:02] ehm, I wasn't clear. It's pooled without me doing anything [13:40:07] I am not sure it was ever depooled [13:40:24] it's not part of the sre.discovery.datacenter cookbook [13:40:39] oh, I need to pool swift [13:41:07] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto) [13:41:16] ohh [13:41:58] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad [13:41:59] it's a shame the scale up on k8s didn't work, it'd actually help a lot with this workload heh [13:42:05] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [13:42:23] hnowlan: check again ;-) [13:42:26] at least on eqiad [13:42:31] trick worked after all [13:42:38] just needed a bit of tickling [13:42:42] let me upload the patch [13:42:42] do we need to depool some swift hosts now from row B? [13:42:47] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) It seems there was another regular job was needed to sync users in ldap with grafana. @fgiunchedi thankfully did a manual kick: ` Amir1: Mar 28 13:39:18 grafana1002 grafana-ldap-users-sync[2... [13:42:53] akosiaris: oh damn, nice! 
[13:43:18] hnowlan: I 'll deploy this real quick to codfw first [13:43:20] if the same will work in codfw we could take some of the load on k8s right now, given that we're already erroring [13:43:30] yeah, that was my thinking [13:44:20] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10herron) [13:44:23] jayme: there's one ms-fe host that's depooled, there's apparently nothing to do for ms-be [13:44:24] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:44:29] wait thumbor needs to be tickled? [13:44:38] Amir1: shush [13:44:40] So I think we're ok on the swift front [13:44:44] claime: thanks for checking/knowing [13:44:57] jayme: I just checked the rowB task [13:45:10] still, thanks :p [13:45:13] ;) [13:45:25] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:45:40] (NodeTextfileStale) firing: Stale textfile for doh5002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:40] (NodeTextfileStale) firing: Stale textfile for aqs2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:40] (NodeTextfileStale) firing: Stale textfile for kafka-main2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:42] my job here never gets boring :D [13:45:45] (NodeTextfileStale) firing: Stale textfile for kubernetes2018:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:49] (NodeTextfileStale) firing: Stale textfile for doc2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:54] (NodeTextfileStale) firing: Stale textfile for schema2004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:45:55] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:45:58] (NodeTextfileStale) firing: (2) Stale textfile for elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:03] (NodeTextfileStale) firing: (5) Stale textfile for db1103:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:07] what are all the NodeTextfileStale ? 
[13:46:08] (NodeTextfileStale) firing: (2) Stale textfile for mw1352:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:08] some thumbor traffic coming back to eqiad [13:46:09] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto) [13:46:12] (NodeTextfileStale) firing: (6) Stale textfile for cloudcephmon1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:17] (NodeTextfileStale) firing: Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:21] (NodeTextfileStale) firing: Stale textfile for thumbor1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:26] (NodeTextfileStale) firing: Stale textfile for thanos-be1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:29] ah yeah I get it, I'll silence the alerts [13:46:31] (NodeTextfileStale) firing: Stale textfile for mc1047:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:31] jinxer 'bout to get floodkicked [13:46:35] (NodeTextfileStale) firing: Stale textfile for kafka-logging1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:38] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 249 hosts with reason: eqiad row B upgrade [13:46:40] (NodeTextfileStale) firing: Stale textfile for restbase1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:42] godog: I thought we fixed those stalefile alerts on cp nodes [13:46:49] (NodeTextfileStale) firing: Stale textfile for mw1396:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:54] (NodeTextfileStale) firing: Stale textfile for wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:46:56] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:46:58] (NodeTextfileStale) firing: Stale textfile for ganeti1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:03] (NodeTextfileStale) firing: (5) Stale textfile for an-presto1005:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:05] vgutierrez: that's unrelated [13:47:07] ack [13:47:07] (NodeTextfileStale) firing: Stale textfile for ms-be1045:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:11] hnowlan: 32 pods in codfw too [13:47:12] (NodeTextfileStale) firing: (2) Stale textfile for arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:19] (NodeTextfileStale) firing: (3) Stale textfile for durum3001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:20] truly sorry for the spam folks [13:47:20] should I depool swift in eqiad once more ? 
[13:47:23] (NodeTextfileStale) firing: Stale textfile for dns4003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:28] (NodeTextfileStale) firing: Stale textfile for cp6013:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:30] akosiaris: if it's easier or safer then go for it [13:47:33] (NodeTextfileStale) firing: Stale textfile for pki2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:37] (NodeTextfileStale) firing: Stale textfile for lvs2007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:37] I'll try pooling thumbor with a lowish weight [13:47:40] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [13:47:41] thumbor-k8s that is [13:47:42] (NodeTextfileStale) firing: Stale textfile for kubestagemaster2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:43] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=eqiad [13:47:46] (NodeTextfileStale) firing: (3) Stale textfile for mw1445:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:51] (NodeTextfileStale) firing: Stale textfile for aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:56] (NodeTextfileStale) firing: (2) Stale textfile for backup1006:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:47:56] !log depool swift in eqiad for row B upgrade [13:48:00] (NodeTextfileStale) firing: Stale textfile for cloudelastic1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:03] going for it, let's see [13:48:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=4; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [13:48:05] (NodeTextfileStale) firing: Stale textfile for an-druid1005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:10] (NodeTextfileStale) firing: Stale textfile for dse-k8s-etcd1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:30] o/ [13:48:34] PROBLEM - restbase endpoints health 
on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:52] akosiaris: Is thumbor eqiad going to hit swift codfw ? [13:49:15] claime: thumbor eqiad shouldn't be receiving traffic any time soon [13:49:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 249 hosts with reason: eqiad row B upgrade [13:49:38] (03PS1) 10Filippo Giunchedi: sre: ignore role_owner for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/903650 [13:49:38] akosiaris: Did you deppol it again ? [13:49:40] 5xx way down for codfw thumbor, but that's probably eqiad [13:49:43] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 host(s)... 
[13:49:46] claime: yes [13:49:48] ack [13:49:52] I didn't see in the flood [13:50:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli) [13:50:34] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:50:44] claime: AIUI swift is calling thumbor, so if swift is depooled in eqiad, thumbor won't get traffic [13:51:01] Ah yes it's that way around [13:51:03] gotcha [13:51:10] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: ignore role_owner for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/903650 (owner: 10Filippo Giunchedi) [13:51:12] that's why thumbor was never depooled (by the cookbook) but still did not get traffic [13:51:26] right yeah [13:51:33] (03PS30) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [13:51:38] o/ I could in theory run a maint script for koi now [13:51:43] but I assume I shouldn’t do that right now [13:51:47] thanks! [13:52:13] interestingly this graph was broken up until a few minutes ago and is the primary indicator for thumbor overload 🙈 https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-30m&orgId=1&refresh=30s&to=now&viewPanel=46 [13:52:15] claime: akosiari.s did just repool swift in eqiad again as I see it and not touched thumbor. 
So you did not miss anything in the flood [13:52:16] (03CR) 10Jbond: [C: 03+2] idp: failover to codfw for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/903648 (owner: 10Jbond) [13:52:28] (03CR) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:52:38] jayme: Yeah, what I missed was the actual request flow :') [13:52:41] jayme: and then I re-depooled it [13:52:46] it's depooled now btw [13:52:56] just to be clear [13:52:57] yes, yes [13:53:01] ok [13:53:27] I just understood that claime was asking you if you depooled thumbor in eqiad again and you answered "yes" [13:53:36] which is not straightaway correct :) [13:53:43] it's slow as hell as far as processing requests is concerned but thumbor-k8s is doing okay. Will tweak the weight a bit higher [13:53:48] hnowlan: does the tripling of pods in codfw help ? [13:53:53] !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [13:53:59] akosiaris: oh most definitely [13:54:03] Want me to push the governor change ? 
[13:54:04] ok, I must have some lag [13:54:04] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond) [13:54:12] It may help a tad [13:54:13] thanks for the update [13:54:20] thumbor in eqiad was serving 5xx errors at this weight on the previous setup [13:54:49] !log depool ms-fe1010 before switch work T330165 [13:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:56] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [13:55:45] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MatthewVernon) [13:55:57] (03CR) 10BBlack: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:56:16] (03PS1) 10Alexandros Kosiaris: thumbor: Set lower requests in pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/903651 [13:56:33] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis) [13:57:16] 10SRE, 10Traffic, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) 2.6.12 has been released https://www.mail-archive.com/haproxy@formilux.org/msg43371.html including the patch that we've been testing in text@ulsfo [13:58:09] !log depool thanos-fe1002 - T330165 [13:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] gonna try bumping weight up a bit again. 
This is less firefighting than experimentation now that we're safe, so if there are any concerns I can hold off [13:59:11] (03CR) 10Hnowlan: [C: 03+1] thumbor: Set lower requests in pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/903651 (owner: 10Alexandros Kosiaris) [13:59:24] hnowlan: Reiterating the offer to push the performance governor change to k8s nodes if you think that can help [13:59:33] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [14:00:03] claime: oops, sorry, forgot to reply - that's safe enough for the k8s nodes in general right? Couldn't hurt [14:00:16] yep, I don't really see what it could break [14:00:20] * kamila_ has been wondering about that one... thank you claime [14:00:28] let's go then [14:00:58] (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli) [14:01:12] (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: ensure that cpufrequtils is reloded on governor change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli) [14:01:18] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond) [14:01:36] (03CR) 10Jforrester: [C: 03+2] "We'll need to land this, then change the branch commit pointer to this new hash." 
[extensions/EventLogging] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291) (owner: 10Phuedx) [14:02:00] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond) [14:02:16] (03CR) 10Jbond: [C: 03+2] O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond) [14:02:25] (03PS31) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [14:02:27] Running puppet on kubernetes physical workers [14:03:06] <3 [14:03:58] (03CR) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [14:04:15] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10larissagaulia) 05Open→03Resolved Thanks, everyone. Mission accomplished. 
[14:04:26] (WidespreadPuppetFailure) firing: Puppet has failed on dse_k8s cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dse_k8s - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:04:45] <_joe_> elukey, klausman ^^ [14:04:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40375/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:05:02] Probably me [14:05:04] _joe_: [14:05:14] <_joe_> yeah without probably :) [14:05:16] I'll go check [14:05:26] (WidespreadPuppetFailure) firing: (2) Puppet has failed on ml_serve cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_serve - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:05:26] (WidespreadPuppetFailure) firing: Puppet has failed on kubernetes cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:05:58] !log reboot eqiad row B for upgrade - T330165 [14:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:06] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [14:06:19] I know what's going on, pushing fixc [14:06:26] (WidespreadPuppetFailure) firing: Puppet has failed on kubernetes-staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes-staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:06:40] claime: thx! 
[14:07:45] To be clear, it's just breaking puppet runs, nothing more [14:08:04] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.03224 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:08:12] yes, shush [14:08:26] (WidespreadPuppetFailure) firing: Puppet has failed on ml_staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:09:02] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:09:22] phab being down is expected, yes? iirc there was a reboot or something today.. [14:09:32] Hmm, I can't push to gerrit [14:09:35] some switch is being rebooted [14:09:38] PROBLEM - Host gerrit.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:09:50] TheresNoTime: Yes, that's expected [14:09:56] T330165 [14:09:56] ack [14:09:59] I have lost contint1002 as well (but that is not the primary [14:10:10] PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:10] Also expected [14:10:12] ok there will be some puppet failures, we'll fix them as soon as gerrit is up [14:10:25] puppet is disabled fleet wide anyway [14:10:25] But not being able to push to gerrit, idk why [14:10:26] (WidespreadPuppetFailure) firing: (2) Puppet has failed on kubernetes cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:10:27] I took you all down with me with my systemd vs systemctl [14:10:27] isn't redundancy between switches? :] [14:10:37] effie: yes, that was a tricky one [14:10:48] I can't push to gerrit too [14:10:53] hashar: there is, but are the services redundant? 
[14:10:55] Or use the gerrit REST API [14:10:57] yes.. [14:11:24] PROBLEM - configured eth on lvs1020 is CRITICAL: ens1f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:11:26] (WidespreadPuppetFailure) firing: (2) Puppet has failed on kubernetes-staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes-staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:11:47] XioNoX: don't worry :-] [14:11:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 212, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:11:50] I'll silence the puppet alerts [14:12:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 197, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:12:14] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, [14:12:14] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:12:17] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:12:17] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:12:23] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:12:26] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:12:32] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [14:12:45] (JobUnavailable) firing: (5) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:50] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 
170 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 176, active_shards: 176, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 170, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num [14:12:50] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.86705202312138 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:12:56] ACKNOWLEDGEMENT - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) Btullis T330165 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_ [14:12:56] a [14:12:58] (KubernetesCalicoDown) firing: (2) kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:00] PROBLEM - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:13:00] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 4 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:13:03] (KubernetesCalicoDown) firing: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:10] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy [14:13:52] ^ Amir1 not sure if ours, but to review later [14:13:56] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service,netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams_sync.service,netbox_ganeti_ulsfo_sync.service,netbox_report_coherence_rack_run.service,netbox_report_coherence_run.service,netbox_report_puppetdb_virtual_run.service https://w [14:13:56] wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:11] okay thanks [14:14:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s [14:14:34] Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s7_3317: Servers dbproxy1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:14:37] I think Manuel told me we need to reload haproxy, I'll do it once the maint is over [14:14:42] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task 
slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:14:46] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:50] (KubernetesCalicoDown) firing: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:57] (WidespreadPuppetFailure) firing: Puppet has failed on sessionstore cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:15:02] (WidespreadPuppetFailure) firing: Puppet has failed on puppet cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=puppet - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:15:07] (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:15:44] Amir1: correct [14:15:52] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:16:03] (WidespreadPuppetFailure) firing: Puppet has failed on wdqs 
cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:16:08] (WidespreadPuppetFailure) firing: Puppet has failed on prometheus cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:16:44] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [14:16:56] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:17:12] PROBLEM - Host analytics1069 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:21] taavi: as long as you oped, can you reop sirenbot ? [14:17:26] (WidespreadPuppetFailure) firing: Puppet has failed on thanos cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thanos - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:17:31] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:17:36] (WidespreadPuppetFailure) firing: (2) Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:17:38] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:17:40] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:17:41] (WidespreadPuppetFailure) firing: Puppet has failed on cache_text cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:17:45] (JobUnavailable) firing: (34) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:46] PROBLEM - Host 2620:0:861:2:208:80:154:134 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:58] (KubernetesCalicoDown) firing: (5) kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:17:58] (KubernetesCalicoDown) firing: (2) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:18:20] RECOVERY - Host 2620:0:861:2:208:80:154:134 is UP: 
PING OK - Packet loss = 0%, RTA = 0.31 ms [14:18:24] (KafkaUnderReplicatedPartitions) firing: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:18:26] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [14:18:26] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:18:36] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1094 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [14:18:38] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:18:40] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:19:00] RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:19:00] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable 
[14:19:05] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:19:06] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 176, active_shards: 352, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max [14:19:06] _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:19:08] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:19:09] (KubernetesCalicoDown) firing: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:19:16] RECOVERY - MariaDB Replica IO: analytics_meta on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:19:16] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:19:26] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:19:26] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:19:29] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert) [14:19:42] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01465 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:19:43] claime: gerrit is back [14:19:52] (in case the flood is floody) [14:20:26] (WidespreadPuppetFailure) firing: Puppet has failed on swift cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:28] (03Merged) 10jenkins-bot: MetricsPlatform: Fix ContextAttributesFactoryTest failing on prod branch [extensions/EventLogging] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291) (owner: 10Phuedx) [14:20:31] (WidespreadPuppetFailure) firing: Puppet has failed on memcached cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=memcached - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:36] (WidespreadPuppetFailure) firing: Puppet has failed on api_appserver cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:41] (WidespreadPuppetFailure) firing: Puppet has failed on ml_cache cluster no resources 
reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_cache - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:46] (WidespreadPuppetFailure) firing: Puppet has failed on kafka_test cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_test - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:50] (WidespreadPuppetFailure) firing: Puppet has failed on restbase cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=restbase - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:20:55] (WidespreadPuppetFailure) firing: Puppet has failed on kafka_jumbo cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_jumbo - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:00] my apologies for the spam [14:21:00] (WidespreadPuppetFailure) firing: Puppet has failed on relforge cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=relforge - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:05] (WidespreadPuppetFailure) firing: Puppet has failed on misc cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:09] kamila_: I know, thanks, I pushed my fix. 
[14:21:13] silencing [14:21:21] (03CR) 10Hnowlan: [C: 03+1] cpufrequtils: fix systemctl call [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert) [14:21:26] (WidespreadPuppetFailure) firing: Puppet has failed on ci cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ci - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:26] (WidespreadPuppetFailure) firing: Puppet has failed on webperf cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=webperf - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:31] (WidespreadPuppetFailure) firing: Puppet has failed on redis cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=redis - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:36] (WidespreadPuppetFailure) firing: Puppet has failed on etcd cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=etcd - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:36] (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: fix systemctl call [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert) [14:21:41] (WidespreadPuppetFailure) firing: Puppet has failed on eventschemas cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=eventschemas - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:45] (WidespreadPuppetFailure) firing: Puppet has failed on appserver cluster no resources reported - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:22:08] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [14:22:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:22:18] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:22:23] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:22:34] RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [14:22:35] (03PS1) 10Herron: icinga: remove widespread puppet agent alerts [puppet] - 10https://gerrit.wikimedia.org/r/903654 (https://phabricator.wikimedia.org/T288622) [14:22:45] (JobUnavailable) resolved: (34) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[14:23:03] (KubernetesCalicoDown) resolved: (2) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:23:07] (KubernetesCalicoDown) resolved: (5) kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:23:22] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:28] (KafkaUnderReplicatedPartitions) resolved: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:24:41] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:24:43] (03PS3) 10JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) [14:24:45] (03PS1) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) [14:24:45] (KubernetesCalicoDown) resolved: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:49] (KubernetesCalicoDown) resolved: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:52] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:25:13] (03CR) 10CI reject: [V: 04-1] k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:25:41] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=THANOS-FE-OLD-FQDN,service=thanos-web [14:25:53] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web [14:26:06] (03PS1) 10Jbond: Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564 [14:26:26] (03CR) 10Jbond: [C: 03+2] Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564 (owner: 10Jbond) [14:26:55] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [14:27:07] (03PS2) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) [14:27:15] (03PS1) 10Btullis: Revert "Disable job submission to YARN queues to faciliatate maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903565 [14:27:23] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40376/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:28:40] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:28:42] 
(03PS1) 10Ayounsi: Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165) [14:28:42] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:28:44] (03CR) 10Jbond: [C: 03+1] "LGTM, i didn't check the script as i assume that has already gone through review but say if not" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [14:28:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:28:55] (03PS2) 10Ayounsi: Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165) [14:29:36] (03PS1) 10Filippo Giunchedi: Revert "prometheus1006: depool from alertmanager" [puppet] - 10https://gerrit.wikimedia.org/r/903667 [14:29:36] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:29:52] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus1006: depool from alertmanager" [puppet] - 10https://gerrit.wikimedia.org/r/903667 (owner: 10Filippo Giunchedi) [14:30:06] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:30:24] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: 
Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:30:37] (03CR) 10Btullis: [C: 03+2] Revert "Disable job submission to YARN queues to faciliatate maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903565 (owner: 10Btullis) [14:30:46] I did a reload of haproxy on dbproxy1018 and 1019 [14:30:53] let's see [14:31:03] (03CR) 10Ssingh: [C: 03+2] Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165) (owner: 10Ayounsi) [14:31:52] !log run authdns-update to revert eqiad depool [14:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:21] (03PS1) 10Btullis: Revert "Disable the gobblin timers temporarily for switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903668 [14:32:27] !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 [14:32:34] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [14:32:50] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrad... [14:33:05] hm, 1014 and 1015 needs reload too [14:33:22] (03PS1) 10Ssingh: Revert "hiera: temporarily removed dns1003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/903669 [14:34:09] koi: what’s the maintenance script you needed to run for T332241 anyways? 
it’s not clear to me from the task [14:34:10] T332241: fix Category namespace on gurwiki - https://phabricator.wikimedia.org/T332241 [14:34:22] (03CR) 10Btullis: [C: 03+2] Revert "Disable the gobblin timers temporarily for switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903668 (owner: 10Btullis) [14:34:24] (03PS1) 10Andrew Bogott: Revert "clouddumps: make clouddumps1002 the primary during switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903670 [14:34:26] (03PS1) 10Jforrester: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) [14:34:28] Lucas_WMDE, it's "mwscript maintenance/namespaceDupes.php --wiki gurwiki" [14:34:34] (it looks like the eqiad row B maintenance is still ongoing, but in principle I could run a maint script after that – I also have a change I’d be interested in backporting) [14:34:47] (03CR) 10Jforrester: [C: 03+2] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester) [14:35:05] (03CR) 10Andrew Bogott: [C: 03+2] Revert "clouddumps: make clouddumps1002 the primary during switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903670 (owner: 10Andrew Bogott) [14:35:12] (03Abandoned) 10Jforrester: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [14:35:16] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:20] did we backport the fix Taavi did yesterday? 
I didn't see the backport [14:35:24] RECOVERY - Bird Internet Routing Daemon on dns1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:35:34] on namespaceDupes [14:35:49] * Lucas_WMDE doesn’t know anything about that [14:36:11] (what I wanted to backport was the SpecialRecentChangesLinked query() fix) [14:36:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney can we do this on Thursday ? Can we also do the other batches(3-4) on the same day? [14:36:53] yes I backported it [14:37:01] yeah I can see it on wmf.1 [14:37:06] (and REL1_40 too) [14:37:08] oh thanks [14:37:34] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily removed dns1003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/903669 (owner: 10Ssingh) [14:37:48] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:53] (03PS2) 10Jbond: Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564 [14:38:07] (03PS4) 10JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) [14:38:08] (03PS3) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) [14:38:29] !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [14:38:34] RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running 
[14:39:08] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:25] (03CR) 10Jbond: [C: 03+2] O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond) [14:40:32] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40377/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:40:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor100[12].eqiad.wmnet [14:41:14] (^ restoring ineffective change from during the depool) [14:41:15] (03PS1) 10Herron: alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658 [14:41:39] (03CR) 10CI reject: [V: 04-1] alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658 (owner: 10Herron) [14:42:18] RECOVERY - configured eth on lvs1020 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [14:42:56] (03PS2) 10Herron: alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658 [14:43:12] jbond: hi! possible puppet failure: https://puppetboard.wikimedia.org/report/dns1001.wikimedia.org/810719d816acdcfa7d86149dfa2c240d195ab40a ? 
[14:46:10] (03PS1) 10Bking: rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) [14:46:37] !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [14:47:36] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:01] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrad... [14:48:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto) [14:48:57] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [14:49:47] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:49:57] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [14:50:14] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [14:50:28] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) [14:50:38] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) 
[14:50:48] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:51:11] (03Merged) 10jenkins-bot: Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto) [14:51:20] (03CR) 10Herron: "Not a ton of documentation I could find about extending silence history, but this looked promising" [puppet] - 10https://gerrit.wikimedia.org/r/903658 (owner: 10Herron) [14:51:55] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165 [14:52:02] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [14:52:53] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester) [14:52:57] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=device-analytics [14:53:26] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=device-analytics,name=eqiad [14:53:37] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=device-analytics,name=pki [14:53:53] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=pki,name=eqiad [14:54:21] interesting that SAL log show up despite the action being wrong [14:54:23] anyway [14:54:52] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=codfw [14:54:54] (03Merged) 10jenkins-bot: rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:55:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:55:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:57:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [14:58:55] Kudos to everyone involved in the switches upgrade for minimal downtime of Phab, Gerrit, Hadoop cluster, etc. 👏 [14:59:29] (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [14:59:57] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) The switch upgrade itself went smoothly as well, like the other rows. One issue was that gerrit1001 was missing from the list. This is because th... [15:01:21] (03CR) 10Jaime Nuche: "Thanks for creating the branch, I'll rerun the train presync." 
[core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester) [15:03:43] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208) [15:03:45] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [15:05:11] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot) [15:05:33] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2 refs T330208 [15:05:37] !log jnuche@deploy2002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.41.0-wmf.2" --no-progress --store-class=LCStoreCDB --threads=30 --lang en --quiet ' returned non-zero exit status 1. (duration: 00m 03s) [15:05:39] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [15:07:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host an-test-client1002.eqiad.wmnet with OS bullseye [15:08:38] !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [15:13:47] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) a:03Ladsgroup Thanks. I'm clinic duty this week. 
[15:14:23] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [15:14:36] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005868 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:15:16] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:15:19] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [15:17:33] (03PS1) 10Jbond: P:ldap::bitu: make the group configurable [puppet] - 10https://gerrit.wikimedia.org/r/903665 [15:18:19] (03PS1) 10Cwhite: logstash: remove envoy deprecated options spamfilter [puppet] - 10https://gerrit.wikimedia.org/r/902625 (https://phabricator.wikimedia.org/T320468) [15:18:41] (03PS1) 10Ayounsi: Add role_contacts to buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/903686 [15:19:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi) [15:19:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40378/console" [puppet] - 10https://gerrit.wikimedia.org/r/903665 (owner: 10Jbond) [15:19:42] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004401 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:19:58] (03CR) 10Jbond: [C: 03+2] P:ldap::bitu: make the group configurable [puppet] - 10https://gerrit.wikimedia.org/r/903665 (owner: 10Jbond) [15:20:08] jouncebot: nowandnext [15:20:08] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [15:20:09] In 0 hour(s) and 39 
minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1600) [15:20:11] (03PS1) 10Filippo Giunchedi: sre: introduce cluster vs site wide puppet failures [alerts] - 10https://gerrit.wikimedia.org/r/903687 (https://phabricator.wikimedia.org/T294564) [15:20:20] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2 refs T330208 [15:20:25] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [15:20:29] any objections to me running an mw backport and a maintenance script? [15:20:38] hm, well, maybe not while jnuche is scapping wmf.2 w^ [15:20:40] *^^ [15:21:32] Lucas_WMDE: yeah, the presync failed last night so I'm rerunning manually [15:21:53] it can take a bit, sorry for the inconvenience [15:21:53] ok [15:22:07] no big deal, don’t think either of the things I wanted to do is urgent [15:22:50] (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [15:22:57] but it sounds like there might be no time before the puppet window, and I won’t be around after that – koi, if you’re still online for the late backport window, perhaps add your maintenance script run there [15:23:19] (I’ll only be around again for tomorrow’s UTC afternoon window, I think) [15:23:35] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/output/903686/40379/" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi) [15:24:08] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) [15:24:18] (03CR) 10Jelto: [C: 03+1] "lgtm, however I'm not sure what happens if we run multiple aphlict instances in eqiad at once. Do you have a plan for that? 
Will aphlict10" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [15:25:11] (03CR) 10Dzahn: "I really don't think after Andrea did all the work to create doc machines that we should introduce further complication to _avoid_ switchi" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [15:25:18] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [alerts] - 10https://gerrit.wikimedia.org/r/903687 (https://phabricator.wikimedia.org/T294564) (owner: 10Filippo Giunchedi) [15:27:29] (03CR) 10Dzahn: [C: 03+2] zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [15:27:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi) [15:29:02] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:22] (03PS5) 10Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/902449 [15:35:58] (03CR) 10Dzahn: "I can't ssh to deploy-1002.devtools right now for some reason, will try again later." 
[puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [15:36:26] (03CR) 10Raymond Ndibe: maintain-dbusers: only-users match tool users with or without prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro) [15:37:38] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200*.codfw.wmnet [15:38:03] (03CR) 10Herron: "adding volans for awareness and in case there are references to alert1001 outside puppet to account for" [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [15:38:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet [15:40:42] (03PS1) 10DCausse: rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675) [15:43:32] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:46:46] PROBLEM - Host cp1082 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:22] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:10] (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: ensure Docker is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [15:48:16] PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:48:19] !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0]: Regular analytics weekly train [analytics/refinery@6554ec0] [15:48:28] (03Merged) 10jenkins-bot: rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:49:42] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:50:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet [15:50:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:50:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:51:08] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [15:53:44] !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0]: Regular analytics weekly train [analytics/refinery@6554ec0] (duration: 05m 24s) [15:54:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [15:54:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [15:55:22] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:55:38] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw 
virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:55:42] !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0] (thin): Regular analytics weekly train THIN [analytics/refinery@6554ec0] [15:55:48] (03CR) 10Volans: "Thanks for the heads up." [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [15:55:51] !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0] (thin): Regular analytics weekly train THIN [analytics/refinery@6554ec0] (duration: 00m 08s) [15:55:59] !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6554ec0] [15:56:45] (03PS1) 10Ladsgroup: admin: Add Oleksandr Tsyba to ldap [puppet] - 10https://gerrit.wikimedia.org/r/903691 (https://phabricator.wikimedia.org/T333157) [15:57:31] !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6554ec0] (duration: 01m 32s) [15:57:50] (03PS1) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 [15:58:34] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) Hi, I guessed your email address via the WMDE's email pattern, can you please confirm this? https://gerrit.wikimedia.org/r/c/operations... [15:59:57] (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond) [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:32] !log bking@cumin1001 unban elastic and cloudelastic nodes post maintenance T330165 [16:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:42] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [16:02:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:24] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10ayounsi) See guidelines on https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Servers_uplinks but it's usually not worth it. We only... [16:03:30] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) After significantly increasing capacity in thumbor-k8s, we serv... 
[16:03:37] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) 05Open→03Resolved [16:03:43] (03PS1) 10Jelto: aphlict: pass ensure flags to logrotate timer [puppet] - 10https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869) [16:03:55] (03PS2) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) [16:03:57] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1082.eqiad.wmnet,service=cdn [16:03:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1082.eqiad.wmnet,service=ats-be [16:04:13] (03CR) 10EoghanGaffney: [V: 03+1] Add aphlict role to new vm host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [16:04:46] (03PS9) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [16:04:56] (03CR) 10Herron: alerting_host: failover icinga and alertmanger from eqiad to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron) [16:05:19] (03PS3) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) [16:05:23] (03CR) 10Volans: "addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [16:07:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:19] (03PS9) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [16:09:15] !log reboot cp1082 (NIC issues) [16:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:12] !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.2 refs T330208 (duration: 49m 52s) [16:10:18] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [16:14:46] RECOVERY - Host cp1082 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:18:11] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/903697 [16:19:41] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) [16:19:42] (CirrusSearchNodeIndexingNotIncreasing) resolved: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:20:14] (03PS10) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) [16:22:08] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/903697 (owner: 10Volans) [16:22:26] (WidespreadPuppetFailure) resolved: Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed 
- https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:27:41] (WidespreadPuppetFailure) firing: (2) Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:31:54] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Krd) https://commons.wikimedia.org/wiki/... [16:32:59] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Krd) https://commons.wikimedia.org/wiki/... 
[16:34:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:37] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [16:36:59] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon) [16:43:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:47:26] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:52:16] !log uploaded spicerack_6.4.0 to apt.wikimedia.org bullseye-wikimedia (but I'll deploy it to the cumin hosts tomorrow) [16:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=cdn [16:55:17] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=ats-be [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1700) [17:02:21] !log dcausse@deploy2002 
helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:02:41] (WidespreadPuppetFailure) resolved: Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:02:42] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:05:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:07:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:10:37] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10herron) 05Open→03Declined Thanks, fwiw I added a talk topic on wiki in hopes that link redundancy can be explored the next time switch upgrades/... 
[17:12:26] (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:16:44] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:16:54] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:17:09] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) We will have to first upgrade the firmware on this server . Most of the time the firmware upgrade might help on 1 - resolving this issue 2- providing also in the idrac l... [17:17:26] (WidespreadPuppetFailure) firing: Puppet has failed on prometheus cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:19:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Papaul) a:05Cmjohnson→03Papaul [17:19:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) a:05Cmjohnson→03Papaul [17:20:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:26] (WidespreadPuppetFailure) resolved: Puppet has 
failed on prometheus cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:57:48] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [17:57:55] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:00:05] dduvall and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1800). [18:00:24] o/ [18:06:52] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 186846552 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:08:36] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) hi @Ottomata - yes they are two supersets i need to get into [[ https://superset.wikimedia.org/superset/dashboard/riskobservatory |1 ]] & [[ https://... 
[18:08:48] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 797408 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:21:59] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:23:54] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new db nodes - pt1979@cumin2002" [18:25:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new db nodes - pt1979@cumin2002" [18:25:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:28:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:28:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:32:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED [18:33:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:36:47] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@0f1c9e8]: Deploy latest image_suggestions on platform_eng Airflow instance [18:37:07] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@0f1c9e8]: Deploy latest image_suggestions on platform_eng Airflow instance (duration: 00m 20s) [18:37:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:38:54] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 
10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10cmooney) Yeah I tend to agree, with one top-of-rack switch two connections only protects against link failure (as they both land on the same switch)... [18:39:17] (03CR) 10Raymond Ndibe: maintain-dbusers: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [18:40:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:41:32] (03PS5) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [18:42:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:43:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [18:45:34] 10SRE, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall) [18:46:21] 10SRE, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall) 05Open→03Stalled p:05Medium→03Low [18:57:51] (03PS3) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 [18:59:57] (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond) [19:13:49] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10herron) >>! In T333371#8736041, @cmooney wrote: > In the case of a server failure do the alert hosts fail over? Not automatically at the present... [19:15:26] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
All the ops I was trying on netbox-next are working with the latest patchset." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [19:16:36] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) It is correct email address [19:19:27] !log dduvall@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2 refs T330208 [19:19:34] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [19:23:16] (03CR) 10Cathal Mooney: "In genrnal the approach here looks ok to me. I'm not overly familiar with the existing puppet profile for the Bird config, but as it's ba" [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [19:24:40] (03CR) 10Cathal Mooney: "I'll leave it to Arzhel to +1 as he's the most knowledgeable on the Bird Anycast vars. But for my part happy for this to be merged and pr" [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [19:26:51] !log dduvall@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.2 refs T330208 (duration: 07m 24s) [19:26:57] T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208 [19:29:16] !log dduvall@deploy2002 Pruned MediaWiki: 1.40.0-wmf.27 (duration: 02m 11s) [19:29:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8735024, @Papaul wrote: > @cmooney can we do this on Thursday ? Can we also do... 
[19:39:06] (03PS4) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 [19:41:08] (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond) [19:44:18] jouncebot: now [19:44:19] For the next 0 hour(s) and 15 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1800) [19:49:49] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Eevans) [19:50:49] (03PS1) 10Bartosz Dziewoński: Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684 [19:51:39] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Eevans) [19:52:12] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Language-setup, 10Patch-For-Review: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915 (10BCornwall) [19:52:48] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall) [19:54:21] 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall) [19:54:42] 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall) I've updated the description to accurately reflect the current issues. Note that per T230382 there are no longer minnan/zh-cfr aliases. 
[19:56:50] (03PS5) 10BCornwall: Add redirects for https://nan.wik{tionary,iquote,ibooks,isource}.org [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T86915) (owner: 10Fomafix) [19:58:42] (03CR) 10Volans: setup.py: update dnspython requierments to match spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/903734 (owner: 10Jbond) [19:59:14] (03PS1) 10Bartosz Dziewoński: Enable hidden tag for "Edit Check" project on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733) [19:59:26] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40383/console" [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T86915) (owner: 10Fomafix) [19:59:32] jouncebot: next [19:59:32] In 0 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000) [19:59:39] i have some patches, one sec :) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. 
[20:01:08] o/ [20:01:08] updated: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000 [20:01:12] MatmaRex: if you've patches, i can deploy for you tonight :) [20:01:44] (go ahead) [20:01:49] thanks [20:01:53] (03CR) 10Urbanecm: [C: 03+2] Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684 (owner: 10Bartosz Dziewoński) [20:01:59] (03CR) 10Urbanecm: [C: 03+2] Change name of the editcheck-needreference tag to editcheck-references [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903685 (owner: 10Bartosz Dziewoński) [20:02:22] MatmaRex: the config patch seems to depend on the backport(s). is that right? [20:02:38] yes. i can't really test, since the feature isn't deployed anywhere yet [20:02:46] okay [20:02:47] but we wanted it to roll out with the train this week [20:03:04] so then it'd be at testwiki by now (that has wmf.2 now)? [20:03:15] yes [20:03:21] ok [20:03:29] i guess if both the backport and the config are deployed, i could test it there [20:03:33] (03CR) 10Cathal Mooney: "Ought to work well. In terms of naming I think we should make it clear that 185.15.57.24/29 is for public vips. 
We can assign private VI" [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [20:04:29] okay, i can do both at once, no problem [20:07:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:08:42] 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall) [20:09:03] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:08] 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall) Further trimmed some stuff as T173966 is tracking the redirects. [20:16:22] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:17:32] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:36] (03Merged) 10jenkins-bot: Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684 (owner: 10Bartosz Dziewoński) [20:18:42] (03Merged) 10jenkins-bot: Change name of the editcheck-needreference tag to editcheck-references [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903685 (owner: 10Bartosz Dziewoński) [20:18:46] (03PS6) 10BCornwall: Add nan to zh-min-nan redirects [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix) [20:19:52] (03PS7) 10BCornwall: Add nan to zh-min-nan redirects [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix) [20:21:58] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:22:50] RECOVERY - BFD 
status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:23:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:23:28] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10lmata) @cmooney thank you! [20:24:59] (03CR) 10BCornwall: [V: 03+1 C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix) [20:27:08] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e6febfd]: increase dynamic partition limit when importing cirrus indexes [20:27:22] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e6febfd]: increase dynamic partition limit when importing cirrus indexes (duration: 00m 13s) [20:31:17] urbanecm: the backports merged btw [20:31:35] MatmaRex: thanks for the ping & apologies, i somewhat totally missed that. [20:31:43] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration: Like nan.wikipedia.org, redirect other nan.*.org to the proper zh-min-nan.*.org domains - https://phabricator.wikimedia.org/T173966 (10BCornwall) 05Open→03Resolved a:03BCornwall Thank you for the patch and for your patience, @Fomafix!... [20:32:05] :D easy thing to do when it takes half an hour [20:32:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:32:19] i only checked just now myself [20:32:35] yup yup. scap'll ping once config+backports are at mwdebug. 
[20:32:54] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10SRE Observability, 10observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (10lmata) [20:33:13] (03Merged) 10jenkins-bot: Enable hidden tag for "Edit Check" project on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński) [20:33:31] 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Observability-Alerting, 10observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (10lmata) [20:34:22] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]] [20:34:28] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:34:37] MatmaRex: can you try to test now? :) [20:34:57] yeah [20:37:01] ughhhh testwiki has some edit filters that are preventing me from editing. need a minute [20:37:42] (03CR) 10Cathal Mooney: Remove EventGate Icinga checks that have been moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [20:37:56] (03CR) 10Cathal Mooney: [C: 03+2] Remove Eventlogging prometheus-based Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [20:38:05] 10SRE, 10Observability-Logging, 10Release-Engineering-Team, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (10lmata) thanks @colewhite! 
[20:38:26] MatmaRex: i disabled the filter you were hitting. it was marked as "testing" and untouched since '21, so should be fine. [20:39:42] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Logging, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171 (10lmata) [20:41:16] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Alerting, and 2 others: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10lmata) [20:41:32] MatmaRex: and apologies, i pinged too early... seems it's not ready yet, it only started pulling it to mwdebug :-/ [20:42:05] thanks, i was just trying to figure out why it didn't work [20:42:08] no problem [20:46:55] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Lionel_Scheepmans) Hi folks. I'm in front of a very strange phenomenon probably linked to this bug, and this time it concerns a PDF File. So. Go to... [20:49:22] ...the new scap backport sometimes does take a while [20:49:40] 10SRE, 10DNS, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall) [20:51:10] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:51:16] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [20:51:17] finally! 
[20:51:20] MatmaRex: now it should work [20:52:37] heh [20:53:35] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite) [20:53:55] and it does! thanks urbanecm [20:54:32] awesome! syncing [20:56:18] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10colewhite) [20:56:35] (03PS1) 10Herron: grizzly: adapt slo dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895) [20:57:22] (03PS1) 10BCornwall: pybal: Add runbook link to alert [alerts] - 10https://gerrit.wikimedia.org/r/903777 (https://phabricator.wikimedia.org/T310933) [21:03:15] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]] (duration: 28m 53s) [21:03:21] T324733: Introduce a tag to identify edits that meet the Edit Check heuristic - https://phabricator.wikimedia.org/T324733 [21:03:22] MatmaRex: finally live. thanks for your patience [21:03:27] anything else? 
[21:03:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [21:03:46] thanks urbanecm [21:03:51] any time [21:04:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [21:05:12] !log phedenskog@deploy2002 Started deploy [performance/navtiming@4d22874]: (no justification provided) [21:05:18] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@4d22874]: (no justification provided) (duration: 00m 06s) [21:05:33] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Create and deploy per-CDN-site DNS domains - https://phabricator.wikimedia.org/T332025 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks @JameelKaisar for the patch! Looks like this is resolved. If this was in error, please feel f... [21:05:35] 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BCornwall) [21:06:07] !log updating image_suggestions default table TTL(s) from 1209600 to 1814400 (seconds) — T333319 [21:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:12] T333319: Increase TTL in Cassandra image_suggestions keyspace to 3 weeks - https://phabricator.wikimedia.org/T333319 [21:07:07] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Create and deploy per-CDN-site DNS domains - https://phabricator.wikimedia.org/T332025 (10BCornwall) a:05BCornwall→03JameelKaisar [21:07:59] 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BCornwall) Hi, @CDanis. Thanks for creating this ticket. Would you mind expanding on the nature of the report? Thanks! 
[21:10:03] (03PS1) 10Bartosz Dziewoński: Enable history page visual diffs on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903780 (https://phabricator.wikimedia.org/T314588) [21:10:05] (03PS1) 10Bartosz Dziewoński: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781 [21:13:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [21:15:07] (03CR) 10Cwhite: [C: 03+2] logstash: remove envoy deprecated options spamfilter [puppet] - 10https://gerrit.wikimedia.org/r/902625 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite) [21:16:18] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [21:20:04] RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:55] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9b31c6b]: correct mw_sql_to_hive.py cli arguments [21:23:09] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@9b31c6b]: correct mw_sql_to_hive.py cli arguments (duration: 00m 13s) [21:25:48] PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:05] (03CR) 10Subramanya Sastry: [C: 03+1] Enabled native gallery editing in Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) (owner: 10Arlolra) [22:15:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10wiki_willy) Hi @MatthewVernon - for additional context, in the past we've seen drive failure issues being resolved after upgrading 
the firmware. Sometimes, old firmware causes is... [22:17:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED [22:23:25] 10SRE-Sprint-Week-Sustainability-March2023, 10MediaWiki-General, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: MediaWiki log spam during row D blip / rack D2 unavailable - https://phabricator.wikimedia.org/T233739 (10lmata) Adding back #observability-logging which is a component tag within #... [22:29:30] (03CR) 10Cwhite: [C: 03+2] logstash: add grafana-server ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901642 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:32:06] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 56.91 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [22:33:28] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:33:41] (03CR) 10EoghanGaffney: [C: 03+1] aphlict: pass ensure flags to logrotate timer [puppet] - 10https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869) (owner: 10Jelto) [22:36:24] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for gerrit1003 - pt1979@cumin2002" [22:39:44] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.656 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005 [22:42:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for gerrit1003 - pt1979@cumin2002" [22:42:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:43:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
db1207.mgmt.eqiad.wmnet with reboot policy FORCED [22:43:44] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) Hi @RLazarus Apologies for the radio silence, I'm now circling back to this, and as I review, I have one or two questions :D. Will we file a retroactive incident... [22:44:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [22:48:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) While running the provision cookbook on 2 of the db nodes (db1206 and db1207) and gerrit1003 i am getting the error . ` Raised while handling: The `choices` argument is empty and... [22:51:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED [22:53:40] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) [22:57:58] (03CR) 10Dzahn: [C: 03+1] "I also think the risk is low (since the discovery name is used that just points to the current host). If you wanted to be even more carefu" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [22:59:17] (03CR) 10Dzahn: [C: 03+2] "confirmed noop on production deploy servers and on deploy-1004.devtools there is now docker installed." 
[puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [23:00:09] jouncebot: nowandnext [23:00:09] No deployments scheduled for the next 6 hour(s) and 59 minute(s) [23:00:09] In 6 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T0600) [23:10:08] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:15:08] (03PS1) 10Dzahn: alertmanager: delete unused serviceops-collab receivers [puppet] - 10https://gerrit.wikimedia.org/r/903792 (https://phabricator.wikimedia.org/T329587) [23:19:16] (03PS1) 10Zabe: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) [23:20:08] (03CR) 10CI reject: [V: 04-1] Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:24:03] (03PS1) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [23:24:24] (03PS2) 10Zabe: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) [23:24:30] (03CR) 10Zabe: [C: 03+2] Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:25:18] (03Merged) 10jenkins-bot: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:25:30] (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902788 
(https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [23:25:48] (03CR) 10Dzahn: [C: 03+2] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [23:27:06] !log central Kurdish Wiktionary (ckbwiktionary) [23:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:28] (03CR) 10Dzahn: [C: 03+2] alertmanager: delete unused serviceops-collab receivers [puppet] - 10https://gerrit.wikimedia.org/r/903792 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [23:27:53] !log zabe@deploy2002 Started scap: T331831 [23:28:01] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:29:04] zabe, is cbkwiktionary happening now? [23:29:33] (03PS1) 10Zabe: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) [23:29:45] Jhs: yes [23:29:53] niice [23:30:15] (03CR) 10CI reject: [V: 04-1] Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:34:55] !log zabe@deploy2002 Finished scap: T331831 (duration: 07m 01s) [23:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:35:01] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:36:01] (03PS2) 10Zabe: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) [23:36:12] (03CR) 10Zabe: [C: 03+2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) 
(owner: 10Zabe) [23:37:16] (03Merged) 10jenkins-bot: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:38:14] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714 [23:38:16] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714 (owner: 10Zabe) [23:39:00] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714 (owner: 10Zabe) [23:39:29] !log zabe@deploy2002 Started scap: T331831 [23:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:44:13] zabe, would you be able to fix T332380 for anpwiki (and preferably ckbwiktionary too if possible) btw? The lack of RESTBase is causing a lot of problems [23:44:14] T332380: Add anpwiki to RESTBase - https://phabricator.wikimedia.org/T332380 [23:45:03] I can write the necessary patch, but I don't have the necessary powers to deploy it [23:45:46] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([mw1351.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [23:45:53] could you ping whoever does? 
[23:46:20] !log zabe@deploy2002 Finished scap: T331831 (duration: 06m 50s) [23:46:26] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:46:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:48:36] (03PS1) 10Dzahn: noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) [23:48:57] (03CR) 10CI reject: [V: 04-1] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [23:49:50] I added hno_wlan to the patch, they are usually quite fast at getting those deployed [23:51:14] zabe, great, thanks [23:51:25] (03PS2) 10Dzahn: noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) [23:51:40] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [23:51:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:57:36] 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm confirmed with Sukhe that it was depooled. worked remotely with Papaul to update the idrac and the bios. 
[23:57:44] (03PS1) 10Zabe: throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 [23:58:14] (03CR) 10Zabe: [C: 03+2] throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe) [23:58:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe) [23:58:58] (03Merged) 10jenkins-bot: throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe) [23:59:23] !log zabe@deploy2002 Started scap: Backport for [[gerrit:903803|throttle: Remove expired throttle]]