[00:47:04] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:16:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:16:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.730 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:45] (03PS2) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) [02:25:47] (03PS1) 10Andrew Bogott: oslo_messaging_rabbit: kombu_reconnect_delay=0.1 [puppet] - 10https://gerrit.wikimedia.org/r/864321 (https://phabricator.wikimedia.org/T318816) [02:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:54:06] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:56:08] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [04:39:38] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [04:41:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:13:44] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:14:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:17:18] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on 
January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:19:18] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:31:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:39:28] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:02:53] (03PS1) 10AndyRussG: CentralNotice: Add wmflabs to banner preview CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) [06:12:03] (03PS1) 10Marostegui: db2173: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864328 (https://phabricator.wikimedia.org/T322988) [06:14:09] (03CR) 10Marostegui: [C: 03+2] db2173: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864328 (https://phabricator.wikimedia.org/T322988) (owner: 10Marostegui) [06:14:57] 10SRE, 10ops-codfw, 10DBA, 10Patch-For-Review: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) Host being repooled automatically. 
[06:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P42217 and previous config saved to /var/cache/conftool/dbconfig/20221205-061616-marostegui.json [06:16:18] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:16:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 1%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42218 and previous config saved to /var/cache/conftool/dbconfig/20221205-061625-root.json [06:17:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P42219 and previous config saved to /var/cache/conftool/dbconfig/20221205-061735-root.json [06:19:02] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:27:36] (03PS1) 10Marostegui: instances.yaml: Add db1206 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/864329 [06:28:55] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1206 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/864329 (owner: 10Marostegui) [06:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 to dbctl (depooled)', diff saved to https://phabricator.wikimedia.org/P42220 and previous config saved to /var/cache/conftool/dbconfig/20221205-063020-marostegui.json [06:31:31] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db2173 (re)pooling @ 5%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42221 and previous config saved to /var/cache/conftool/dbconfig/20221205-063130-root.json [06:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P42222 and previous config saved to /var/cache/conftool/dbconfig/20221205-063240-root.json [06:35:57] (03PS1) 10Marostegui: db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864330 [06:37:13] (03CR) 10Marostegui: [C: 03+2] db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864330 (owner: 10Marostegui) [06:37:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with minimal weight', diff saved to https://phabricator.wikimedia.org/P42223 and previous config saved to /var/cache/conftool/dbconfig/20221205-063743-marostegui.json [06:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42224 and previous config saved to /var/cache/conftool/dbconfig/20221205-064635-root.json [06:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P42225 and previous config saved to /var/cache/conftool/dbconfig/20221205-064745-root.json [06:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with minimal weight', diff saved to https://phabricator.wikimedia.org/P42226 and previous config saved to /var/cache/conftool/dbconfig/20221205-065151-marostegui.json [07:01:41] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db2173 (re)pooling @ 25%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42227 and previous config saved to /var/cache/conftool/dbconfig/20221205-070140-root.json [07:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P42228 and previous config saved to /var/cache/conftool/dbconfig/20221205-070250-root.json [07:16:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42229 and previous config saved to /var/cache/conftool/dbconfig/20221205-071645-root.json [07:17:15] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: allow logging ECS to a local rsyslog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864547 (https://phabricator.wikimedia.org/T265876) [07:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P42230 and previous config saved to /var/cache/conftool/dbconfig/20221205-071754-root.json [07:22:18] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow rsyslog to process the apache logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) [07:23:30] (03CR) 10CI reject: [V: 04-1] mediawiki: allow rsyslog to process the apache logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [07:25:12] 10SRE, 10MW-on-K8s, 10observability, 10serviceops, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Joe) The two attached patches implement proposal #3 Now we just need to create the appropriate topic, named `mediawiki.httpd.accesslog` on both ka... 
[07:31:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42231 and previous config saved to /var/cache/conftool/dbconfig/20221205-073150-root.json [07:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P42232 and previous config saved to /var/cache/conftool/dbconfig/20221205-073259-root.json [07:38:53] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [07:39:43] (03PS1) 10Kosta Harlan: Fix ExpensiveUserImpact input validation [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312) [07:39:56] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: replace thanos-sso with thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [07:42:47] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [07:44:00] (03PS31) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42233 and previous config saved to /var/cache/conftool/dbconfig/20221205-074655-root.json [07:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 
100%: After schema change', diff saved to https://phabricator.wikimedia.org/P42234 and previous config saved to /var/cache/conftool/dbconfig/20221205-074804-root.json [07:56:53] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1003.eqiad.wmnet,service=thanos-web [07:57:00] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web [07:58:11] (03PS1) 10Abijeet Patro: Deprecate PersonalUrls hook [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017) [08:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T0800). [08:00:04] kart_ and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:38] hi [08:00:40] * kart_ is here [08:00:53] I'll be back in ~10 minutes, so kart_ you should go ahead with your patches [08:01:21] kostajh: Sure. 
[08:01:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [08:02:27] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation for 15 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [08:02:48] !log kartik@deploy1002 Started scap: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]] [08:02:54] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:02:54] T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825 [08:05:06] !log restarting blazegraph on wdqs1004 (stuck with 2000+ threads, T242453) [08:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:10] T242453: Detect and alert and/or remediate Blazegraph deadlocks - https://phabricator.wikimedia.org/T242453 [08:06:53] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.094 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:07:01] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:08:54] (03CR) 10Abijeet Patro: "Needed for: Iecec234232f2a17e528625b2e21155fc66b5f30b" [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017) (owner: 10Abijeet Patro) [08:10:12] (back) [08:11:39] Scap seems slow? 
[08:11:52] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:11:56] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:11:56] T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825 [08:11:57] Probably building the mw docker image [08:12:09] (03Abandoned) 10Abijeet Patro: Deprecate PersonalUrls hook [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017) (owner: 10Abijeet Patro) [08:12:50] Or had it already passed that step kart_? [08:13:49] yeah. docker image build seems slow. Now, deploying.. [08:14:14] kart_: it took ~3 minutes [08:14:22] You can check /home/kartik/scap-image-build-and-push-log [08:14:50] out of curiosity, is that docker image used somewhere? [08:15:08] kostajh: X-mw-debug set to k8s-experimental [08:15:15] ack [08:15:22] It sends you to a k8s deployment of mediawiki [08:16:19] We're currently working on all that mediawiki on kubernetes jazz :) [08:17:32] Nice! 
[08:19:42] (03PS2) 10KartikMistry: Enable Section Translation on 8 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176) [08:20:14] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]] (duration: 17m 25s) [08:20:18] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:20:19] T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825 [08:20:49] On second patch.. [08:21:51] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-2003.codfw.wmnet,service=thanos-web [08:21:57] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-2002.codfw.wmnet,service=thanos-web [08:22:39] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-web,name=eqiad [08:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42236 and previous config saved to /var/cache/conftool/dbconfig/20221205-082320-marostegui.json [08:24:00] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfw.wmnet,service=thanos-web [08:24:05] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web [08:24:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176) (owner: 10KartikMistry) [08:25:31] (03Merged) 10jenkins-bot: Enable Section Translation on 8 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176) (owner: 10KartikMistry) [08:25:33] (03CR) 10Muehlenhoff: [C: 03+2] 
postgresql::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/863286 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [08:25:44] !log kartik@deploy1002 Started scap: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]] [08:25:47] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [08:27:29] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:29:57] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-web,name=eqiad [08:30:40] (03PS2) 10Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 [08:33:18] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: remove thanos-sso [dns] - 10https://gerrit.wikimedia.org/r/862939 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [08:35:41] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]] (duration: 09m 57s) [08:35:45] T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176 [08:36:30] kostajh: I'm done. It took longer than I expected. [08:36:51] kart_: no worries. 
I'll get started with mine [08:37:14] (03PS9) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) [08:37:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: 10Kosta Harlan) [08:38:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff) [08:38:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4788 [08:38:36] 10SRE, 10MW-on-K8s, 10observability, 10serviceops, 10Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) a:03Clement_Goubert [08:38:40] (03Merged) 10jenkins-bot: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: 10Kosta Harlan) [08:38:47] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]] [08:38:50] T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686 [08:39:18] (03PS1) 10Filippo Giunchedi: hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - 10https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) [08:40:31] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:42:23] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38576/console" [puppet] - 
10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [08:42:27] syncing the config patch [08:43:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4788 [08:44:00] (03CR) 10Filippo Giunchedi: [V: 03+1] "See inline, LGTM overall though" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [08:46:47] (03PS2) 10Muehlenhoff: presto: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013) [08:47:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38623 [08:48:14] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]] (duration: 09m 26s) [08:48:16] T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686 [08:48:21] on to the wmf.12 patch [08:48:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38623 [08:48:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312) (owner: 10Kosta Harlan) [08:49:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55818 [08:49:23] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [08:50:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55818 
[08:51:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136907 [08:51:25] (03Abandoned) 10WMDE-Fisch: Clean up suggested values setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740765 (owner: 10Awight) [08:52:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136907 [08:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42237 and previous config saved to /var/cache/conftool/dbconfig/20221205-085235-marostegui.json [08:53:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 52580 [08:53:19] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [08:54:16] (03CR) 10Muehlenhoff: [C: 03+2] presto: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:54:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52580 [08:55:07] (03PS2) 10Muehlenhoff: envoy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013) [08:55:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 141731 [08:56:53] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: 10Clément Goubert) [08:58:00] (03CR) 10Muehlenhoff: [C: 03+2] envoy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:58:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141731 [08:58:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 58308 [08:59:24] (03PS6) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 [09:00:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 58308 [09:00:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 59689 [09:00:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 59689 [09:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42238 and previous config saved to /var/cache/conftool/dbconfig/20221205-090214-marostegui.json [09:02:16] still going with the backport [09:04:18] (03CR) 10Muehlenhoff: [C: 03+2] Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [09:04:48] (03Merged) 10jenkins-bot: Fix ExpensiveUserImpact input validation [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312) (owner: 10Kosta Harlan) [09:05:04] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]] [09:05:07] T324312: Exception executing job: refreshUserImpactJob 
Wikimedia\Assert\ParameterKeyTypeException: Bad value for parameter $json['dailyArticleViews']: all elements must have string keys - https://phabricator.wikimedia.org/T324312 [09:05:44] (03PS6) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [09:05:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:40] (03PS1) 10Kosta Harlan: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619) [09:06:50] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [09:06:55] (03PS1) 10Kosta Harlan: User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) [09:09:01] syncing. 
As there's nothing coming up, I'm going to sync two more patches [09:12:16] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) [09:12:29] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) a:05Clement_Goubert→03None [09:13:23] 10SRE, 10MW-on-K8s, 10observability, 10serviceops-radar: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) [09:14:14] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]] (duration: 09m 10s) [09:14:17] T324312: Exception executing job: refreshUserImpactJob Wikimedia\Assert\ParameterKeyTypeException: Bad value for parameter $json['dailyArticleViews']: all elements must have string keys - https://phabricator.wikimedia.org/T324312 [09:15:26] (03PS2) 10Kosta Harlan: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619) [09:15:28] !log kharlan@deploy1002 backport aborted: (duration: 00m 25s) [09:15:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [09:15:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42239 and previous config saved to /var/cache/conftool/dbconfig/20221205-091547-marostegui.json [09:16:00] (03PS2) 10Kosta Harlan: User impact: Show discovery notice to mobile users 
[extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) [09:16:11] (03PS1) 10Muehlenhoff: uwsgi: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783) [09:28:53] 10SRE, 10MW-on-K8s, 10observability, 10serviceops-radar: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10fgiunchedi) Thank you for reaching out @Clement_Goubert ! re: topic creation IIRC is open (i.e. topic will be auto-created on... [09:31:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:05] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: decom mgmt monitoring [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:32:10] (03PS3) 10Filippo Giunchedi: icinga: decom mgmt monitoring [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) [09:32:20] (03CR) 10Filippo Giunchedi: [V: 03+2] icinga: decom mgmt monitoring [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:36:26] (03CR) 10CI reject: [V: 04-1] User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [09:36:51] !log installing freetype security updates [09:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:24] (03Merged) 10jenkins-bot: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864712 
(https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [09:37:40] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]] [09:37:43] T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619 [09:38:11] !log force a puppet run on physical hosts to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/860572 [09:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:08] !log restarting mediawiki canaries to pick up freetype security updates [09:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:31] (03Abandoned) 10WMDE-Fisch: Rely on the default value for $wgFileExporterTarget [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762392 (owner: 10Awight) [09:42:51] (03PS2) 10Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) [09:45:02] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [09:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:50:40] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [09:50:45] T323619: NewImpact: Introduce new design to 
existing newcomers - https://phabricator.wikimedia.org/T323619 [09:50:57] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) As [[ https://phabricator.wikimedia.org/T265876#6559439 | noted in the parent task]], and quite an important infor... [09:51:28] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [09:51:50] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) [09:52:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one typo inline" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [09:52:12] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:52:21] (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [09:52:30] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [09:53:16] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:53:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:54:23] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [09:54:24] (03CR) 10Klausman: [C: 03+1] api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [09:54:46] checking the patch [09:56:02] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:57:13] syncing [09:57:18] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove php 7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/839324 (owner: 10Giuseppe Lavagetto) [09:57:48] on to the last patch 😅 [10:05:13] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]] (duration: 27m 33s) [10:05:17] T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619 [10:05:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 
(https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [10:06:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42240 and previous config saved to /var/cache/conftool/dbconfig/20221205-100607-marostegui.json [10:06:22] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:02] claime: do you know if the mw/k8s docker image building process is newly added to scap backport? Perhaps we should give a heads up to folks doing backports that it takes X minutes longer than it used to, for planning [10:07:11] (03CR) 10Hashar: [C: 03+1] zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:07:20] Yes, it is new, it's been turned on last week [10:08:31] kostajh: You're right, we should, _joe_ ideas on how to do that? 
[10:09:20] <_joe_> kostajh: yes, I think release engineering should send an announcement out to ops@ or wikitech-l [10:09:40] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [10:10:20] These wikifeeds flaps [10:10:23] (03CR) 10JMeybohm: [C: 03+1] Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [10:10:38] <_joe_> claime: I suspect the problem is pretty specific to that page, I'll verify [10:10:54] ack thanks [10:11:08] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:11:10] FYI there's a task https://phabricator.wikimedia.org/T324412 [10:12:14] <_joe_> curl https://wikifeeds.svc.codfw.wmnet:4101/en.wikipedia.org/v1/page/featured/2022/12/04 works flawlessly [10:12:37] Can it be because we're requesting feeds from 6 years ago? 
[10:12:47] April 29, 2016 [10:12:52] <_joe_> dunno, it's now working correctly as well [10:14:49] !log rebalance thanos rings T311690 [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:52] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [10:21:52] (03Merged) 10jenkins-bot: User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: 10Kosta Harlan) [10:22:08] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]] [10:22:11] T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619 [10:22:50] RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:23:48] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:24:09] verifying patch [10:24:54] (03CR) 10JMeybohm: [C: 03+1] calico / dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:25:56] syncing [10:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:27:24] (03PS1) 10Giuseppe Lavagetto: New php version [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/864726 [10:28:40] (03CR) 10Clément Goubert: [C: 03+1] New php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864726 (owner: 10Giuseppe Lavagetto) [10:30:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42241 and previous config saved to /var/cache/conftool/dbconfig/20221205-103028-marostegui.json [10:31:38] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]] (duration: 09m 30s) [10:31:41] T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619 [10:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:32:40] !log contint1001 - racadm serveraction powercyle - crashed [10:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:05] !log UTC morning deploys done [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:18] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] New php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864726 (owner: 10Giuseppe Lavagetto) [10:35:12] RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:41:07] (03PS2) 10Muehlenhoff: calico / dragonfly: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) [10:41:18] (03CR) 10CI reject: [V: 04-1] calico / dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:43:19] (03PS1) 10Filippo Giunchedi: utils: autodetect hiera directory in role_team_stats.py [puppet] - 10https://gerrit.wikimedia.org/r/864727 [10:43:29] (03CR) 10CI reject: [V: 04-1] utils: autodetect hiera directory in role_team_stats.py [puppet] - 10https://gerrit.wikimedia.org/r/864727 (owner: 10Filippo Giunchedi) [10:44:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:15] mmhh I'm wondering if my rebooting a crashed contint1001 has anything to do with those -1s [10:47:30] godog: jenkins the butler going mad and -1'ing everything [10:49:23] RECOVERY - Check for large files in client bucket on deploy1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:50:53] (03PS3) 10Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) [10:51:02] (03CR) 10CI reject: [V: 04-1] api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [10:53:59] <_joe_> yeah something is very wrong in jenkins [10:54:07] <_joe_> hashar: ^^ [10:54:34] <_joe_> or zuul [10:54:36] hi [10:54:47] what are the symptoms? 
[10:55:12] <_joe_> -1's with the message [10:55:15] <_joe_> This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset [10:55:28] <_joe_> and pipelinebot didn't pick up https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/864728 [10:55:33] that is the zuul-merger failing to merge the proposed patchset against the tip of the branch [10:55:35] <_joe_> and there's the same error message [10:55:45] usually due to a merge conflict, and sometime cause the zuul-merger is confused/broken [10:55:54] <_joe_> I would assume it's the latter [10:56:23] notably I found contint1001 crashed and rebooted it, I'm wondering if that's related ? [10:56:42] <_joe_> godog: zuul-merger should run from contint2001 [10:56:51] GitCommandError: Cmd('git') failed due to: exit code(128) [10:56:51] cmdline: git fetch --force --tags -v origin [10:56:51] stderr: 'fatal: Could not read from remote repository. [10:57:15] _joe_: I'm aware, I mentioned it just in case [10:57:18] it is a known issue, some connection got stuck [10:57:37] <_joe_> ok so the solution is to kick zuul-merger? 
[10:59:09] (03PS1) 10Volans: cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 [10:59:19] (03CR) 10CI reject: [V: 04-1] cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [10:59:22] _joe_: I am looking for the task that has the fix [11:00:35] (03CR) 10Clément Goubert: [C: 03+1] httpd-fcgi: allow logging ECS to a local rsyslog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864547 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [11:03:12] (03CR) 10Gehel: Elastic: Use OS major version for GC flags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [11:05:04] <_joe_> hashar: should I look into it? [11:06:05] <_joe_> I see you just restarted zuul [11:07:23] !log Restarted Zuul to clear a stuck ssh connection with Gerrit - T309376 [11:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:26] T309376: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 [11:09:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o [11:09:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o [11:10:06] <_joe_> hashar: any idea how can I get pipelinebot to pick up my change? 
[11:11:06] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:11:41] _joe_: so the issue is that sometime the ssh connection from Zuul to Gerrit get stuck indefinitely which keeps a ssh response thread busy on the Gerrit side [11:11:59] that goes against the 4 ssh connection per user limit and breaks the world [11:12:14] the "fix" is to restart Zuul entirely to clear the faulty connection [11:12:27] for PipelineBot, I guess a `recheck` on the change would be sufficient? [11:12:47] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) Volume recommendation is apparently ~2/3k mps/partition, so we may want 5 partitions, not considering broker equil... [11:13:55] Amir1: Got a minute to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/861813 so we're done with it? 
[11:14:19] oh [11:14:27] _joe_: I will trigger the postmerge job [11:15:12] sure thing [11:15:31] <3 [11:17:13] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42242 and previous config saved to /var/cache/conftool/dbconfig/20221205-111836-marostegui.json [11:19:06] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/864727 (owner: 10Filippo Giunchedi) [11:21:14] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) p:05Triage→03Medium [11:22:14] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:22:46] (03CR) 10Muehlenhoff: [C: 03+2] uwsgi: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:22:57] (03CR) 10Muehlenhoff: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/output/864664/38579/" [puppet] - 10https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:24:03] <_joe_> hashar: <3 [11:24:58] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [11:29:39] _joe_: you are welcome and sorry for the mess :\ [11:30:10] (03PS4) 10Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) [11:30:40] (03PS3) 10Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 [11:31:00] !log installing librsvg bugfix updates from buster 
point release [11:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] (03CR) 10Ladsgroup: [C: 03+1] mediawiki::maintenance::campaignevents: meta [puppet] - 10https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [11:31:58] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mediawiki::maintenance::campaignevents: meta [puppet] - 10https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [11:34:15] 10SRE, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10LSobanski) 05Open→03Resolved I updated L3 to reflect the suggestion. [11:36:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42243 and previous config saved to /var/cache/conftool/dbconfig/20221205-113746-marostegui.json [11:38:28] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [11:40:33] (03PS1) 10Giuseppe Lavagetto: shellbox: bump image version, move to 4.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864734 [11:43:07] (03Merged) 10jenkins-bot: api-gateway: add option to remove part of url path [deployment-charts] - 10https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan) [11:45:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] shellbox: bump image version, move to 4.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864734 (owner: 10Giuseppe Lavagetto) [11:49:27] 10SRE, 10Infrastructure-Foundations: move human users out of UID range for system accounts - 
https://phabricator.wikimedia.org/T114446 (10LSobanski) The list Daniel posted above is still more or less accurate and the originally stated question is still valid. [11:49:33] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [11:50:00] (03Merged) 10jenkins-bot: shellbox: bump image version, move to 4.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864734 (owner: 10Giuseppe Lavagetto) [11:50:53] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:51:24] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:51:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff) [11:52:20] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:53:05] <_joe_> taavi: I am deploying shellbox-score today with php 7.4, and tomorrow I'll deploy the rest of them if everything goes well [11:53:23] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:58:54] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:59:25] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [11:59:29] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [11:59:44] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [12:00:37] (03PS1) 10Hnowlan: api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/864737 [12:02:24] (03CR) 10Muehlenhoff: [C: 03+2] calico / dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:03:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image 
data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [12:04:09] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [12:05:29] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:25:09] (03CR) 10Hnowlan: [C: 03+2] api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/864737 (owner: 10Hnowlan) [12:26:29] (03CR) 10Hnowlan: [C: 03+2] admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:28:42] Regarding the wikifeeds flaps it seems it's always the same pod [12:28:49] It's the only one with events [12:29:01] I'll scratch it and make helm recreate it [12:29:25] (03PS1) 10Slyngshede: Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 [12:29:42] (03CR) 10CI reject: [V: 04-1] Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede) [12:30:23] (03Merged) 10jenkins-bot: api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/864737 (owner: 10Hnowlan) [12:31:27] (03PS2) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 [12:31:29] (03Merged) 10jenkins-bot: admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:31:35] (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [12:32:59] (03CR) 10Slyngshede: Configuration: Add support for setting connection timeout. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [12:33:11] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:34:17] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10MoritzMuehlenhoff) Switching to contint1002 would also be a good opportunity to migrate to Bullseye (which per https://wikitech.wikimedia.org/wiki/Op... [12:34:55] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:35:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [12:36:25] (03CR) 10Muehlenhoff: [C: 03+1] "recheck" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [12:38:06] (03PS2) 10Muehlenhoff: zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) [12:39:42] !log root@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:40:29] (03PS2) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 [12:40:59] (03CR) 10CI reject: [V: 04-1] DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 (owner: 10David Caro) [12:41:06] !log root@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:41:12] (03CR) 10Muehlenhoff: [C: 03+2] zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:41:20] !log root@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [12:41:50] !log root@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [12:44:26] (03PS1) 10Muehlenhoff: Add AntiCompositeNumber to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/864757 [12:45:34] (03CR) 10Hnowlan: [C: 04-1] Promote Cassandra 3.11.13 to '3.x' (aka stable) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [12:46:31] (03CR) 10Muehlenhoff: [C: 03+2] Add AntiCompositeNumber to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/864757 (owner: 10Muehlenhoff) [12:46:41] (03PS3) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 [12:49:26] (03PS4) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 [12:50:24] !log installing python-keystoneauth1 bugfix updates from Buster 10.13 point release [12:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] (03PS1) 10JMeybohm: helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706) [12:56:17] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [12:58:05] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:59:04] Well apparently killing and recreating the pod that was bugging out didn't fix it [13:02:53] 
(03PS3) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 [13:03:44] Amir1: Going to lunch [13:03:45] (03PS1) 10JMeybohm: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - 10https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706) [13:03:47] (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [13:03:56] noted [13:04:12] Back in ~1h [13:04:32] I'm taking the pager, if there's anything I'll come back [13:07:19] (03PS4) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 [13:10:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Ottomata) Thanks. Not sure what is going on, but I found some things you could try in [[ https://unix.stackexchange.com/questions/416166/cant-establish-s... 
[13:11:42] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:12:51] !log installing libnet-ssleay-perl bugfix updates from Buster 10.13 point release [13:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:11] (03CR) 10Volans: [C: 03+2] setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [13:16:53] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:17:07] !log installing distro-info-data bugfix updates from Buster 10.13 point release [13:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:22] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [13:18:43] (03Merged) 10jenkins-bot: setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [13:21:12] (03CR) 10Ottomata: Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [13:23:58] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:24:41] !log installing postgresql-common bugfix updates from Buster 10.13 point release [13:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] jouncebot: nowandnext [13:25:57] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [13:25:58] In 0 hour(s) and 4 minute(s): Run fixMergeHistoryCorruption.php (T302486) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1330) 
[13:26:00] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486 [13:27:30] (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [13:30:05] TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Run fixMergeHistoryCorruption.php (T302486) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1330). [13:30:34] (03PS2) 10Slyngshede: Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 [13:31:45] !log T302486 : [samtar@mwmaint1002 ~]$ mwscript maintenance/fixMergeHistoryCorruption.php --wiki enwiki --ns 828 --delete [13:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:49] T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486 [13:32:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 55818 [13:33:07] (03PS1) 10Volans: setup.py: temporary fix for test dependencies [software/cumin] - 10https://gerrit.wikimedia.org/r/864764 [13:34:21] (03CR) 10Volans: "Thanks for the patch. Do you have in mind any specific use case where this will be needed?" [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [13:42:13] (03CR) 10Volans: "dhinus, dcaro: do you have any objection to merge this in its current status? Do you need more time to have a look?" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [13:42:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T324180 [13:42:49] T324180: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T324180 [13:43:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T324180 [13:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2105 with weight 0 T324180', diff saved to https://phabricator.wikimedia.org/P42245 and previous config saved to /var/cache/conftool/dbconfig/20221205-134346-ladsgroup.json [13:44:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55818 [13:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:48:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: 
CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:01] (03PS1) 10Stang: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) [13:51:30] !log repooling wdqs1004 [13:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:17] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:07] sukhe: good morning, is that known ^ [13:54:02] (03PS2) 10Ladsgroup: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180) (owner: 10Gerrit maintenance bot) [13:54:09] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180) (owner: 10Gerrit maintenance bot) [13:54:59] !log Starting s3 codfw failover from db2127 to db2105 - T324180 [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:02] T324180: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T324180 [13:55:26] * claime back [13:55:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2105 to s3 primary T324180', diff saved to https://phabricator.wikimedia.org/P42246 and previous config saved to /var/cache/conftool/dbconfig/20221205-135539-ladsgroup.json [13:59:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2127 T324180', diff saved to https://phabricator.wikimedia.org/P42247 and previous config saved to /var/cache/conftool/dbconfig/20221205-135932-ladsgroup.json [14:00:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1400). nyaa~ [14:00:05] guerganaWMDE and cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] o/ [14:00:28] o/ [14:00:49] (best jouncebot message) [14:01:23] Ok, i will be here [14:02:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:02:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:02:50] If no deployers are available in 5 mins I can deploy [14:07:19] I will deploy [14:07:46] guerganaWMDE: starting with yours [14:07:47] \o/ [14:08:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [14:08:08] im ready [14:08:24] Amir1: thanks! and no, not expected. 
checking [14:08:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:08:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:08:44] (03Merged) 10jenkins-bot: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [14:08:59] !log samtar@deploy1002 Started scap: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]] [14:09:02] T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282 [14:10:02] `scap backport` has had an update? [14:10:30] (no idea what `build-and-push-container-images` is) [14:10:49] Building and pushing the mediawiki container image for mw-on-k8s [14:11:10] woah, that's happening? [14:11:22] will you let me know which debug server i have to use? [14:12:41] guerganaWMDE: I will :) it is still doing this new step, not sure how long it will take [14:13:12] claime: is this a long process? [14:13:28] Shouldn't take more than 3 minutes usually [14:13:47] thanks! 
sure, i await instructions [14:14:13] Okay, at 5m currently (would be nice if it didn't redirect output but understand why it *does*) [14:15:08] 10SRE, 10Traffic: Drop the VarnishTrafficDrop and HAProxyEdgeTrafficDrop alerts - https://phabricator.wikimedia.org/T322220 (10fgiunchedi) [14:15:32] (took 6m, all ok) [14:18:02] !log samtar@deploy1002 samtar and gtzatchkova: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:18:05] T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282 [14:18:20] guerganaWMDE: live on mwdebug, use mwdebug2001 :) [14:18:26] 6 minutes is kinda long... just pushing the image took 4 minutes [14:19:04] ok, let me check if the change is there, one second [14:19:06] RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:09] ack [14:19:46] claime: me being impatient but so far for a config change this is significantly slower.. [14:19:57] Yes, I agree [14:19:59] it works!!! thanks! [14:20:10] guerganaWMDE: great, syncing [14:20:18] 10SRE, 10Infrastructure-Foundations, 10Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10ssingh) >>! In T324334#8439896, @Volans wrote: > Sorry for the trouble, that was me indeed, I've fixed the permissions and run the `sre.dns.netbox` cookbook successfully: > >... 
[14:20:44] (03PS2) 10Samtar: logos: icon could be not square [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang) [14:22:18] (03PS2) 10Samtar: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang) [14:23:14] (03PS3) 10Elukey: knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) [14:23:50] (03CR) 10Elukey: "Updated all suggestions, and also added Build-Depends where needed :) Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [14:23:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede) [14:24:16] (03CR) 10Slyngshede: [C: 03+2] Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede) [14:25:20] (03Merged) 10jenkins-bot: Remove dependency on LDAP container. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede) [14:25:48] cirno: will be doing 863467 and 864766 next [14:25:59] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]] (duration: 16m 59s) [14:26:01] guerganaWMDE: should be live on production now :) [14:26:03] T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282 [14:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:26:23] (03PS1) 10Marostegui: db1206: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864769 [14:26:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang) [14:26:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang) [14:26:54] *checks [14:27:24] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:27] (03Merged) 10jenkins-bot: logos: icon could be not square [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang) [14:27:27] \o/ it's live, thanks! 
[14:27:30] (03CR) 10Marostegui: [C: 03+2] db1206: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864769 (owner: 10Marostegui) [14:27:32] (03Merged) 10jenkins-bot: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang) [14:27:32] great :) [14:27:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]] [14:27:48] T324393: Change the logo of Turkish Wikipedia for 20th anniversary of Turkish Wikipedia - https://phabricator.wikimedia.org/T324393 [14:27:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42249 and previous config saved to /var/cache/conftool/dbconfig/20221205-142752-marostegui.json [14:28:48] i will log off. thank you! [14:28:52] o/ [14:29:14] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:31] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:29:39] cirno: live on mwdebug [14:30:29] TheresNoTime: tested under vector, vector-2022 and timeless, all looks good to me [14:30:36] syncin' [14:31:10] (03PS1) 10Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) [14:31:12] (03PS5) 10Slyngshede: Configuration: Add support for setting connection timeout. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 [14:32:06] (03PS2) 10Samtar: beta: Set wgPageTriageEnableEnglishWikipediaFeatures to False [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 (https://phabricator.wikimedia.org/T321922) (owner: 10Stang) [14:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:32:43] (03PS8) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [14:34:18] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5021.eqsin.wmnet,service=ats-be [14:34:19] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5021.eqsin.wmnet,service=ats-tls [14:34:19] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5021.eqsin.wmnet,service=varnish-fe [14:34:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=ats-be [14:34:23] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=ats-tls [14:34:23] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=varnish-fe [14:34:23] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5025.eqsin.wmnet,service=ats-be [14:34:24] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5025.eqsin.wmnet,service=ats-tls [14:34:24] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5025.eqsin.wmnet,service=varnish-fe [14:34:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=cp5025.eqsin.wmnet,service=ats-be [14:34:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=ats-tls [14:34:28] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=varnish-fe [14:34:48] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:44] (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [14:36:11] cirno: about to start 863441 (beta only) [14:36:23] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]] (duration: 08m 37s) [14:36:26] T324393: Change the logo of Turkish Wikipedia for 20th anniversary of Turkish Wikipedia - https://phabricator.wikimedia.org/T324393 [14:36:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 (https://phabricator.wikimedia.org/T321922) (owner: 10Stang) [14:36:34] (03CR) 10Andrew Bogott: [C: 03+2] oslo_messaging_rabbit: kombu_reconnect_delay=0.1 [puppet] - 10https://gerrit.wikimedia.org/r/864321 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [14:36:42] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:13] (03Merged) 10jenkins-bot: beta: Set wgPageTriageEnableEnglishWikipediaFeatures to False [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 
(https://phabricator.wikimedia.org/T321922) (owner: 10Stang) [14:37:38] all done [14:37:56] !log closing UTC afternoon backport window [14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:35] (03PS2) 10Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) [14:39:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [14:39:49] (03CR) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [14:39:55] (03PS3) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) [14:40:07] (03PS1) 10Ssingh: cp5011, cp5013: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864771 (https://phabricator.wikimedia.org/T323830) [14:40:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:40:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [14:40:57] (03CR) 10Andrew Bogott: [C: 03+2] Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [14:41:00] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [14:41:06] PROBLEM - 
wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [14:41:32] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=ats-tls [14:41:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=ats-be [14:41:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=varnish-fe [14:41:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=ats-tls [14:41:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=ats-be [14:41:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=varnish-fe [14:42:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5011,5013].eqsin.wmnet with reason: downtimed, to be depooled [14:42:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5011,5013].eqsin.wmnet with reason: downtimed, to be depooled [14:42:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:43:14] (03CR) 10Ssingh: [C: 03+2] cp5011, cp5013: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864771 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [14:44:13] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected bahaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10fgiunchedi) I'm optimistically pulling o11y since AFAICS there's no actionabled [14:48:08] 10SRE, 10Infrastructure-Foundations: 
Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [14:48:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5011,5013].eqsin.wmnet [14:49:42] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:47] (03PS9) 10David Caro: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [14:51:27] 10SRE, 10Observability-Metrics, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10fgiunchedi) 05Open→03Invalid I've run the following query `sum by (state) (apache_workers)` and I'm seeing only state `busy` or... 
[14:53:58] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:54:15] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:54:29] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:55:15] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:55:48] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:56:15] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5011,5013].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:57:09] (03PS2) 10Majavah: puppetdb: support using client certificates [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 [14:57:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5011,5013].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:57:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[5011,5013].eqsin.wmnet [14:57:41] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5011,5013].eqsin.wmnet` - cp5011.eqsin.w... 
[14:57:54] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [14:59:52] (03CR) 10Majavah: puppetdb: support using client certificates (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [15:01:10] (03CR) 10FNegri: spicerack: add module injection support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:02:22] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:08] PROBLEM - Check whether ferm is active by checking the default input chain on moss-fe1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:05:02] 10SRE, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10fgiunchedi) [15:05:13] (03CR) 10David Caro: [C: 03+2] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [15:05:23] 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10jcrespo) > I would suggest that the alert should be on a request p... [15:05:23] !log root@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:06:12] !log root@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:06:24] 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10jcrespo) [15:06:40] !log root@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:07:08] !log root@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:08:14] (03Merged) 10jenkins-bot: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [15:11:44] (03Abandoned) 10Herron: swift: update ephemeral port range from 1024-65535 to 10240-65535 [puppet] - 10https://gerrit.wikimedia.org/r/808040 (https://phabricator.wikimedia.org/T311262) (owner: 10Herron) [15:11:48] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:14:07] (03PS35) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [15:14:26] (03CR) 10Btullis: "Thanks, this looks great in general." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:14:45] (03PS36) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [15:14:49] !log deleted wikitech-static-ord-prebuster image backup in rackspace cloud. 
Here concludes the wikitech-static upgrade to Buster and php7.4 [15:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] PROBLEM - Check systemd state on kubernetes2010 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:33] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:16:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:10] RECOVERY - Check systemd state on kubernetes2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:52] (03CR) 10David Caro: [C: 03+1] "LGTM, I have not played much with it yet, but I'm sure we can fix any issues that pop up if any." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [15:18:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:36] (03PS1) 10Hnowlan: thumbor: increase memory limit for instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936) [15:25:35] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5022.eqsin.wmnet,service=ats-be [15:25:35] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5022.eqsin.wmnet,service=ats-tls [15:25:35] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5022.eqsin.wmnet,service=varnish-fe [15:25:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=ats-be [15:25:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=ats-tls [15:25:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=varnish-fe [15:25:41] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5026.eqsin.wmnet,service=ats-be [15:25:41] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5026.eqsin.wmnet,service=ats-tls [15:25:41] !log sukhe@puppetmaster1001 conftool action : 
set/weight=1; selector: name=cp5026.eqsin.wmnet,service=varnish-fe [15:25:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-be [15:25:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-tls [15:25:44] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=varnish-fe [15:26:04] (03PS1) 10Ssingh: cp5012, cp5014: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864775 (https://phabricator.wikimedia.org/T323830) [15:28:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=ats-tls [15:28:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=ats-be [15:28:43] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=varnish-fe [15:28:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=ats-tls [15:28:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=ats-be [15:28:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=varnish-fe [15:30:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5012,5014].eqsin.wmnet with reason: downtimed, to be depooled [15:30:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5012,5014].eqsin.wmnet with reason: downtimed, to be depooled [15:31:42] (03CR) 10Ssingh: [C: 03+2] cp5012, cp5014: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864775 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [15:34:58] RECOVERY - Check whether ferm is active by checking the default input chain on moss-fe1001 
is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:35:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5012,5014].eqsin.wmnet [15:36:06] !log installing apache2 security updates on buster [15:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:15] 10SRE, 10PyBal, 10Traffic-Icebox: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730 (10Aklapper) p:05Medium→03Low [15:41:56] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:43:53] 10SRE, 10Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (10MoritzMuehlenhoff) [15:43:59] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:44:24] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5012,5014].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [15:44:40] 10SRE, 10Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (10Volans) [15:44:41] (03CR) 10Ssingh: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [15:45:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5012,5014].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [15:45:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:44] !log sukhe@cumin2002 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts cp[5012,5014].eqsin.wmnet [15:45:51] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5012,5014].eqsin.wmnet` - cp5012.eqsin.w... [15:45:57] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [15:46:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1010.eqiad.wmnet with OS bullseye [15:48:18] 10SRE, 10Traffic, 10Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10ssingh) We have enabled the hardened haproxy unit on `traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud` to start with, before rolling it out to the production cp hosts. [15:49:28] (03PS4) 10Ssingh: [In case of emergency/Stage 3] depool eqsin for hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/856664 [15:50:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:22] (03CR) 10Ssingh: "Emergency patch for Stage 3 of eqsin hardware refresh (Monday Dec 5). DO NOT MERGE unless there are issues with eqsin." 
[dns] - 10https://gerrit.wikimedia.org/r/856664 (owner: 10Ssingh) [15:52:49] (03PS1) 10Herron: vo-escalate: kill process if run time exceeds 10s [puppet] - 10https://gerrit.wikimedia.org/r/864776 (https://phabricator.wikimedia.org/T324466) [15:52:58] (03CR) 10DCausse: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [15:58:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:54] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:04:56] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:05:49] (03PS4) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 [16:06:30] !log installing glibc security updates on buster [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:44] !log restarted kube-apiserver on ml-serve-ctrl1001 to address high latency and large number of 504s [16:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10Muhammad_Yasser_Jazirahly_WMDE) [16:08:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:11:27] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
logstash1010.eqiad.wmnet with reason: host reimage [16:13:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:14:36] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1010.eqiad.wmnet with reason: host reimage [16:21:36] (03PS13) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:25:44] (03PS14) 10Elukey: Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:26:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:27:05] !log restarted kube-apiserver on ml-staging-ctrl2001 to address high latency [16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1630). 
[16:30:17] (03CR) 10Volans: [C: 03+2] "Merging to unblock CI" [software/cumin] - 10https://gerrit.wikimedia.org/r/864764 (owner: 10Volans) [16:31:40] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [16:32:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificates) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:38:11] (03Merged) 10jenkins-bot: setup.py: temporary fix for test dependencies [software/cumin] - 10https://gerrit.wikimedia.org/r/864764 (owner: 10Volans) [16:38:30] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5023.eqsin.wmnet,service=ats-be [16:38:31] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5023.eqsin.wmnet,service=ats-tls [16:38:31] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5023.eqsin.wmnet,service=varnish-fe [16:38:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=ats-be [16:38:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=ats-tls [16:38:33] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=varnish-fe [16:38:37] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5027.eqsin.wmnet,service=ats-be [16:38:37] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: 
name=cp5027.eqsin.wmnet,service=ats-tls [16:38:37] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5027.eqsin.wmnet,service=varnish-fe [16:38:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=ats-be [16:38:39] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=ats-tls [16:38:40] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=varnish-fe [16:38:55] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1035.eqiad.wmnet with OS bullseye [16:39:00] (03PS1) 10Ssingh: cp5015: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864785 (https://phabricator.wikimedia.org/T323830) [16:39:37] (03PS3) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [16:40:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1010.eqiad.wmnet with OS bullseye [16:40:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=ats-tls [16:40:16] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=ats-be [16:40:16] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=varnish-fe [16:41:00] (03CR) 10Herron: service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [16:41:37] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1034.eqiad.wmnet with OS bullseye [16:43:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5015.eqsin.wmnet with reason: downtimed, to be depooled [16:43:25] !log sukhe@cumin2002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5015.eqsin.wmnet with reason: downtimed, to be depooled [16:43:31] (03CR) 10Volans: "Looks good! Small nits and a proposal for a small improvement inline." [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [16:44:07] (03CR) 10Ssingh: [C: 03+2] cp5015: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864785 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [16:44:45] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1033.eqiad.wmnet with OS bullseye [16:45:11] (03PS1) 10CDanis: Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T234466) [16:45:34] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10akosiaris) [16:46:18] 10SRE, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10LSobanski) p:05Medium→03Low [16:47:46] 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10Jrbranaa) Approved. 
[16:48:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp5015.eqsin.wmnet [16:49:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [16:49:37] (03PS2) 10CDanis: Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466) [16:49:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [16:50:06] 10SRE, 10Traffic-Icebox, 10Wikimedia-Planet, 10serviceops-collab, 10Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480 (10LSobanski) p:05Medium→03Lowest [16:53:02] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [16:53:37] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1035.eqiad.wmnet with reason: host reimage [16:55:55] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10akosiaris) @Ottomata, @elukey any updates on this? Should we keep it open/... 
[16:56:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1034.eqiad.wmnet with reason: host reimage [16:56:11] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1035.eqiad.wmnet with reason: host reimage [16:56:42] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:57:05] (03CR) 10Herron: [C: 03+1] Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466) (owner: 10CDanis) [16:57:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [16:57:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp5015.eqsin.wmnet [16:58:05] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp5015.eqsin.wmnet` - cp5015.eqsin.wmnet (*... 
[16:58:14] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [16:59:14] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1034.eqiad.wmnet with reason: host reimage [16:59:33] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1033.eqiad.wmnet with reason: host reimage [17:00:03] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (10LSobanski) [17:00:23] 10SRE, 10Diffusion, 10Release-Engineering-Team, 10serviceops-collab: svn.wikimedia.org redirects to Diffusion main page, hence hard to find e.g. "flexbisonparse" - https://phabricator.wikimedia.org/T140594 (10LSobanski) [17:00:33] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:00:47] 10SRE, 10Release-Engineering-Team, 10serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (10LSobanski) p:05Low→03Lowest [17:01:51] (03PS1) 10Ssingh: cp5016: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864789 (https://phabricator.wikimedia.org/T323830) [17:02:34] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata) I would like to see config management for Kafka topics one day, i... 
[17:02:35] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1033.eqiad.wmnet with reason: host reimage [17:03:37] ^ httpbb timeout for https://en.wikivoyage.org/wiki/Main_Page this time, interesting [17:03:48] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 3538 MB (3% inode=80%): /tmp 3538 MB (3% inode=80%): /var/tmp 3538 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [17:03:52] something must have changed, we're getting random timeouts a lot more frequently [17:10:50] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:32] (03CR) 10Jforrester: [C: 03+1] "Oops. Apparently we don't trigger this prod!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: 10Zabe) [17:15:55] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10sbassett) [17:15:59] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10sbassett) [17:19:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:55] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1035.eqiad.wmnet with OS bullseye [17:21:58] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1034.eqiad.wmnet with OS bullseye [17:28:34] !log 
sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5024.eqsin.wmnet,service=ats-be [17:28:35] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5024.eqsin.wmnet,service=ats-tls [17:28:35] !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5024.eqsin.wmnet,service=varnish-fe [17:28:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=ats-be [17:28:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=ats-tls [17:28:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=varnish-fe [17:28:43] 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10akosiaris) 05Open→03Stalled Cool, thanks for that write up @Ottomata. I... 
[17:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:25] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=ats-tls [17:30:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=ats-be [17:30:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=varnish-fe [17:30:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5016.eqsin.wmnet with reason: downtimed, to be depooled [17:31:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5016.eqsin.wmnet with reason: downtimed, to be depooled [17:31:25] (03CR) 10Ssingh: [C: 03+2] cp5016: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864789 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:31:54] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:31:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:31:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:32:56] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [17:33:36] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp5016.eqsin.wmnet [17:34:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:37:20] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:39:06] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:40:55] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:41:45] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:42:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:42:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp5016.eqsin.wmnet [17:42:16] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - 
https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp5016.eqsin.wmnet` - cp5016.eqsin.wmnet (*... [17:42:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [17:44:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:44:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:48:41] (03PS1) 10Ssingh: hiera: remove obsolete per-cp hosts override [puppet] - 10https://gerrit.wikimedia.org/r/864797 [17:49:18] (03PS2) 10Ssingh: hiera: remove obsolete per-cp hosts override (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/864797 [17:49:40] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:03] 10SRE: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10LSobanski) 05Open→03Resolved a:03LSobanski The linked GitHub task was resolved without 
resolution. Resolving this one as well, please reopen if this is still needed. [17:52:54] 10SRE, 10serviceops, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10LSobanski) [17:53:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:53:14] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:53:46] (03CR) 10Ssingh: [C: 03+2] hiera: remove obsolete per-cp hosts override (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/864797 (owner: 10Ssingh) [17:54:44] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:54:48] 10SRE, 10serviceops, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) 05Open→03Resolved a:03akosiaris ` ssh kubernetes1007.eqiad.wmnet dpkg -l docker.io |grep docker.io ii... 
[17:54:56] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:57:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [17:58:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [18:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1800). 
[18:01:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:02:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:51] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:04:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [18:04:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [18:13:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [18:13:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance [18:13:38] (03CR) 10CDanis: [C: 03+2] Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466) (owner: 10CDanis) [18:19:05] (03CR) 10Herron: [C: 03+1] decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [18:19:41] (03CR) 10Herron: [C: 03+1] hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - 10https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [18:20:52] (03PS1) 10CDanis: TIL that systemd doesn't allow mid-line comments [puppet] - 
10https://gerrit.wikimedia.org/r/864827 [18:21:45] (03CR) 10CDanis: [C: 03+2] TIL that systemd doesn't allow mid-line comments [puppet] - 10https://gerrit.wikimedia.org/r/864827 (owner: 10CDanis) [18:21:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42251 and previous config saved to /var/cache/conftool/dbconfig/20221205-182155-ladsgroup.json [18:22:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:05] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:24:43] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:27:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:36:06] 10SRE, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) 05Resolved→03Open [18:36:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [18:37:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42252 and previous config saved to /var/cache/conftool/dbconfig/20221205-183700-ladsgroup.json [18:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [18:37:06] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1033.eqiad.wmnet with OS bullseye [18:37:09] 10SRE, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) a:05jbond→03None [18:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42253 and previous config saved to /var/cache/conftool/dbconfig/20221205-183712-ladsgroup.json [18:37:13] (03PS1) 10Jberkel: Make "make" available in all images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) [18:37:15] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:38:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:38:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42254 and previous config saved to /var/cache/conftool/dbconfig/20221205-183851-ladsgroup.json [18:38:55] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:40:39] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) [18:41:35] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) @Fuzzy Hi, access requests are handled by a different person each week, that's why you see me reopen and unassign/tag it. it... 
[18:41:58] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10Dzahn) a:03jhathaway [18:45:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:46:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42255 and previous config saved to /var/cache/conftool/dbconfig/20221205-184643-ladsgroup.json [18:46:48] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [18:47:34] (03PS2) 10Jberkel: Make "make" available in all images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) [18:47:42] (03PS1) 10Ebernhardson: prom: Add elasticsearch cluster name to exported latency metrics [puppet] - 10https://gerrit.wikimedia.org/r/864829 [18:49:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:49:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [18:49:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:49:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance [18:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42256 and previous config 
saved to /var/cache/conftool/dbconfig/20221205-184944-ladsgroup.json [18:49:48] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42257 and previous config saved to /var/cache/conftool/dbconfig/20221205-184950-ladsgroup.json [18:51:42] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [18:52:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42258 and previous config saved to /var/cache/conftool/dbconfig/20221205-185205-ladsgroup.json [18:52:17] (03CR) 10Andrea Denisse: [C: 03+1] hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - 10https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [18:54:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42259 and previous config saved to /var/cache/conftool/dbconfig/20221205-185429-ladsgroup.json [18:59:56] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) 05Open→03Resolved a:03Legoktm [19:01:43] (03CR) 10Jberkel: "Perhaps the change should be made in bullseye|buster-sssd/Dockerfile.template instead?" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [19:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P42260 and previous config saved to /var/cache/conftool/dbconfig/20221205-190150-ladsgroup.json [19:03:43] (03PS2) 10Bking: prom: Add elasticsearch cluster name to exported latency metrics [puppet] - 10https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: 10Ebernhardson) [19:05:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: 10Ebernhardson) [19:05:37] (03PS1) 10Effie Mouzeli: Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) [19:06:57] (03CR) 10David Caro: [C: 03+2] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [19:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42261 and previous config saved to /var/cache/conftool/dbconfig/20221205-190710-ladsgroup.json [19:09:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42262 and previous config saved to /var/cache/conftool/dbconfig/20221205-190935-ladsgroup.json [19:10:14] (03Merged) 10jenkins-bot: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [19:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P42263 and previous config saved to /var/cache/conftool/dbconfig/20221205-191656-ladsgroup.json [19:19:01] 
10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10Dzahn) Hi @Muhammad_Yasser_Jazirahly_WMDE , welcome! Could you get a manager at WMDE to approve this here on the ticket? This will be picked up soon by our rotating clinic duty. note to cl... [19:19:45] (03CR) 10Eevans: [C: 03+1] Promote Cassandra 3.11.13 to '3.x' (aka stable) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [19:20:17] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) [19:20:25] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) a:03Dzahn [19:22:55] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) p:05Triage→03Medium [19:24:19] !log phab1001, previous long time phabricator host, is about to be shut down, made a final copy of /srv/deployment, /root, /home, /etc and synced it to phab1004 - T323418 [19:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:22] T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 [19:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42264 and previous config saved to /var/cache/conftool/dbconfig/20221205-192442-ladsgroup.json [19:30:37] (03PS5) 10Dzahn: phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [19:32:03] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42265 and previous config saved to /var/cache/conftool/dbconfig/20221205-193203-ladsgroup.json [19:32:07] T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827 [19:32:33] (03CR) 10Dzahn: "this removes the production role and phab1001 will be removed from firewall on remaining hosts, also deletes the entire hosts Hiera entry." [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:32:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42266 and previous config saved to /var/cache/conftool/dbconfig/20221205-193250-ladsgroup.json [19:32:53] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42267 and previous config saved to /var/cache/conftool/dbconfig/20221205-193448-ladsgroup.json [19:35:39] (03PS1) 10Ahmon Dancy: logspam: Filter out some very common errors by default [puppet] - 10https://gerrit.wikimedia.org/r/864834 [19:39:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [19:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42268 and previous config saved to /var/cache/conftool/dbconfig/20221205-193949-ladsgroup.json [19:39:53] T323907: Make fr_user unsigned - 
https://phabricator.wikimedia.org/T323907 [19:40:41] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:41:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:44:05] (03CR) 10Andrew Bogott: [C: 03+1] "LGTM although this now needs a manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:46:01] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:47:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42269 and previous config saved to /var/cache/conftool/dbconfig/20221205-194757-ladsgroup.json [19:48:52] (03PS6) 10Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) [19:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42270 and previous config saved to /var/cache/conftool/dbconfig/20221205-194955-ladsgroup.json [19:49:55] (03CR) 10Dzahn: [C: 03+2] phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:50:01] (03PS6) 10Dzahn: phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [19:50:32] (03CR) 10Brennen Bearnes: [C: 03+1] "Tested on mwlog1002; works great." 
[puppet] - 10https://gerrit.wikimedia.org/r/864834 (owner: 10Ahmon Dancy) [19:53:30] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/816046/38583/cloudcontrol1005.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:56:53] (03CR) 10Andrew Bogott: [C: 03+2] rsyslog: allow specifying TLS client auth settings and filename property [puppet] - 10https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:57:54] !log phab1004 (prod) - removing phab1001 from firewall rules, rsync config | phab1001 (formerly prod) - removing prod role T323418 T280597 [19:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:59] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [19:57:59] T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 [19:58:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2127.codfw.wmnet with reason: Maintenance [19:58:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2127.codfw.wmnet with reason: Maintenance [19:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42271 and previous config saved to /var/cache/conftool/dbconfig/20221205-195842-ladsgroup.json [19:58:46] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:00:24] (03PS1) 10Dzahn: site/phabricator: fix insetup role name which is now team specific [puppet] - 10https://gerrit.wikimedia.org/r/864840 (https://phabricator.wikimedia.org/T323418) [20:01:11] (03CR) 10Dzahn: [C: 03+2] site/phabricator: fix insetup role name which is now team specific [puppet] - 
10https://gerrit.wikimedia.org/r/864840 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [20:02:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on phab1001.eqiad.wmnet with reason: decom, replaced by phab1004 [20:02:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on phab1001.eqiad.wmnet with reason: decom, replaced by phab1004 [20:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42272 and previous config saved to /var/cache/conftool/dbconfig/20221205-200303-ladsgroup.json [20:05:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42273 and previous config saved to /var/cache/conftool/dbconfig/20221205-200501-ladsgroup.json [20:05:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:05:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42274 and previous config saved to /var/cache/conftool/dbconfig/20221205-200530-ladsgroup.json [20:05:34] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:07:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [20:07:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance [20:07:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [20:07:44] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42275 and previous config saved to /var/cache/conftool/dbconfig/20221205-200743-ladsgroup.json [20:07:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance [20:07:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42276 and previous config saved to /var/cache/conftool/dbconfig/20221205-200755-ladsgroup.json [20:07:58] (03PS9) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [20:08:19] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (0311 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [20:09:52] (03CR) 10Ottomata: "Thanks ben! Adding serviceops folks for review now." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [20:10:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42277 and previous config saved to /var/cache/conftool/dbconfig/20221205-201021-ladsgroup.json [20:12:11] (03PS3) 10Dzahn: site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) [20:12:59] (03CR) 10CI reject: [V: 04-1] site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [20:16:03] (03CR) 10Dzahn: [C: 03+2] "looks reasonable and already tested" [puppet] - 10https://gerrit.wikimedia.org/r/864834 (owner: 10Ahmon Dancy) [20:18:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42278 and previous config saved to /var/cache/conftool/dbconfig/20221205-201810-ladsgroup.json [20:18:11] (03CR) 10Dzahn: [C: 03+2] "deployed on mwlog1002. /usr/local/bin/logspam still works. 
running puppet on mwlog*" [puppet] - 10https://gerrit.wikimedia.org/r/864834 (owner: 10Ahmon Dancy) [20:18:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [20:18:15] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [20:18:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance [20:18:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42279 and previous config saved to /var/cache/conftool/dbconfig/20221205-201831-ladsgroup.json [20:18:36] (03CR) 10Bking: [C: 03+2] prom: Add elasticsearch cluster name to exported latency metrics [puppet] - 10https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: 10Ebernhardson) [20:19:32] (03PS4) 10Dzahn: site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) [20:19:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42280 and previous config saved to /var/cache/conftool/dbconfig/20221205-202008-ladsgroup.json [20:20:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:20:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:20:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42281 and previous config 
saved to /var/cache/conftool/dbconfig/20221205-202029-ladsgroup.json [20:21:56] (03PS2) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) [20:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42282 and previous config saved to /var/cache/conftool/dbconfig/20221205-202250-ladsgroup.json [20:25:25] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts phab1001.eqiad.wmnet [20:25:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42283 and previous config saved to /var/cache/conftool/dbconfig/20221205-202528-ladsgroup.json [20:28:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42284 and previous config saved to /var/cache/conftool/dbconfig/20221205-202846-ladsgroup.json [20:28:50] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:59] (03PS1) 10Bartosz Dziewoński: Adjust to changes to redlink behavior from parsoid [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) [20:37:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42285 and previous config saved to /var/cache/conftool/dbconfig/20221205-203756-ladsgroup.json [20:38:23] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:40:20] RECOVERY - 
OSPF status on cr2-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42286 and previous config saved to /var/cache/conftool/dbconfig/20221205-204034-ladsgroup.json [20:43:00] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42287 and previous config saved to /var/cache/conftool/dbconfig/20221205-204352-ladsgroup.json [20:44:45] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [20:47:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002" [20:47:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:06] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab1001.eqiad.wmnet [20:47:18] (03CR) 10BryanDavis: [C: 04-1] Make "make" available in all images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [20:50:19] (03PS5) 10Dzahn: site: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) [20:51:02] (03CR) 10Dzahn: [C: 03+2] site: remove phab1001 
[puppet] - 10https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [20:51:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [20:53:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:53:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42288 and previous config saved to /var/cache/conftool/dbconfig/20221205-205303-ladsgroup.json [20:53:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:53:07] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:53:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [20:53:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42289 and previous config saved to /var/cache/conftool/dbconfig/20221205-205324-ladsgroup.json [20:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42290 and previous config saved to /var/cache/conftool/dbconfig/20221205-205537-ladsgroup.json [20:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42291 and previous config saved to /var/cache/conftool/dbconfig/20221205-205547-ladsgroup.json [20:55:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with 
reason: Maintenance [20:56:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [20:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42292 and previous config saved to /var/cache/conftool/dbconfig/20221205-205610-ladsgroup.json [20:56:42] jouncebot: next [20:56:42] In 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2100) [20:56:49] i will be 10 minutes late. sorry [20:57:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42293 and previous config saved to /var/cache/conftool/dbconfig/20221205-205735-ladsgroup.json [20:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42294 and previous config saved to /var/cache/conftool/dbconfig/20221205-205859-ladsgroup.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2100). Please do the needful. [21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:01:13] I can deploy in about 5m [21:02:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42295 and previous config saved to /var/cache/conftool/dbconfig/20221205-210220-ladsgroup.json [21:02:24] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [21:03:23] i'm here now [21:05:19] ack, here too [21:05:38] starting with 856552 [21:05:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:06:51] (03Merged) 10jenkins-bot: Use new DiscussionTools heading markup on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:07:07] !log samtar@deploy1002 Started scap: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]] [21:07:11] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:07:59] (03CR) 10Samtar: [C: 03+2] "deploy" [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: 10Bartosz Dziewoński) [21:08:50] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42296 and previous config saved to /var/cache/conftool/dbconfig/20221205-210855-ladsgroup.json [21:09:00] T323907: Make fr_user unsigned - 
https://phabricator.wikimedia.org/T323907 [21:09:15] MatmaRex: live on mwdebug2002, can you test? [21:09:29] looking [21:10:10] (03PS3) 10Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212) [21:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42297 and previous config saved to /var/cache/conftool/dbconfig/20221205-211045-ladsgroup.json [21:11:01] TheresNoTime: looks good [21:11:08] syncin' [21:11:30] (03PS4) 10Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212) [21:12:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42298 and previous config saved to /var/cache/conftool/dbconfig/20221205-211242-ladsgroup.json [21:14:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42299 and previous config saved to /var/cache/conftool/dbconfig/20221205-211405-ladsgroup.json [21:14:09] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:14:27] (03PS6) 10Dzahn: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:17:03] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]] (duration: 09m 55s) [21:17:06] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:17:07] 
10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) 05Resolved→03Open It's back :( ` cr2-esams> show system alarms 1 alarms currently active Alarm time Class Description 2022-12-05 18:15:58 UTC Minor FPC 0 M... [21:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42300 and previous config saved to /var/cache/conftool/dbconfig/20221205-211727-ladsgroup.json [21:17:31] MatmaRex: that should be live now, just waiting on 864724 to merge [21:17:42] cool, thanks [21:19:32] (03CR) 10Ahmon Dancy: "Thanks Dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/864834 (owner: 10Ahmon Dancy) [21:21:55] (03Merged) 10jenkins-bot: Adjust to changes to redlink behavior from parsoid [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: 10Bartosz Dziewoński) [21:22:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: 10Bartosz Dziewoński) [21:22:26] !log samtar@deploy1002 Started scap: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]] [21:22:29] T324352: Red links marked as uneditable in visual editor - https://phabricator.wikimedia.org/T324352 [21:23:33] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED [21:23:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42301 and previous config saved to 
/var/cache/conftool/dbconfig/20221205-212402-ladsgroup.json [21:24:07] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:24:11] MatmaRex: live on mwdebug [21:25:33] TheresNoTime: thanks, looks good as well [21:25:39] syncin' [21:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42302 and previous config saved to /var/cache/conftool/dbconfig/20221205-212552-ladsgroup.json [21:25:57] (03PS7) 10Dzahn: O:phabricator: move host based settings to role hiera per DC [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:26:55] TheresNoTime: while i have your attention, can you check on the status of this maintenance script run for me? https://phabricator.wikimedia.org/T315510#8392683 [21:27:06] looking.. [21:27:17] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10WMDE-leszek) I approve this request, thank you. 
[21:27:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42303 and previous config saved to /var/cache/conftool/dbconfig/20221205-212748-ladsgroup.json [21:27:51] (03PS8) 10Dzahn: O:phabricator: move host based settings to role hiera per DC [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:29:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:29:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:30:52] MatmaRex: I'm unsure how to check that, I see no tmux sessions running [21:30:58] (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [21:31:28] (I would say that's more a "me problem" than an indication that one isn't running though, cc taavi) [21:31:32] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]] (duration: 09m 05s) [21:31:35] T324352: Red links marked as uneditable in visual editor - https://phabricator.wikimedia.org/T324352 [21:31:44] (03PS9) 10Dzahn: O:phabricator: move host based settings to role hiere [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:31:47] hello [21:32:00] what do you need from me? 
[21:32:19] taavi: message from Matma/Rex in scrollback regarding checking progress of https://phabricator.wikimedia.org/T315510#8392683 [21:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42304 and previous config saved to /var/cache/conftool/dbconfig/20221205-213233-ladsgroup.json [21:32:36] I'm unsure how to do that, I see no running tmux sessions [21:32:44] (03Abandoned) 10Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [21:32:48] MatmaRex: it's halfway through incubatorwiki [21:32:58] MatmaRex: just to confirm, did you see https://phabricator.wikimedia.org/T315510#8427310? [21:33:16] thanks [21:33:21] yes [21:33:54] !log close UTC late backport window [21:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:55] (taavi: for reference, how should I have checked that?) [21:35:38] TheresNoTime: `w` on mwmaint1002? 
[21:36:59] d'oh :) [21:37:09] (ty) [21:38:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42305 and previous config saved to /var/cache/conftool/dbconfig/20221205-213908-ladsgroup.json [21:39:19] (03PS1) 10Dzahn: phabricator: set enable_vcs to false in main profile [puppet] - 10https://gerrit.wikimedia.org/r/864852 [21:40:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42306 and previous config saved to /var/cache/conftool/dbconfig/20221205-214058-ladsgroup.json [21:41:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [21:41:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:41:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [21:41:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42307 and previous config saved to /var/cache/conftool/dbconfig/20221205-214120-ladsgroup.json [21:42:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:42:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host cephosd1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:42:54] (03PS30) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [21:42:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42308 and previous config saved to /var/cache/conftool/dbconfig/20221205-214255-ladsgroup.json [21:42:56] (03PS1) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache: add gutter pools for /*/mw-wan keys [puppet] - 10https://gerrit.wikimedia.org/r/864853 (https://phabricator.wikimedia.org/T258779) [21:42:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [21:42:58] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "no change in compiler https://puppet-compiler.wmflabs.org/output/824412/38584/" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:43:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance [21:43:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:43:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [21:43:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1005.mgmt.eqiad.wmnet with reboot policy FORCED [21:43:32] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42309 and previous config saved to /var/cache/conftool/dbconfig/20221205-214332-ladsgroup.json [21:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42310 and previous config saved to /var/cache/conftool/dbconfig/20221205-214333-ladsgroup.json [21:43:53] (03CR) 10Dzahn: [V: 03+1 C: 03+2] O:phabricator: move host based settings to role hiere [puppet] - 10https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: 10Jbond) [21:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42311 and previous config saved to /var/cache/conftool/dbconfig/20221205-214558-ladsgroup.json [21:47:18] (03CR) 10Dzahn: [C: 03+1] "@marostegui phab1001 has been decom'ed and is shut down permanently. 
the grants for this IP can be revoked" [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [21:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42312 and previous config saved to /var/cache/conftool/dbconfig/20221205-214740-ladsgroup.json [21:47:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [21:47:44] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [21:47:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [21:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T323907)', diff saved to https://phabricator.wikimedia.org/P42313 and previous config saved to /var/cache/conftool/dbconfig/20221205-214801-ladsgroup.json [21:50:37] 10ops-eqiad, 10Phabricator, 10decommission-hardware, 10serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Dzahn) [21:51:09] 10ops-eqiad, 10Phabricator, 10decommission-hardware, 10serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Dzahn) [21:51:51] 10ops-eqiad, 10Phabricator, 10decommission-hardware, 10serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Dzahn) a:05Dzahn→03Jclark-ctr https://netbox.wikimedia.org/dcim/devices/1557/ has been permanently shut down [21:54:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42314 and previous config saved to /var/cache/conftool/dbconfig/20221205-215415-ladsgroup.json [21:54:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with 
reason: Maintenance [21:54:19] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [21:54:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [21:54:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42315 and previous config saved to /var/cache/conftool/dbconfig/20221205-215436-ladsgroup.json [21:55:14] !log deleting special DNS entries for "phab10010-vcs.eqiad.wmnet", IPv4 and IPv6 (Role: VIP), from netbox - T280597 [21:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:17] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [21:55:50] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42316 and previous config saved to /var/cache/conftool/dbconfig/20221205-215839-ladsgroup.json [21:58:56] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: deleted phab1001-vcs.eqiad.wmnet IPs - dzahn@cumin2002" [21:59:32] !log deleting special DNS entries for "phab10010-vcs.eqiad.wmnet", IPv4 and IPv6 (Role: VIP), from netbox and syncing netbox data - T296022 [21:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:35] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [21:59:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: deleted phab1001-vcs.eqiad.wmnet IPs - dzahn@cumin2002" [21:59:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:00:04] !log 
cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:00:05] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2200). [22:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42317 and previous config saved to /var/cache/conftool/dbconfig/20221205-220105-ladsgroup.json [22:05:24] (03Abandoned) 10Dzahn: phabricator: move some more settings from host file to common [puppet] - 10https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:05:28] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/864853/38588/" [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [22:06:09] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/output/864853/38588/" [puppet] - 10https://gerrit.wikimedia.org/r/864853 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [22:08:46] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:13:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42319 and previous config saved to /var/cache/conftool/dbconfig/20221205-221346-ladsgroup.json [22:16:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42320 and previous config saved to /var/cache/conftool/dbconfig/20221205-221612-ladsgroup.json [22:20:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
cephosd1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:20:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1005.mgmt.eqiad.wmnet with reboot policy FORCED [22:21:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1001.mgmt.eqiad.wmnet with reboot policy FORCED [22:24:50] !log removing 1 file for legal compliance [22:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42321 and previous config saved to /var/cache/conftool/dbconfig/20221205-222852-ladsgroup.json [22:28:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:28:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:28:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42322 and previous config saved to /var/cache/conftool/dbconfig/20221205-222903-ladsgroup.json [22:29:38] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:30:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42323 and previous config saved to /var/cache/conftool/dbconfig/20221205-223015-ladsgroup.json [22:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P42324 and previous config saved to /var/cache/conftool/dbconfig/20221205-223049-ladsgroup.json [22:30:53] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [22:31:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42325 and previous config saved to /var/cache/conftool/dbconfig/20221205-223119-ladsgroup.json [22:31:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [22:31:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [22:31:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42326 and previous config saved to /var/cache/conftool/dbconfig/20221205-223140-ladsgroup.json [22:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42328 and previous config saved to /var/cache/conftool/dbconfig/20221205-223406-ladsgroup.json [22:34:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:39:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42329 and previous config saved to /var/cache/conftool/dbconfig/20221205-223912-ladsgroup.json [22:39:16] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [22:40:33] 
!log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1001.mgmt.eqiad.wmnet with reboot policy FORCED [22:45:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42330 and previous config saved to /var/cache/conftool/dbconfig/20221205-224522-ladsgroup.json [22:45:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42331 and previous config saved to /var/cache/conftool/dbconfig/20221205-224555-ladsgroup.json [22:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42332 and previous config saved to /var/cache/conftool/dbconfig/20221205-224913-ladsgroup.json [22:54:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42333 and previous config saved to /var/cache/conftool/dbconfig/20221205-225419-ladsgroup.json [22:56:36] 10SRE, 10LDAP-Access-Requests, 10User-vaughnwalters, 10User-zeljkofilipin: Request for wmf group access for user: vwalters - https://phabricator.wikimedia.org/T324515 (10Jrbranaa) Approved [23:00:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42334 and previous config saved to /var/cache/conftool/dbconfig/20221205-230028-ladsgroup.json [23:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42335 and previous config saved to /var/cache/conftool/dbconfig/20221205-230102-ladsgroup.json [23:04:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42336 and previous config saved to 
/var/cache/conftool/dbconfig/20221205-230419-ladsgroup.json [23:09:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42337 and previous config saved to /var/cache/conftool/dbconfig/20221205-230925-ladsgroup.json [23:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42338 and previous config saved to /var/cache/conftool/dbconfig/20221205-231535-ladsgroup.json [23:15:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:15:39] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:15:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42339 and previous config saved to /var/cache/conftool/dbconfig/20221205-231556-ladsgroup.json [23:16:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323907)', diff saved to https://phabricator.wikimedia.org/P42340 and previous config saved to /var/cache/conftool/dbconfig/20221205-231608-ladsgroup.json [23:16:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:16:12] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:16:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:18:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42341 and 
previous config saved to /var/cache/conftool/dbconfig/20221205-231809-ladsgroup.json [23:19:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42342 and previous config saved to /var/cache/conftool/dbconfig/20221205-231926-ladsgroup.json [23:19:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [23:19:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [23:19:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T322618)', diff saved to https://phabricator.wikimedia.org/P42343 and previous config saved to /var/cache/conftool/dbconfig/20221205-231948-ladsgroup.json [23:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T322618)', diff saved to https://phabricator.wikimedia.org/P42344 and previous config saved to /var/cache/conftool/dbconfig/20221205-232113-ladsgroup.json [23:21:17] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:24:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42345 and previous config saved to /var/cache/conftool/dbconfig/20221205-232432-ladsgroup.json [23:24:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [23:24:36] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:24:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [23:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P42346 and previous config saved to /var/cache/conftool/dbconfig/20221205-232453-ladsgroup.json [23:33:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42347 and previous config saved to /var/cache/conftool/dbconfig/20221205-233316-ladsgroup.json [23:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42348 and previous config saved to /var/cache/conftool/dbconfig/20221205-233620-ladsgroup.json [23:41:54] !log removing 5 files for legal compliance [23:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323907)', diff saved to https://phabricator.wikimedia.org/P42349 and previous config saved to /var/cache/conftool/dbconfig/20221205-234425-ladsgroup.json [23:44:29] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:44:59] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1d3ba41]: import_cirrus: Update doc cleaning to match cirrus updates [23:47:30] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1d3ba41]: import_cirrus: Update doc cleaning to match cirrus updates (duration: 02m 30s) [23:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42350 and previous config saved to /var/cache/conftool/dbconfig/20221205-234822-ladsgroup.json [23:51:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42351 and previous config saved to /var/cache/conftool/dbconfig/20221205-235126-ladsgroup.json [23:52:33] (PS3) Jberkel: Make "make" available in all images
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) [23:55:42] (PS4) Jberkel: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) [23:56:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:57:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:57:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:57:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:57:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T323907)', diff saved to https://phabricator.wikimedia.org/P42352 and previous config saved to /var/cache/conftool/dbconfig/20221205-235724-ladsgroup.json [23:57:27] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:57:45] !log removing 2 files for legal compliance [23:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42353 and previous config saved to /var/cache/conftool/dbconfig/20221205-235932-ladsgroup.json