[00:47:04] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:16:22] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:16:38] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:16] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.730 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:45] <wikibugs>	 (PS2) Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816)
[02:25:47] <wikibugs>	 (PS1) Andrew Bogott: oslo_messaging_rabbit: kombu_reconnect_delay=0.1 [puppet] - https://gerrit.wikimedia.org/r/864321 (https://phabricator.wikimedia.org/T318816)
[02:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:54:06] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:56:08] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:39:38] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[04:41:40] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[05:13:44] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 242, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:18] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:19:18] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:31:24] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:39:28] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:45:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:02:53] <wikibugs>	 (PS1) AndyRussG: CentralNotice: Add wmflabs to banner preview CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055)
[06:12:03] <wikibugs>	 (PS1) Marostegui: db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/864328 (https://phabricator.wikimedia.org/T322988)
[06:14:09] <wikibugs>	 (CR) Marostegui: [C: +2] db2173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/864328 (https://phabricator.wikimedia.org/T322988) (owner: Marostegui)
[06:14:57] <wikibugs>	 SRE, ops-codfw, DBA, Patch-For-Review: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (Marostegui) Host being repooled automatically.
[06:16:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P42217 and previous config saved to /var/cache/conftool/dbconfig/20221205-061616-marostegui.json
[06:16:18] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:16:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 1%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42218 and previous config saved to /var/cache/conftool/dbconfig/20221205-061625-root.json
[06:17:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: After schema change', diff saved to https://phabricator.wikimedia.org/P42219 and previous config saved to /var/cache/conftool/dbconfig/20221205-061735-root.json
[06:19:02] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[06:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:27:36] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1206 to dbctl [puppet] - https://gerrit.wikimedia.org/r/864329
[06:28:55] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1206 to dbctl [puppet] - https://gerrit.wikimedia.org/r/864329 (owner: Marostegui)
[06:30:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 to dbctl (depooled)', diff saved to https://phabricator.wikimedia.org/P42220 and previous config saved to /var/cache/conftool/dbconfig/20221205-063020-marostegui.json
[06:31:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 5%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42221 and previous config saved to /var/cache/conftool/dbconfig/20221205-063130-root.json
[06:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:32:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P42222 and previous config saved to /var/cache/conftool/dbconfig/20221205-063240-root.json
[06:35:57] <wikibugs>	 (PS1) Marostegui: db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/864330
[06:37:13] <wikibugs>	 (CR) Marostegui: [C: +2] db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/864330 (owner: Marostegui)
[06:37:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with minimal weight', diff saved to https://phabricator.wikimedia.org/P42223 and previous config saved to /var/cache/conftool/dbconfig/20221205-063743-marostegui.json
[06:46:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42224 and previous config saved to /var/cache/conftool/dbconfig/20221205-064635-root.json
[06:47:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P42225 and previous config saved to /var/cache/conftool/dbconfig/20221205-064745-root.json
[06:51:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with minimal weight', diff saved to https://phabricator.wikimedia.org/P42226 and previous config saved to /var/cache/conftool/dbconfig/20221205-065151-marostegui.json
[07:01:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42227 and previous config saved to /var/cache/conftool/dbconfig/20221205-070140-root.json
[07:02:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P42228 and previous config saved to /var/cache/conftool/dbconfig/20221205-070250-root.json
[07:16:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42229 and previous config saved to /var/cache/conftool/dbconfig/20221205-071645-root.json
[07:17:15] <wikibugs>	 (PS1) Giuseppe Lavagetto: httpd-fcgi: allow logging ECS to a local rsyslog [docker-images/production-images] - https://gerrit.wikimedia.org/r/864547 (https://phabricator.wikimedia.org/T265876)
[07:17:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P42230 and previous config saved to /var/cache/conftool/dbconfig/20221205-071754-root.json
[07:22:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: allow rsyslog to process the apache logs [deployment-charts] - https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876)
[07:23:30] <wikibugs>	 (CR) CI reject: [V: -1] mediawiki: allow rsyslog to process the apache logs [deployment-charts] - https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) (owner: Giuseppe Lavagetto)
[07:25:12] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) The two attached patches implement proposal #3  Now we just need to create the appropriate topic, named `mediawiki.httpd.accesslog` on both ka...
[07:31:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42231 and previous config saved to /var/cache/conftool/dbconfig/20221205-073150-root.json
[07:33:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P42232 and previous config saved to /var/cache/conftool/dbconfig/20221205-073259-root.json
[07:38:53] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:39:43] <wikibugs>	 (PS1) Kosta Harlan: Fix ExpensiveUserImpact input validation [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312)
[07:39:56] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hiera: replace thanos-sso with thanos-web [puppet] - https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[07:42:47] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:44:00] <wikibugs>	 (PS31) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[07:46:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: After HW issues', diff saved to https://phabricator.wikimedia.org/P42233 and previous config saved to /var/cache/conftool/dbconfig/20221205-074655-root.json
[07:48:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P42234 and previous config saved to /var/cache/conftool/dbconfig/20221205-074804-root.json
[07:56:53] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1003.eqiad.wmnet,service=thanos-web
[07:57:00] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web
[07:58:11] <wikibugs>	 (PS1) Abijeet Patro: Deprecate PersonalUrls hook [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T0800).
[08:00:04] <jouncebot>	 kart_ and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:38] <kostajh>	 hi
[08:00:40] * kart_ is here
[08:00:53] <kostajh>	 I'll be back in ~10 minutes, so kart_ you should go ahead with your patches
[08:01:21] <kart_>	 kostajh: Sure.
[08:01:41] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825) (owner: KartikMistry)
[08:02:27] <wikibugs>	 (Merged) jenkins-bot: testwiki: Enable Section Translation for 15 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825) (owner: KartikMistry)
[08:02:48] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]]
[08:02:54] <stashbot>	 T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177
[08:02:54] <stashbot>	 T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825
[08:05:06] <dcausse>	 !log restarting blazegraph on wdqs1004 (stuck with 2000+ threads, T242453)
[08:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:10] <stashbot>	 T242453: Detect and alert and/or remediate Blazegraph deadlocks - https://phabricator.wikimedia.org/T242453
[08:06:53] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.094 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:07:01] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:08:54] <wikibugs>	 (CR) Abijeet Patro: "Needed for: Iecec234232f2a17e528625b2e21155fc66b5f30b" [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017) (owner: Abijeet Patro)
[08:10:12] <kostajh>	 (back)
[08:11:39] <kart_>	 Scap seems slow?
[08:11:52] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:11:56] <stashbot>	 T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177
[08:11:56] <stashbot>	 T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825
[08:11:57] <claime>	 Probably building the mw docker image
[08:12:09] <wikibugs>	 (Abandoned) Abijeet Patro: Deprecate PersonalUrls hook [extensions/LiquidThreads] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/864671 (https://phabricator.wikimedia.org/T310017) (owner: Abijeet Patro)
[08:12:50] <claime>	 Or had it already passed that step kart_?
[08:13:49] <kart_>	 yeah. docker image build seems slow. Now, deploying..
[08:14:14] <claime>	 kart_: it took ~3 minutes
[08:14:22] <claime>	 You can check /home/kartik/scap-image-build-and-push-log
[08:14:50] <kostajh>	 out of curiosity, is that docker image used somewhere?
[08:15:08] <claime>	 kostajh: X-mw-debug set to k8s-experimental
[08:15:15] <kostajh>	 ack
[08:15:22] <claime>	 It sends you to a k8s deployment of mediawiki
[08:16:19] <claime>	 We're currently working on all that mediawiki on kubernetes jazz :)
[08:17:32] <kart_>	 Nice!
[08:19:42] <wikibugs>	 (PS2) KartikMistry: Enable Section Translation on 8 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176)
[08:20:14] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:862412|testwiki: Enable Section Translation for 15 Wikipedias (T323825 T319177)]] (duration: 17m 25s)
[08:20:18] <stashbot>	 T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177
[08:20:19] <stashbot>	 T323825: Enable Content and Section translation on 9 Wikipedias - https://phabricator.wikimedia.org/T323825
[08:20:49] <kart_>	 On second patch..
[08:21:51] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-2003.codfw.wmnet,service=thanos-web
[08:21:57] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-2002.codfw.wmnet,service=thanos-web
[08:22:39] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-web,name=eqiad
[08:23:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42236 and previous config saved to /var/cache/conftool/dbconfig/20221205-082320-marostegui.json
[08:24:00] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfw.wmnet,service=thanos-web
[08:24:05] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web
[08:24:32] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176) (owner: KartikMistry)
[08:25:31] <wikibugs>	 (Merged) jenkins-bot: Enable Section Translation on 8 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176) (owner: KartikMistry)
[08:25:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] postgresql::server: Add bookworm support [puppet] - https://gerrit.wikimedia.org/r/863286 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[08:25:44] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]]
[08:25:47] <stashbot>	 T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176
[08:27:29] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:29:57] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-web,name=eqiad
[08:30:40] <wikibugs>	 (PS2) Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - https://gerrit.wikimedia.org/r/863255
[08:33:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmnet: remove thanos-sso [dns] - https://gerrit.wikimedia.org/r/862939 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[08:35:41] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:863097|Enable Section Translation on 8 Wikipedias (T319176)]] (duration: 09m 57s)
[08:35:45] <stashbot>	 T319176: Enable Section Translation on 9 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319176
[08:36:30] <kart_>	 kostajh: I'm done. It took longer than I expected.
[08:36:51] <kostajh>	 kart_: no worries. I'll get started with mine
[08:37:14] <wikibugs>	 (PS9) Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686)
[08:37:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: Kosta Harlan)
[08:38:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/863255 (owner: Muehlenhoff)
[08:38:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4788
[08:38:36] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops, Patch-For-Review: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Clement_Goubert) a:Clement_Goubert
[08:38:40] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: Kosta Harlan)
[08:38:47] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]]
[08:38:50] <stashbot>	 T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686
[08:39:18] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913)
[08:40:31] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:42:23] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38576/console" [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[08:42:27] <kostajh>	 syncing the config patch
[08:43:15] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4788
[08:44:00] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "See inline, LGTM overall though" [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[08:46:47] <wikibugs>	 (PS2) Muehlenhoff: presto: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013)
[08:47:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38623
[08:48:14] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:859991|GrowthExperiments: End imagerecommendation experiment (T323686)]] (duration: 09m 26s)
[08:48:16] <stashbot>	 T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686
[08:48:21] <kostajh>	 on to the wmf.12 patch
[08:48:32] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38623
[08:48:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312) (owner: Kosta Harlan)
[08:49:20] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 55818
[08:49:23] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:50:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 55818
[08:51:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136907
[08:51:25] <wikibugs>	 (Abandoned) WMDE-Fisch: Clean up suggested values setting [mediawiki-config] - https://gerrit.wikimedia.org/r/740765 (owner: Awight)
[08:52:02] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136907
[08:52:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42237 and previous config saved to /var/cache/conftool/dbconfig/20221205-085235-marostegui.json
[08:53:04] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 52580
[08:53:19] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:54:16] <wikibugs>	 (CR) Muehlenhoff: [C: +2] presto: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:54:26] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52580
[08:55:07] <wikibugs>	 (PS2) Muehlenhoff: envoy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013)
[08:55:23] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 141731
[08:56:53] <wikibugs>	 (CR) Clément Goubert: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: Clément Goubert)
[08:58:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] envoy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:58:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141731
[08:58:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 58308
[08:59:24] <wikibugs>	 (PS6) Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - https://gerrit.wikimedia.org/r/856996
[09:00:20] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 58308
[09:00:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 59689
[09:00:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 59689
[09:02:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42238 and previous config saved to /var/cache/conftool/dbconfig/20221205-090214-marostegui.json
[09:02:16] <kostajh>	 still going with the backport
[09:04:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - https://gerrit.wikimedia.org/r/856996 (owner: Muehlenhoff)
[09:04:48] <wikibugs>	 (Merged) jenkins-bot: Fix ExpensiveUserImpact input validation [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864666 (https://phabricator.wikimedia.org/T324312) (owner: Kosta Harlan)
[09:05:04] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]]
[09:05:07] <stashbot>	 T324312: Exception executing job: refreshUserImpactJob Wikimedia\Assert\ParameterKeyTypeException: Bad value for parameter $json['dailyArticleViews']: all elements must have string keys - https://phabricator.wikimedia.org/T324312
[09:05:44] <wikibugs>	 (PS6) Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282)
[09:05:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:40] <wikibugs>	 (PS1) Kosta Harlan: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619)
[09:06:50] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[09:06:55] <wikibugs>	 (PS1) Kosta Harlan: User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619)
[09:09:01] <kostajh>	 syncing. As there's nothing coming up, I'm going to sync two more patches
[09:12:16] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert)
[09:12:29] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) a:Clement_Goubert→None
[09:13:23] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops-radar: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert)
[09:14:14] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864666|Fix ExpensiveUserImpact input validation (T324312)]] (duration: 09m 10s)
[09:14:17] <stashbot>	 T324312: Exception executing job: refreshUserImpactJob Wikimedia\Assert\ParameterKeyTypeException: Bad value for parameter $json['dailyArticleViews']: all elements must have string keys - https://phabricator.wikimedia.org/T324312
[09:15:26] <wikibugs>	 (PS2) Kosta Harlan: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619)
[09:15:28] <logmsgbot>	 !log kharlan@deploy1002 backport aborted:  (duration: 00m 25s)
[09:15:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[09:15:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42239 and previous config saved to /var/cache/conftool/dbconfig/20221205-091547-marostegui.json
[09:16:00] <wikibugs>	 (PS2) Kosta Harlan: User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619)
[09:16:11] <wikibugs>	 (PS1) Muehlenhoff: uwsgi: Add support for bookworm [puppet] - https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783)
[09:28:53] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops-radar: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (fgiunchedi) Thank you for reaching out @Clement_Goubert ! re: topic creation IIRC is open (i.e. topic will be auto-created on...
[09:31:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] icinga: decom mgmt monitoring [puppet] - https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi)
[09:32:10] <wikibugs>	 (PS3) Filippo Giunchedi: icinga: decom mgmt monitoring [puppet] - https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266)
[09:32:20] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] icinga: decom mgmt monitoring [puppet] - https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: Filippo Giunchedi)
[09:36:26] <wikibugs>	 (CR) CI reject: [V: -1] User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[09:36:51] <moritzm>	 !log installing freetype security updates
[09:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:24] <wikibugs>	 (Merged) jenkins-bot: User impact: Show discovery tour to desktop users who had old module [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864712 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[09:37:40] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]]
[09:37:43] <stashbot>	 T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619
[09:38:11] <godog>	 !log force a puppet run on physical hosts to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/860572
[09:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:08] <moritzm>	 !log restarting mediawiki canaries to pick up freetype security updates
[09:39:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:31] <wikibugs>	 (Abandoned) WMDE-Fisch: Rely on the default value for $wgFileExporterTarget [mediawiki-config] - https://gerrit.wikimedia.org/r/762392 (owner: Awight)
[09:42:51] <wikibugs>	 (PS2) Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093)
[09:45:02] <wikibugs>	 (CR) Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[09:45:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:50:40] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[09:50:45] <stashbot>	 T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619
[09:50:57] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) As [[ https://phabricator.wikimedia.org/T265876#6559439 | noted in the parent task]], and quite an important infor...
[09:51:28] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[09:51:50] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert)
[09:52:12] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one typo inline" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/859453 (owner: Slyngshede)
[09:52:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:52:21] <wikibugs>	 (CR) CI reject: [V: -1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/859453 (owner: Slyngshede)
[09:52:30] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[09:53:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:53:44] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:54:23] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[09:54:24] <wikibugs>	 (CR) Klausman: [C: +1] api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[09:54:46] <kostajh>	 checking the patch
[09:56:02] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:57:13] <kostajh>	 syncing
[09:57:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] Remove php 7.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/839324 (owner: Giuseppe Lavagetto)
[09:57:48] <kostajh>	 on to the last patch 😅
[10:05:13] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864712|User impact: Show discovery tour to desktop users who had old module (T323619)]] (duration: 27m 33s)
[10:05:17] <stashbot>	 T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619
[10:05:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[10:06:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42240 and previous config saved to /var/cache/conftool/dbconfig/20221205-100607-marostegui.json
[10:06:22] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:02] <kostajh>	 claime: do you know if the mw/k8s docker image building process is newly added to scap backport? Perhaps we should give a heads up to folks doing backports that it takes X minutes longer than it used to, for planning
[10:07:11] <wikibugs>	 (CR) Hashar: [C: +1] zuul: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:07:20] <claime>	 Yes, it is new, it was turned on last week
[10:08:31] <claime>	 kostajh: You're right, we should, _joe_ ideas on how to do that?
[10:09:20] <_joe_>	 kostajh: yes, I think release engineering should send an announcement out to ops@ or wikitech-l
[10:09:40] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:10:20] <claime>	 These wikifeeds flaps
[10:10:23] <wikibugs>	 (CR) JMeybohm: [C: +1] Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[10:10:38] <_joe_>	 claime: I suspect the problem is pretty specific to that page, I'll verify
[10:10:54] <claime>	 ack thanks
[10:11:08] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[10:11:10] <claime>	 FYI there's a task https://phabricator.wikimedia.org/T324412
[10:12:14] <_joe_>	 curl https://wikifeeds.svc.codfw.wmnet:4101/en.wikipedia.org/v1/page/featured/2022/12/04 works flawlessly
[10:12:37] <claime>	 Can it be because we're requesting feeds from 6 years ago?
[10:12:47] <claime>	 April 29, 2016
[10:12:52] <_joe_>	 dunno, it's now working correctly as well
[10:14:49] <Emperor>	 !log rebalance thanos rings T311690
[10:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:52] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[10:21:52] <wikibugs>	 (Merged) jenkins-bot: User impact: Show discovery notice to mobile users [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864713 (https://phabricator.wikimedia.org/T323619) (owner: Kosta Harlan)
[10:22:08] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]]
[10:22:11] <stashbot>	 T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619
[10:22:50] <icinga-wm>	 RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:23:48] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[10:24:09] <kostajh>	 verifying patch
[10:24:54] <wikibugs>	 (CR) JMeybohm: [C: +1] calico / dragonfly: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:25:56] <kostajh>	 syncing
[10:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:27:24] <wikibugs>	 (PS1) Giuseppe Lavagetto: New php version [docker-images/production-images] - https://gerrit.wikimedia.org/r/864726
[10:28:40] <wikibugs>	 (CR) Clément Goubert: [C: +1] New php version [docker-images/production-images] - https://gerrit.wikimedia.org/r/864726 (owner: Giuseppe Lavagetto)
[10:30:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42241 and previous config saved to /var/cache/conftool/dbconfig/20221205-103028-marostegui.json
[10:31:38] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864713|User impact: Show discovery notice to mobile users (T323619)]] (duration: 09m 30s)
[10:31:41] <stashbot>	 T323619: NewImpact: Introduce new design to existing newcomers - https://phabricator.wikimedia.org/T323619
[10:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:32:40] <godog>	 !log contint1001 - racadm serveraction powercycle - crashed
[10:32:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:05] <kostajh>	 !log UTC morning deploys done
[10:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] New php version [docker-images/production-images] - https://gerrit.wikimedia.org/r/864726 (owner: Giuseppe Lavagetto)
[10:35:12] <icinga-wm>	 RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[10:39:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:41:07] <wikibugs>	 (PS2) Muehlenhoff: calico / dragonfly: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013)
[10:41:18] <wikibugs>	 (CR) CI reject: [V: -1] calico / dragonfly: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:43:19] <wikibugs>	 (PS1) Filippo Giunchedi: utils: autodetect hiera directory in role_team_stats.py [puppet] - https://gerrit.wikimedia.org/r/864727
[10:43:29] <wikibugs>	 (CR) CI reject: [V: -1] utils: autodetect hiera directory in role_team_stats.py [puppet] - https://gerrit.wikimedia.org/r/864727 (owner: Filippo Giunchedi)
[10:44:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:46:15] <godog>	 mmhh I'm wondering if my rebooting a crashed contint1001 has anything to do with those -1s
[10:47:30] <claime>	 godog: jenkins the butler going mad and -1'ing everything
[10:49:23] <icinga-wm>	 RECOVERY - Check for large files in client bucket on deploy1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[10:50:53] <wikibugs>	 (PS3) Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326)
[10:51:02] <wikibugs>	 (CR) CI reject: [V: -1] api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[10:53:59] <_joe_>	 yeah something is very wrong in jenkins
[10:54:07] <_joe_>	 hashar: ^^
[10:54:34] <_joe_>	 or zuul
[10:54:36] <hashar>	 hi
[10:54:47] <hashar>	 what are the symptoms?
[10:55:12] <_joe_>	 -1's with the message
[10:55:15] <_joe_>	 This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset
[10:55:28] <_joe_>	 and pipelinebot didn't pick up https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/864728
[10:55:33] <hashar>	 that is the zuul-merger failing to merge the proposed patchset against the tip of the branch
[10:55:35] <_joe_>	 and there's the same error message
[10:55:45] <hashar>	 usually due to a merge conflict, and sometimes because the zuul-merger is confused/broken
[10:55:54] <_joe_>	 I would assume it's the latter
[10:56:23] <godog>	 notably I found contint1001 crashed and rebooted it, I'm wondering if that's related ?
[10:56:42] <_joe_>	 godog: zuul-merger should run from contint2001
[10:56:51] <hashar>	 GitCommandError: Cmd('git') failed due to: exit code(128)
[10:56:51] <hashar>	   cmdline: git fetch --force --tags -v origin
[10:56:51] <hashar>	   stderr: 'fatal: Could not read from remote repository.
[10:57:15] <godog>	 _joe_: I'm aware, I mentioned it just in case
[10:57:18] <hashar>	 it is a known issue, some connection got stuck
[10:57:37] <_joe_>	 ok so the solution is to kick zuul-merger?
[10:59:09] <wikibugs>	 (PS1) Volans: cumin: add an audit report for insetup servers [puppet] - https://gerrit.wikimedia.org/r/864729
[10:59:19] <wikibugs>	 (CR) CI reject: [V: -1] cumin: add an audit report for insetup servers [puppet] - https://gerrit.wikimedia.org/r/864729 (owner: Volans)
[10:59:22] <hashar>	 _joe_: I am looking for the task that has the fix
[11:00:35] <wikibugs>	 (CR) Clément Goubert: [C: +1] httpd-fcgi: allow logging ECS to a local rsyslog [docker-images/production-images] - https://gerrit.wikimedia.org/r/864547 (https://phabricator.wikimedia.org/T265876) (owner: Giuseppe Lavagetto)
[11:03:12] <wikibugs>	 (CR) Gehel: Elastic: Use OS major version for GC flags (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[11:05:04] <_joe_>	 hashar: should I look into it?
[11:06:05] <_joe_>	 I see you just restarted zuul
[11:07:23] <hashar>	 !log Restarted Zuul to clear a stuck ssh connection with Gerrit - T309376
[11:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:26] <stashbot>	 T309376: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376
[11:09:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o
[11:09:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o
[11:10:06] <_joe_>	 hashar: any idea how I can get pipelinebot to pick up my change?
[11:11:06] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuration (13 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[11:11:41] <hashar>	 _joe_: so the issue is that sometimes the ssh connection from Zuul to Gerrit gets stuck indefinitely, which keeps an ssh response thread busy on the Gerrit side
[11:11:59] <hashar>	 that counts against the limit of 4 ssh connections per user and breaks the world
[11:12:14] <hashar>	 the "fix" is to restart Zuul entirely to clear the faulty connection
[11:12:27] <hashar>	 for PipelineBot, I guess a `recheck` on the change would be sufficient?
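Note on the failure mode hashar describes above: the sketch below is a toy model only, not Gerrit or Zuul code. It assumes a hard cap of 4 concurrent SSH connections per user (the figure quoted in the chat) and that a stuck connection is never released. The names `SshConnectionPool`, `fetch` and `restart_client` are hypothetical and exist only for illustration.

```python
class SshConnectionPool:
    """Toy model of a per-user SSH connection cap (limit of 4 taken from the chat)."""

    def __init__(self, limit=4):
        self.limit = limit
        self.active = set()   # ids of connections currently counted against the cap
        self._next_id = 0

    def open(self):
        # The server refuses new sessions once the user holds `limit` connections.
        if len(self.active) >= self.limit:
            raise ConnectionRefusedError("per-user SSH connection limit reached")
        conn_id = self._next_id
        self._next_id += 1
        self.active.add(conn_id)
        return conn_id

    def close(self, conn_id):
        self.active.discard(conn_id)

    def restart_client(self):
        """Drop every connection, stuck or not (models the 'restart Zuul' workaround)."""
        self.active.clear()


def fetch(pool, stuck=False):
    """Open a connection for one fetch; a stuck fetch never releases its slot."""
    conn = pool.open()
    if not stuck:
        pool.close(conn)
    return conn
```

In this model healthy fetches never accumulate, but each stuck one permanently occupies a slot. Once all slots are held, every later `fetch` is refused until `restart_client()` clears them, which would match the "Could not read from remote repository" symptom followed by recovery after the restart.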
[11:12:47] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) Volume recommendation is apparently ~2/3k mps/partition, so we may want 5 partitions, not considering broker equil...
[11:13:55] <claime>	 Amir1: Got a minute to +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/861813 so we're done with it?
[11:14:19] <hashar>	 oh
[11:14:27] <hashar>	 _joe_: I will trigger the postmerge job
[11:15:12] <Amir1>	 sure thing
[11:15:31] <claime>	 <3
[11:17:13] <wikibugs>	 (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/864729 (owner: Volans)
[11:18:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42242 and previous config saved to /var/cache/conftool/dbconfig/20221205-111836-marostegui.json
[11:19:06] <wikibugs>	 (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/864727 (owner: Filippo Giunchedi)
[11:21:14] <wikibugs>	 SRE, MW-on-K8s, observability, serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (Clement_Goubert) p:Triage→Medium
[11:22:14] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:22:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] uwsgi: Add support for bookworm [puppet] - https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[11:22:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/output/864664/38579/" [puppet] - https://gerrit.wikimedia.org/r/864664 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[11:24:03] <_joe_>	 hashar: <3
[11:24:58] <wikibugs>	 (CR) Hnowlan: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[11:29:39] <hashar>	 _joe_: you are welcome and sorry for the mess :\
[11:30:10] <wikibugs>	 (PS4) Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326)
[11:30:40] <wikibugs>	 (PS3) Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - https://gerrit.wikimedia.org/r/863255
[11:31:00] <moritzm>	 !log installing librsvg bugfix updates from buster point release
[11:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:19] <wikibugs>	 (CR) Ladsgroup: [C: +1] mediawiki::maintenance::campaignevents: meta [puppet] - https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert)
[11:31:58] <wikibugs>	 (CR) Clément Goubert: [V: +1 C: +2] mediawiki::maintenance::campaignevents: meta [puppet] - https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: Clément Goubert)
[11:34:15] <wikibugs>	 SRE, Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (LSobanski) Open→Resolved I updated L3 to reflect the suggestion.
[11:36:29] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:37:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1206 with more weight', diff saved to https://phabricator.wikimedia.org/P42243 and previous config saved to /var/cache/conftool/dbconfig/20221205-113746-marostegui.json
[11:38:28] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[11:40:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: shellbox: bump image version, move to 4.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/864734
[11:43:07] <wikibugs>	 (Merged) jenkins-bot: api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[11:45:37] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] shellbox: bump image version, move to 4.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/864734 (owner: Giuseppe Lavagetto)
[11:49:27] <wikibugs>	 SRE, Infrastructure-Foundations: move human users out of UID range for system accounts - https://phabricator.wikimedia.org/T114446 (LSobanski) The list Daniel posted above is still more or less accurate and the originally stated question is still valid.
[11:49:33] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[11:50:00] <wikibugs>	 (Merged) jenkins-bot: shellbox: bump image version, move to 4.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/864734 (owner: Giuseppe Lavagetto)
[11:50:53] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[11:51:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[11:51:29] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/863255 (owner: Muehlenhoff)
[11:52:20] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[11:53:05] <_joe_>	 taavi: I am deploying shellbox-score today with php 7.4, and tomorrow I'll deploy the rest of them if everything goes well
[11:53:23] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[11:58:54] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[11:59:25] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[11:59:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[11:59:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[12:00:37] <wikibugs>	 (PS1) Hnowlan: api-gateway: bump chart [deployment-charts] - https://gerrit.wikimedia.org/r/864737
[12:02:24] <wikibugs>	 (CR) Muehlenhoff: [C: +2] calico / dragonfly: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:03:41] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:04:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[12:05:29] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:25:09] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/864737 (owner: 10Hnowlan)
[12:26:29] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:28:42] <claime>	 Regarding the wikifeeds flaps it seems it's always the same pod
[12:28:49] <claime>	 It's the only one with events
[12:29:01] <claime>	 I'll scratch it and make helm recreate it
[12:29:25] <wikibugs>	 (03PS1) 10Slyngshede: Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753
[12:29:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede)
[12:30:23] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/864737 (owner: 10Hnowlan)
[12:31:27] <wikibugs>	 (03PS2) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453
[12:31:29] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:31:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[12:32:59] <wikibugs>	 (03CR) 10Slyngshede: Configuration: Add support for setting connection timeout. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[12:33:11] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[12:34:17] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10MoritzMuehlenhoff) Switching to contint1002 would also be a good opportunity to migrate to Bullseye (which per https://wikitech.wikimedia.org/wiki/Op...
[12:34:55] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[12:35:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[12:36:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "recheck" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[12:38:06] <wikibugs>	 (03PS2) 10Muehlenhoff: zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013)
[12:39:42] <logmsgbot>	 !log root@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:40:29] <wikibugs>	 (03PS2) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766
[12:40:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 (owner: 10David Caro)
[12:41:06] <logmsgbot>	 !log root@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:41:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:41:20] <logmsgbot>	 !log root@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[12:41:50] <logmsgbot>	 !log root@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[12:44:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Add AntiCompositeNumber to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/864757
[12:45:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] Promote Cassandra 3.11.13 to '3.x' (aka stable) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[12:46:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add AntiCompositeNumber to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/864757 (owner: 10Muehlenhoff)
[12:46:41] <wikibugs>	 (03PS3) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766
[12:49:26] <wikibugs>	 (03PS4) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766
[12:50:24] <moritzm>	 !log installing python-keystoneauth1 bugfix updates from Buster 10.13 point release
[12:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:15] <wikibugs>	 (03PS1) 10JMeybohm: helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706)
[12:56:17] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:58:05] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[12:59:04] <claime>	 Well apparently killing and recreating the pod that was bugging out didn't fix it
[13:02:53] <wikibugs>	 (03PS3) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453
[13:03:44] <claime>	 Amir1: Going to lunch
[13:03:45] <wikibugs>	 (03PS1) 10JMeybohm: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - 10https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706)
[13:03:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[13:03:56] <Amir1>	 noted
[13:04:12] <claime>	 Back in ~1h
[13:04:32] <claime>	 I'm taking the pager, if there's anything I'll come back
[13:07:19] <wikibugs>	 (03PS4) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453
[13:10:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Ottomata) Thanks.  Not sure what is going on, but I found some things you could try in [[ https://unix.stackexchange.com/questions/416166/cant-establish-s...
[13:11:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[13:12:51] <moritzm>	 !log installing libnet-ssleay-perl bugfix updates from Buster 10.13 point release
[13:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:11] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans)
[13:16:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[13:17:07] <moritzm>	 !log installing distro-info-data bugfix updates from Buster 10.13 point release
[13:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:22] <wikibugs>	 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi)
[13:18:43] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans)
[13:21:12] <wikibugs>	 (03CR) 10Ottomata: Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[13:23:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[13:24:41] <moritzm>	 !log installing postgresql-common bugfix updates from Buster 10.13 point release
[13:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:57] <TheresNoTime>	 jouncebot: nowandnext
[13:25:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[13:25:58] <jouncebot>	 In 0 hour(s) and 4 minute(s): Run fixMergeHistoryCorruption.php (T302486) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1330)
[13:26:00] <stashbot>	 T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[13:27:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[13:30:05] <jouncebot>	 TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Run fixMergeHistoryCorruption.php (T302486) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1330).
[13:30:34] <wikibugs>	 (03PS2) 10Slyngshede: Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753
[13:31:45] <TheresNoTime>	 !log T302486 : [samtar@mwmaint1002 ~]$ mwscript maintenance/fixMergeHistoryCorruption.php --wiki enwiki --ns 828 --delete
[13:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:49] <stashbot>	 T302486: Run fixMergeHistoryCorruption.php on affected wikis - https://phabricator.wikimedia.org/T302486
[13:32:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 55818
[13:33:07] <wikibugs>	 (03PS1) 10Volans: setup.py: temporary fix for test dependencies [software/cumin] - 10https://gerrit.wikimedia.org/r/864764
[13:34:21] <wikibugs>	 (03CR) 10Volans: "Thanks for the patch. Do you have in mind any specific use case where this will be needed?" [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah)
[13:42:13] <wikibugs>	 (03CR) 10Volans: "dhinus, dcaro: do you have any objection to merge this in its current status? Do you need more time to have a look?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[13:42:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T324180
[13:42:49] <stashbot>	 T324180: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T324180
[13:43:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:43:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T324180
[13:43:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2105 with weight 0 T324180', diff saved to https://phabricator.wikimedia.org/P42245 and previous config saved to /var/cache/conftool/dbconfig/20221205-134346-ladsgroup.json
[13:44:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 55818
[13:45:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:48:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:01] <wikibugs>	 (03PS1) 10Stang: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393)
[13:51:30] <dcausse>	 !log repooling wdqs1004
[13:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:17] <icinga-wm>	 PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:07] <Amir1>	 sukhe: good morning, is that known ^
[13:54:02] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180) (owner: 10Gerrit maintenance bot)
[13:54:09] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2105 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180) (owner: 10Gerrit maintenance bot)
[13:54:59] <Amir1>	 !log Starting s3 codfw failover from db2127 to db2105 - T324180
[13:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:02] <stashbot>	 T324180: Switchover s3 master (db2127 -> db2105) - https://phabricator.wikimedia.org/T324180
[13:55:26] * claime back
[13:55:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2105 to s3 primary T324180', diff saved to https://phabricator.wikimedia.org/P42246 and previous config saved to /var/cache/conftool/dbconfig/20221205-135539-ladsgroup.json
[13:59:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2127 T324180', diff saved to https://phabricator.wikimedia.org/P42247 and previous config saved to /var/cache/conftool/dbconfig/20221205-135932-ladsgroup.json
[14:00:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1400). nyaa~
[14:00:05] <jouncebot>	 guerganaWMDE and cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:07] <guerganaWMDE>	 o/
[14:00:28] <cirno>	 o/
[14:00:49] <TheresNoTime>	 (best jouncebot message)
[14:01:23] <guerganaWMDE>	 Ok, i will be here
[14:02:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:02:50] <TheresNoTime>	 If no deployers are available in 5 mins I can deploy 
[14:07:19] <TheresNoTime>	 I will deploy
[14:07:46] <TheresNoTime>	 guerganaWMDE: starting with yours
[14:07:47] <guerganaWMDE>	 \o/
[14:08:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[14:08:08] <guerganaWMDE>	 im ready
[14:08:24] <sukhe>	 Amir1: thanks! and no, not expected. checking
[14:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:08:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:08:44] <wikibugs>	 (03Merged) 10jenkins-bot: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[14:08:59] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]]
[14:09:02] <stashbot>	 T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282
[14:10:02] <TheresNoTime>	 `scap backport` has had an update?
[14:10:30] <TheresNoTime>	 (no idea what `build-and-push-container-images` is)
[14:10:49] <claime>	 Building and pushing the mediawiki container image for mw-on-k8s
[14:11:10] <TheresNoTime>	 woah, that's happening?
[14:11:22] <guerganaWMDE>	 will you let me know which debug server i have to use?
[14:12:41] <TheresNoTime>	 guerganaWMDE: I will :) it is still doing this new step, not sure how long it will take
[14:13:12] <TheresNoTime>	 claime: is this a long process?
[14:13:28] <claime>	 Shouldn't take more than 3 minutes usually
[14:13:47] <guerganaWMDE>	 thanks! sure, i await instructions
[14:14:13] <TheresNoTime>	 Okay, at 5m currently (would be nice if it didn't redirect output but understand why it *does*)
[14:15:08] <wikibugs>	 10SRE, 10Traffic: Drop the VarnishTrafficDrop and HAProxyEdgeTrafficDrop alerts - https://phabricator.wikimedia.org/T322220 (10fgiunchedi)
[14:15:32] <TheresNoTime>	 (took 6m, all ok)
[14:18:02] <logmsgbot>	 !log samtar@deploy1002 samtar and gtzatchkova: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[14:18:05] <stashbot>	 T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282
[14:18:20] <TheresNoTime>	 guerganaWMDE: live on mwdebug, use mwdebug2001 :)
[14:18:26] <claime>	 6 minutes is kinda long... just pushing the image took 4 minutes
[14:19:04] <guerganaWMDE>	 ok, let me check if the change is there, one second
[14:19:06] <icinga-wm>	 RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:09] <TheresNoTime>	 ack
[14:19:46] <TheresNoTime>	 claime: me being impatient but so far for a config change this is significantly slower.. 
[14:19:57] <claime>	 Yes, I agree
[14:19:59] <guerganaWMDE>	 it works!!! thanks!
[14:20:10] <TheresNoTime>	 guerganaWMDE: great, syncing
[14:20:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10ssingh) >>! In T324334#8439896, @Volans wrote: > Sorry for the trouble, that was me indeed, I've fixed the permissions and run the `sre.dns.netbox` cookbook successfully: >  >...
[14:20:44] <wikibugs>	 (03PS2) 10Samtar: logos: icon could be not square [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang)
[14:22:18] <wikibugs>	 (03PS2) 10Samtar: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang)
[14:23:14] <wikibugs>	 (03PS3) 10Elukey: knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793)
[14:23:50] <wikibugs>	 (03CR) 10Elukey: "Updated all suggestions, and also added Build-Depends where needed :) Thanks!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey)
[14:23:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede)
[14:24:16] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede)
[14:25:20] <wikibugs>	 (03Merged) 10jenkins-bot: Remove dependency on LDAP container. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/864753 (owner: 10Slyngshede)
[14:25:48] <TheresNoTime>	 cirno: will be doing 863467 and 864766 next
[14:25:59] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:862247|Add Property (120) to Wikidata content Namespace (T321282)]] (duration: 16m 59s)
[14:26:01] <TheresNoTime>	 guerganaWMDE: should be live on production now :)
[14:26:03] <stashbot>	 T321282: make the Property namespace on Wikidata a content namespace - https://phabricator.wikimedia.org/T321282
[14:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:26:23] <wikibugs>	 (03PS1) 10Marostegui: db1206: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864769
[14:26:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang)
[14:26:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang)
[14:26:54] <guerganaWMDE>	 *checks
[14:27:24] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:27] <wikibugs>	 (03Merged) 10jenkins-bot: logos: icon could be not square [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863467 (owner: 10Stang)
[14:27:27] <guerganaWMDE>	 \o/ it's live, thanks!
[14:27:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1206: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/864769 (owner: 10Marostegui)
[14:27:32] <wikibugs>	 (03Merged) 10jenkins-bot: trwiki: Add 20 years celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864766 (https://phabricator.wikimedia.org/T324393) (owner: 10Stang)
[14:27:32] <TheresNoTime>	 great :)
[14:27:45] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]]
[14:27:48] <stashbot>	 T324393: Change the logo of Turkish Wikipedia for 20th anniversary of Turkish Wikipedia - https://phabricator.wikimedia.org/T324393
[14:27:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42249 and previous config saved to /var/cache/conftool/dbconfig/20221205-142752-marostegui.json
[14:28:48] <guerganaWMDE>	 i will log off. thank you!
[14:28:52] <TheresNoTime>	 o/
[14:29:14] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:31] <logmsgbot>	 !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:29:39] <TheresNoTime>	 cirno: live on mwdebug
[14:30:29] <cirno>	 TheresNoTime: tested under vector, vector-2022 and timeless, all looks good to me
[14:30:36] <TheresNoTime>	 syncin'
[14:31:10] <wikibugs>	 (03PS1) 10Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926)
[14:31:12] <wikibugs>	 (03PS5) 10Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453
[14:32:06] <wikibugs>	 (03PS2) 10Samtar: beta: Set wgPageTriageEnableEnglishWikipediaFeatures to False [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 (https://phabricator.wikimedia.org/T321922) (owner: 10Stang)
[14:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:32:43] <wikibugs>	 (03PS8) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188)
[14:34:18] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5021.eqsin.wmnet,service=ats-be
[14:34:19] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5021.eqsin.wmnet,service=ats-tls
[14:34:19] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5021.eqsin.wmnet,service=varnish-fe
[14:34:22] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=ats-be
[14:34:23] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=ats-tls
[14:34:23] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet,service=varnish-fe
[14:34:23] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5025.eqsin.wmnet,service=ats-be
[14:34:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5025.eqsin.wmnet,service=ats-tls
[14:34:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5025.eqsin.wmnet,service=varnish-fe
[14:34:27] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=ats-be
[14:34:28] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=ats-tls
[14:34:28] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5025.eqsin.wmnet,service=varnish-fe
[14:34:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[14:36:11] <TheresNoTime>	 cirno: about to start 863441 (beta only)
[14:36:23] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:863467|logos: icon could be not square]], [[gerrit:864766|trwiki: Add 20 years celebration logos (T324393)]] (duration: 08m 37s)
[14:36:26] <stashbot>	 T324393: Change the logo of Turkish Wikipedia for 20th anniversary of Turkish Wikipedia - https://phabricator.wikimedia.org/T324393
[14:36:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 (https://phabricator.wikimedia.org/T321922) (owner: 10Stang)
[14:36:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] oslo_messaging_rabbit: kombu_reconnect_delay=0.1 [puppet] - 10https://gerrit.wikimedia.org/r/864321 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott)
[14:36:42] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:13] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Set wgPageTriageEnableEnglishWikipediaFeatures to False [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863441 (https://phabricator.wikimedia.org/T321922) (owner: 10Stang)
[14:37:38] <TheresNoTime>	 all done
[14:37:56] <TheresNoTime>	 !log closing UTC afternoon backport window
[14:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:35] <wikibugs>	 (03PS2) 10Btullis: Update the spark images to remove upstream support for the webhook [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926)
[14:39:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede)
[14:39:49] <wikibugs>	 (03CR) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott)
[14:39:55] <wikibugs>	 (03PS3) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816)
[14:40:07] <wikibugs>	 (03PS1) 10Ssingh: cp5011, cp5013: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864771 (https://phabricator.wikimedia.org/T323830)
[14:40:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:40:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[14:40:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott)
[14:41:00] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[14:41:06] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:41:32] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=ats-tls
[14:41:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=ats-be
[14:41:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5011.eqsin.wmnet,service=varnish-fe
[14:41:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=ats-tls
[14:41:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=ats-be
[14:41:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5013.eqsin.wmnet,service=varnish-fe
[14:42:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5011,5013].eqsin.wmnet with reason: downtimed, to be depooled
[14:42:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5011,5013].eqsin.wmnet with reason: downtimed, to be depooled
[14:42:58] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:43:14] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp5011, cp5013: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864771 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[14:44:13] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb Investigate the expected behaviour of the edges table - https://phabricator.wikimedia.org/T287673 (10fgiunchedi) I'm optimistically pulling o11y since AFAICS there's no actionabled
[14:48:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[14:48:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5011,5013].eqsin.wmnet
[14:49:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-mfossati-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:50:47] <wikibugs>	 (03PS9) 10David Caro: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[14:51:27] <wikibugs>	 10SRE, 10Observability-Metrics, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10fgiunchedi) 05Open→03Invalid I've run the following query `sum by (state) (apache_workers)` and I'm seeing only state `busy` or...
[14:53:58] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[14:54:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:54:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[14:55:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:55:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:56:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5011,5013].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:57:09] <wikibugs>	 (03PS2) 10Majavah: puppetdb: support using client certificates [software/cumin] - 10https://gerrit.wikimedia.org/r/863874
[14:57:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5011,5013].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:57:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:57:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[5011,5013].eqsin.wmnet
[14:57:41] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5011,5013].eqsin.wmnet` - cp5011.eqsin.w...
[14:57:54] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[14:59:52] <wikibugs>	 (03CR) 10Majavah: puppetdb: support using client certificates (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah)
[15:01:10] <wikibugs>	 (03CR) 10FNegri: spicerack: add module injection support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[15:02:22] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on moss-fe1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:05:02] <wikibugs>	 10SRE, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10fgiunchedi)
[15:05:13] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[15:05:23] <wikibugs>	 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10jcrespo) > I would suggest that the alert should be on a request p...
[15:05:23] <logmsgbot>	 !log root@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:06:12] <logmsgbot>	 !log root@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:06:24] <wikibugs>	 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10jcrespo)
[15:06:40] <logmsgbot>	 !log root@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:07:08] <logmsgbot>	 !log root@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:08:14] <wikibugs>	 (03Merged) 10jenkins-bot: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[15:11:44] <wikibugs>	 (03Abandoned) 10Herron: swift: update ephemeral port range from 1024-65535 to 10240-65535 [puppet] - 10https://gerrit.wikimedia.org/r/808040 (https://phabricator.wikimedia.org/T311262) (owner: 10Herron)
[15:11:48] <wikibugs>	 (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:14:07] <wikibugs>	 (03PS35) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[15:14:26] <wikibugs>	 (03CR) 10Btullis: "Thanks, this looks great in general." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[15:14:45] <wikibugs>	 (03PS36) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[15:14:49] <andrewbogott>	 !log deleted wikitech-static-ord-prebuster image backup in rackspace cloud. Here concludes the wikitech-static upgrade to Buster and php7.4
[15:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:16] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2010 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:33] <wikibugs>	 (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:16:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:10] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:52] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM, I have not played much with it yet, but I'm sure we can fix any issues that pop up if any." [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[15:18:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST gateways) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:21:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:36] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: increase memory limit for instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936)
[15:25:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5022.eqsin.wmnet,service=ats-be
[15:25:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5022.eqsin.wmnet,service=ats-tls
[15:25:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5022.eqsin.wmnet,service=varnish-fe
[15:25:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=ats-be
[15:25:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=ats-tls
[15:25:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet,service=varnish-fe
[15:25:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5026.eqsin.wmnet,service=ats-be
[15:25:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5026.eqsin.wmnet,service=ats-tls
[15:25:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5026.eqsin.wmnet,service=varnish-fe
[15:25:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-be
[15:25:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=ats-tls
[15:25:44] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5026.eqsin.wmnet,service=varnish-fe
[15:26:04] <wikibugs>	 (03PS1) 10Ssingh: cp5012, cp5014: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864775 (https://phabricator.wikimedia.org/T323830)
[15:28:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=ats-tls
[15:28:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=ats-be
[15:28:43] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5012.eqsin.wmnet,service=varnish-fe
[15:28:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=ats-tls
[15:28:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=ats-be
[15:28:48] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5014.eqsin.wmnet,service=varnish-fe
[15:30:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp[5012,5014].eqsin.wmnet with reason: downtimed, to be depooled
[15:30:41] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp[5012,5014].eqsin.wmnet with reason: downtimed, to be depooled
[15:31:42] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp5012, cp5014: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864775 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[15:34:58] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on moss-fe1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:35:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp[5012,5014].eqsin.wmnet
[15:36:06] <moritzm>	 !log installing apache2 security updates on buster
[15:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:15] <wikibugs>	 10SRE, 10PyBal, 10Traffic-Icebox: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730 (10Aklapper) p:05Medium→03Low
[15:41:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:43:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (10MoritzMuehlenhoff)
[15:43:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (8) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:44:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5012,5014].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[15:44:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Retire role::spare::system - https://phabricator.wikimedia.org/T324475 (10Volans)
[15:44:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh)
[15:45:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp[5012,5014].eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[15:45:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:45:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp[5012,5014].eqsin.wmnet
[15:45:51] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp[5012,5014].eqsin.wmnet` - cp5012.eqsin.w...
[15:45:57] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[15:46:03] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1010.eqiad.wmnet with OS bullseye
[15:48:18] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10ssingh) We have enabled the hardened haproxy unit on `traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud` to start with, before rolling it out to the production cp hosts.
[15:49:28] <wikibugs>	 (03PS4) 10Ssingh: [In case of emergency/Stage 3] depool eqsin for hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/856664
[15:50:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:22] <wikibugs>	 (03CR) 10Ssingh: "Emergency patch for Stage 3 of eqsin hardware refresh (Monday Dec 5). DO NOT MERGE unless there are issues with eqsin." [dns] - 10https://gerrit.wikimedia.org/r/856664 (owner: 10Ssingh)
[15:52:49] <wikibugs>	 (03PS1) 10Herron: vo-escalate: kill process if run time exceeds 10s [puppet] - 10https://gerrit.wikimedia.org/r/864776 (https://phabricator.wikimedia.org/T324466)
[15:52:58] <wikibugs>	 (03CR) 10DCausse: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[15:58:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:02:54] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:04:56] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:05:49] <wikibugs>	 (03PS4) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294
[16:06:30] <moritzm>	 !log installing glibc security updates on buster
[16:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:44] <klausman>	 !log restarted kube-apiserver on ml-serve-ctrl1001 to address high latency and large number of 504s
[16:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10Muhammad_Yasser_Jazirahly_WMDE)
[16:08:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:11:27] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1010.eqiad.wmnet with reason: host reimage
[16:13:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:14:36] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1010.eqiad.wmnet with reason: host reimage
[16:21:36] <wikibugs>	 (03PS13) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:25:44] <wikibugs>	 (03PS14) 10Elukey: Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:27:05] <klausman>	 !log restarted kube-apiserver on ml-staging-ctrl2001 to address high latency
[16:27:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:05] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1630).
[16:30:17] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Merging to unblock CI" [software/cumin] - 10https://gerrit.wikimedia.org/r/864764 (owner: 10Volans)
[16:31:40] <wikibugs>	 (03CR) 10Ottomata: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[16:32:41] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificates) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:38:11] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: temporary fix for test dependencies [software/cumin] - 10https://gerrit.wikimedia.org/r/864764 (owner: 10Volans)
[16:38:30] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5023.eqsin.wmnet,service=ats-be
[16:38:31] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5023.eqsin.wmnet,service=ats-tls
[16:38:31] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5023.eqsin.wmnet,service=varnish-fe
[16:38:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=ats-be
[16:38:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=ats-tls
[16:38:33] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet,service=varnish-fe
[16:38:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5027.eqsin.wmnet,service=ats-be
[16:38:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5027.eqsin.wmnet,service=ats-tls
[16:38:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5027.eqsin.wmnet,service=varnish-fe
[16:38:39] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=ats-be
[16:38:39] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=ats-tls
[16:38:40] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5027.eqsin.wmnet,service=varnish-fe
[16:38:55] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1035.eqiad.wmnet with OS bullseye
[16:39:00] <wikibugs>	 (03PS1) 10Ssingh: cp5015: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864785 (https://phabricator.wikimedia.org/T323830)
[16:39:37] <wikibugs>	 (03PS3) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[16:40:10] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1010.eqiad.wmnet with OS bullseye
[16:40:15] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=ats-tls
[16:40:16] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=ats-be
[16:40:16] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5015.eqsin.wmnet,service=varnish-fe
[16:41:00] <wikibugs>	 (03CR) 10Herron: service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[16:41:37] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1034.eqiad.wmnet with OS bullseye
[16:43:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5015.eqsin.wmnet with reason: downtimed, to be depooled
[16:43:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5015.eqsin.wmnet with reason: downtimed, to be depooled
[16:43:31] <wikibugs>	 (03CR) 10Volans: "Looks good! Small nits and a proposal for a small improvement inline." [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah)
[16:44:07] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] cp5015: decommission hosts (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/864785 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[16:44:45] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1033.eqiad.wmnet with OS bullseye
[16:45:11] <wikibugs>	 (03PS1) 10CDanis: Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T234466)
[16:45:34] <wikibugs>	 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10akosiaris)
[16:46:18] <wikibugs>	 10SRE, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (10LSobanski) p:05Medium→03Low
[16:47:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10Jrbranaa) Approved.
[16:48:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp5015.eqsin.wmnet
[16:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:49:37] <wikibugs>	 (03PS2) 10CDanis: Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466)
[16:49:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[16:50:06] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Wikimedia-Planet, 10serviceops-collab, 10Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480 (10LSobanski) p:05Medium→03Lowest
[16:53:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[16:53:37] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1035.eqiad.wmnet with reason: host reimage
[16:55:55] <wikibugs>	 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10akosiaris) @Ottomata, @elukey  any updates on this? Should we keep it open/...
[16:56:11] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1034.eqiad.wmnet with reason: host reimage
[16:56:11] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1035.eqiad.wmnet with reason: host reimage
[16:56:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:57:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Add a timeout for vo-escalate [puppet] - 10https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466) (owner: 10CDanis)
[16:57:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5015.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[16:57:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:57:57] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp5015.eqsin.wmnet
[16:58:05] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp5015.eqsin.wmnet` - cp5015.eqsin.wmnet (*...
[16:58:14] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[16:59:14] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1034.eqiad.wmnet with reason: host reimage
[16:59:33] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1033.eqiad.wmnet with reason: host reimage
[17:00:03] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (LSobanski)
[17:00:23] <wikibugs>	 SRE, Diffusion, Release-Engineering-Team, serviceops-collab: svn.wikimedia.org redirects to Diffusion main page, hence hard to find e.g. "flexbisonparse" - https://phabricator.wikimedia.org/T140594 (LSobanski)
[17:00:33] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:00:47] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (LSobanski) p:Low→Lowest
[17:01:51] <wikibugs>	 (PS1) Ssingh: cp5016: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/864789 (https://phabricator.wikimedia.org/T323830)
[17:02:34] <wikibugs>	 SRE, Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (Ottomata) I would like to see config management for Kafka topics one day, i...
[17:02:35] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1033.eqiad.wmnet with reason: host reimage
[17:03:37] <rzl>	 ^ httpbb timeout for https://en.wikivoyage.org/wiki/Main_Page this time, interesting
[17:03:48] <icinga-wm>	 PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 3538 MB (3% inode=80%): /tmp 3538 MB (3% inode=80%): /var/tmp 3538 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[17:03:52] <rzl>	 something must have changed, we're getting random timeouts a lot more frequently
[17:10:50] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:32] <wikibugs>	 (CR) Jforrester: [C: +1] "Oops. Apparently we don't trigger this prod!" [mediawiki-config] - https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: Zabe)
[17:15:55] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett)
[17:15:59] <wikibugs>	 SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (sbassett)
[17:19:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:55] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1035.eqiad.wmnet with OS bullseye
[17:21:58] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1034.eqiad.wmnet with OS bullseye
[17:28:34] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: name=cp5024.eqsin.wmnet,service=ats-be
[17:28:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5024.eqsin.wmnet,service=ats-tls
[17:28:35] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/weight=1; selector: name=cp5024.eqsin.wmnet,service=varnish-fe
[17:28:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=ats-be
[17:28:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=ats-tls
[17:28:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet,service=varnish-fe
[17:28:43] <wikibugs>	 SRE, Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (akosiaris) Open→Stalled Cool, thanks for that write up @Ottomata. I...
[17:30:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:25] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=ats-tls
[17:30:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=ats-be
[17:30:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5016.eqsin.wmnet,service=varnish-fe
[17:30:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5016.eqsin.wmnet with reason: downtimed, to be depooled
[17:31:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5016.eqsin.wmnet with reason: downtimed, to be depooled
[17:31:25] <wikibugs>	 (CR) Ssingh: [C: +2] cp5016: decommission hosts (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/864789 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[17:31:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:31:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:31:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:32:56] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[17:33:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts cp5016.eqsin.wmnet
[17:34:48] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:37:20] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:38:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:39:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[17:40:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:41:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:42:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp5016.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:42:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp5016.eqsin.wmnet
[17:42:16] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `cp5016.eqsin.wmnet` - cp5016.eqsin.wmnet (*...
[17:42:34] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh)
[17:44:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:44:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:45:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:48:41] <wikibugs>	 (PS1) Ssingh: hiera: remove obsolete per-cp hosts override [puppet] - https://gerrit.wikimedia.org/r/864797
[17:49:18] <wikibugs>	 (PS2) Ssingh: hiera: remove obsolete per-cp hosts override (eqsin) [puppet] - https://gerrit.wikimedia.org/r/864797
[17:49:40] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:52:03] <wikibugs>	 SRE: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (LSobanski) Open→Resolved a:LSobanski The linked GitHub task was resolved without resolution. Resolving this one as well, please reopen if this is still needed.
[17:52:54] <wikibugs>	 SRE, serviceops, Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (LSobanski)
[17:53:00] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[17:53:14] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:53:46] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: remove obsolete per-cp hosts override (eqsin) [puppet] - https://gerrit.wikimedia.org/r/864797 (owner: Ssingh)
[17:54:44] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:54:48] <wikibugs>	 SRE, serviceops, Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (akosiaris) Open→Resolved a:akosiaris ` ssh kubernetes1007.eqiad.wmnet dpkg -l docker.io |grep docker.io ii...
[17:54:56] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:55:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:57:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[17:58:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[18:00:04] <jouncebot>	 ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T1800).
[18:01:53] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:02:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:03:51] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:04:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[18:04:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[18:13:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[18:13:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2127.codfw.wmnet with reason: Maintenance
[18:13:38] <wikibugs>	 (CR) CDanis: [C: +2] Add a timeout for vo-escalate [puppet] - https://gerrit.wikimedia.org/r/864787 (https://phabricator.wikimedia.org/T324466) (owner: CDanis)
[18:19:05] <wikibugs>	 (CR) Herron: [C: +1] decom graphite1004 [puppet] - https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: Filippo Giunchedi)
[18:19:41] <wikibugs>	 (CR) Herron: [C: +1] hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[18:20:52] <wikibugs>	 (PS1) CDanis: TIL that systemd doesn't allow mid-line comments [puppet] - https://gerrit.wikimedia.org/r/864827
[18:21:45] <wikibugs>	 (CR) CDanis: [C: +2] TIL that systemd doesn't allow mid-line comments [puppet] - https://gerrit.wikimedia.org/r/864827 (owner: CDanis)
[18:21:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P42251 and previous config saved to /var/cache/conftool/dbconfig/20221205-182155-ladsgroup.json
[18:22:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:23:05] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[18:24:43] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[18:27:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:36:06] <wikibugs>	 SRE, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn) Resolved→Open
[18:36:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:37:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P42252 and previous config saved to /var/cache/conftool/dbconfig/20221205-183700-ladsgroup.json
[18:37:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[18:37:06] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1033.eqiad.wmnet with OS bullseye
[18:37:09] <wikibugs>	 SRE, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn) a:jbond→None
[18:37:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42253 and previous config saved to /var/cache/conftool/dbconfig/20221205-183712-ladsgroup.json
[18:37:13] <wikibugs>	 (PS1) Jberkel: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343)
[18:37:15] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[18:38:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[18:38:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[18:38:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42254 and previous config saved to /var/cache/conftool/dbconfig/20221205-183851-ladsgroup.json
[18:38:55] <stashbot>	 T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827
[18:40:39] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn)
[18:41:35] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn) @Fuzzy Hi, access requests are handled by a different person each week, that's why you see me reopen and unassign/tag it. it...
[18:41:58] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Dzahn) a:jhathaway
[18:45:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1089-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:46:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42255 and previous config saved to /var/cache/conftool/dbconfig/20221205-184643-ladsgroup.json
[18:46:48] <stashbot>	 T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827
[18:47:34] <wikibugs>	 (PS2) Jberkel: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343)
[18:47:42] <wikibugs>	 (PS1) Ebernhardson: prom: Add elasticsearch cluster name to exported latency metrics [puppet] - https://gerrit.wikimedia.org/r/864829
[18:49:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:49:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance
[18:49:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[18:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2117.codfw.wmnet with reason: Maintenance
[18:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42256 and previous config saved to /var/cache/conftool/dbconfig/20221205-184944-ladsgroup.json
[18:49:48] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[18:49:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42257 and previous config saved to /var/cache/conftool/dbconfig/20221205-184950-ladsgroup.json
[18:51:42] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thanks!!" [puppet] - https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: Filippo Giunchedi)
[18:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P42258 and previous config saved to /var/cache/conftool/dbconfig/20221205-185205-ladsgroup.json
[18:52:17] <wikibugs>	 (CR) Andrea Denisse: [C: +1] hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[18:54:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42259 and previous config saved to /var/cache/conftool/dbconfig/20221205-185429-ladsgroup.json
[18:59:56] <wikibugs>	 SRE, Toolhub, serviceops, Patch-For-Review, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (bd808) Open→Resolved a:Legoktm
[19:01:43] <wikibugs>	 (CR) Jberkel: "Perhaps the change should be made in bullseye|buster-sssd/Dockerfile.template instead?" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[19:01:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P42260 and previous config saved to /var/cache/conftool/dbconfig/20221205-190150-ladsgroup.json
[19:03:43] <wikibugs>	 (PS2) Bking: prom: Add elasticsearch cluster name to exported latency metrics [puppet] - https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: Ebernhardson)
[19:05:02] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: Ebernhardson)
[19:05:37] <wikibugs>	 (PS1) Effie Mouzeli: Redis sessions: Goodbye [puppet] - https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581)
[19:06:57] <wikibugs>	 (CR) David Caro: [C: +2] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862294 (owner: David Caro)
[19:07:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2127 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P42261 and previous config saved to /var/cache/conftool/dbconfig/20221205-190710-ladsgroup.json
[19:09:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42262 and previous config saved to /var/cache/conftool/dbconfig/20221205-190935-ladsgroup.json
[19:10:14] <wikibugs>	 (Merged) jenkins-bot: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862294 (owner: David Caro)
[19:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P42263 and previous config saved to /var/cache/conftool/dbconfig/20221205-191656-ladsgroup.json
[19:19:01] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (Dzahn) Hi @Muhammad_Yasser_Jazirahly_WMDE , welcome!  Could you get a manager at WMDE to approve this here on the ticket? This will be picked up soon by our rotating clinic duty.  note to cl...
[19:19:45] <wikibugs>	 (CR) Eevans: [C: +1] Promote Cassandra 3.11.13 to '3.x' (aka stable) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/863026 (owner: Eevans)
[19:20:17] <wikibugs>	 SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn)
[19:20:25] <wikibugs>	 SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) a:Dzahn
[19:22:55] <wikibugs>	 SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) p:Triage→Medium
[19:24:19] <mutante>	 !log phab1001, previous long time phabricator host, is about to be shut down, made a final copy of /srv/deployment, /root, /home, /etc and synced it to phab1004 - T323418
[19:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:22] <stashbot>	 T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418
[19:24:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P42264 and previous config saved to /var/cache/conftool/dbconfig/20221205-192442-ladsgroup.json
[19:30:37] <wikibugs>	 (PS5) Dzahn: phabricator: remove production role from phab1001 [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597)
[19:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323827)', diff saved to https://phabricator.wikimedia.org/P42265 and previous config saved to /var/cache/conftool/dbconfig/20221205-193203-ladsgroup.json
[19:32:07] <stashbot>	 T323827: Finish timestamp schema changes in flaggedrevs - https://phabricator.wikimedia.org/T323827
[19:32:33] <wikibugs>	 (CR) Dzahn: "this removes the production role and phab1001 will be removed from firewall on remaining hosts, also deletes the entire hosts Hiera entry." [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[19:32:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42266 and previous config saved to /var/cache/conftool/dbconfig/20221205-193250-ladsgroup.json
[19:32:53] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[19:34:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42267 and previous config saved to /var/cache/conftool/dbconfig/20221205-193448-ladsgroup.json
[19:35:39] <wikibugs>	 (PS1) Ahmon Dancy: logspam: Filter out some very common errors by default [puppet] - https://gerrit.wikimedia.org/r/864834
[19:39:29] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[19:39:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P42268 and previous config saved to /var/cache/conftool/dbconfig/20221205-193949-ladsgroup.json
[19:39:53] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[19:40:41] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:41:13] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:44:05] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "LGTM although this now needs a manual rebase" [puppet] - https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[19:46:01] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:47:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42269 and previous config saved to /var/cache/conftool/dbconfig/20221205-194757-ladsgroup.json
[19:48:52] <wikibugs>	 (PS6) Southparkfan: rsyslog: allow specifying TLS client auth settings and filename property [puppet] - https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717)
[19:49:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42270 and previous config saved to /var/cache/conftool/dbconfig/20221205-194955-ladsgroup.json
[19:49:55] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: remove production role from phab1001 [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[19:50:01] <wikibugs>	 (PS6) Dzahn: phabricator: remove production role from phab1001 [puppet] - https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597)
[19:50:32] <wikibugs>	 (CR) Brennen Bearnes: [C: +1] "Tested on mwlog1002; works great." [puppet] - https://gerrit.wikimedia.org/r/864834 (owner: Ahmon Dancy)
[19:53:30] <wikibugs>	 (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/816046/38583/cloudcontrol1005.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[19:56:53] <wikibugs>	 (CR) Andrew Bogott: [C: +2] rsyslog: allow specifying TLS client auth settings and filename property [puppet] - https://gerrit.wikimedia.org/r/816046 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[19:57:54] <mutante>	 !log phab1004 (prod) - removing phab1001 from firewall rules, rsync config | phab1001 (formerly prod) - removing prod role T323418 T280597
[19:57:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:59] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[19:57:59] <stashbot>	 T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418
[19:58:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2127.codfw.wmnet with reason: Maintenance
[19:58:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2127.codfw.wmnet with reason: Maintenance
[19:58:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42271 and previous config saved to /var/cache/conftool/dbconfig/20221205-195842-ladsgroup.json
[19:58:46] <stashbot>	 T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[20:00:24] <wikibugs>	 (PS1) Dzahn: site/phabricator: fix insetup role name which is now team specific [puppet] - https://gerrit.wikimedia.org/r/864840 (https://phabricator.wikimedia.org/T323418)
[20:01:11] <wikibugs>	 (CR) Dzahn: [C: +2] site/phabricator: fix insetup role name which is now team specific [puppet] - https://gerrit.wikimedia.org/r/864840 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[20:02:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on phab1001.eqiad.wmnet with reason: decom, replaced by phab1004
[20:02:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on phab1001.eqiad.wmnet with reason: decom, replaced by phab1004
[20:03:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42272 and previous config saved to /var/cache/conftool/dbconfig/20221205-200303-ladsgroup.json
[20:05:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42273 and previous config saved to /var/cache/conftool/dbconfig/20221205-200501-ladsgroup.json
[20:05:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[20:05:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[20:05:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42274 and previous config saved to /var/cache/conftool/dbconfig/20221205-200530-ladsgroup.json
[20:05:34] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[20:07:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[20:07:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2101.codfw.wmnet with reason: Maintenance
[20:07:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance
[20:07:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42275 and previous config saved to /var/cache/conftool/dbconfig/20221205-200743-ladsgroup.json
[20:07:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2111.codfw.wmnet with reason: Maintenance
[20:07:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42276 and previous config saved to /var/cache/conftool/dbconfig/20221205-200755-ladsgroup.json
[20:07:58] <wikibugs>	 (PS9) Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[20:08:19] <wikibugs>	 (CR) Ottomata: flink and flink-kubernetes-operator image (11 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[20:09:52] <wikibugs>	 (CR) Ottomata: "Thanks ben!  Adding serviceops folks for review now." [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[20:10:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42277 and previous config saved to /var/cache/conftool/dbconfig/20221205-201021-ladsgroup.json
[20:12:11] <wikibugs>	 (PS3) Dzahn: site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418)
[20:12:59] <wikibugs>	 (CR) CI reject: [V: -1] site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[20:16:03] <wikibugs>	 (CR) Dzahn: [C: +2] "looks reasonable and already tested" [puppet] - https://gerrit.wikimedia.org/r/864834 (owner: Ahmon Dancy)
[20:18:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T323907)', diff saved to https://phabricator.wikimedia.org/P42278 and previous config saved to /var/cache/conftool/dbconfig/20221205-201810-ladsgroup.json
[20:18:11] <wikibugs>	 (CR) Dzahn: [C: +2] "deployed on mwlog1002. /usr/local/bin/logspam still works. running puppet on mwlog*" [puppet] - https://gerrit.wikimedia.org/r/864834 (owner: Ahmon Dancy)
[20:18:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[20:18:15] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[20:18:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2124.codfw.wmnet with reason: Maintenance
[20:18:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42279 and previous config saved to /var/cache/conftool/dbconfig/20221205-201831-ladsgroup.json
[20:18:36] <wikibugs>	 (CR) Bking: [C: +2] prom: Add elasticsearch cluster name to exported latency metrics [puppet] - https://gerrit.wikimedia.org/r/864829 (https://phabricator.wikimedia.org/T324500) (owner: Ebernhardson)
[20:19:32] <wikibugs>	 (PS4) Dzahn: site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418)
[20:19:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42280 and previous config saved to /var/cache/conftool/dbconfig/20221205-202008-ladsgroup.json
[20:20:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[20:20:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[20:20:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42281 and previous config saved to /var/cache/conftool/dbconfig/20221205-202029-ladsgroup.json
[20:21:56] <wikibugs>	 (PS2) Bartosz Dziewoński: Use new DiscussionTools heading markup on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714)
[20:22:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42282 and previous config saved to /var/cache/conftool/dbconfig/20221205-202250-ladsgroup.json
[20:25:25] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts phab1001.eqiad.wmnet
[20:25:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42283 and previous config saved to /var/cache/conftool/dbconfig/20221205-202528-ladsgroup.json
[20:28:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42284 and previous config saved to /var/cache/conftool/dbconfig/20221205-202846-ladsgroup.json
[20:28:50] <stashbot>	 T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[20:30:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:59] <wikibugs>	 (PS1) Bartosz Dziewoński: Adjust to changes to redlink behavior from parsoid [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352)
[20:37:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42285 and previous config saved to /var/cache/conftool/dbconfig/20221205-203756-ladsgroup.json
[20:38:23] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[20:40:20] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:40:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42286 and previous config saved to /var/cache/conftool/dbconfig/20221205-204034-ladsgroup.json
[20:43:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42287 and previous config saved to /var/cache/conftool/dbconfig/20221205-204352-ladsgroup.json
[20:44:45] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[20:47:05] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin2002"
[20:47:05] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:47:06] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab1001.eqiad.wmnet
[20:47:18] <wikibugs>	 (CR) BryanDavis: [C: -1] Make "make" available in all images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[20:50:19] <wikibugs>	 (PS5) Dzahn: site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418)
[20:51:02] <wikibugs>	 (CR) Dzahn: [C: +2] site: remove phab1001 [puppet] - https://gerrit.wikimedia.org/r/858421 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[20:51:24] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[20:53:00] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:53:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42288 and previous config saved to /var/cache/conftool/dbconfig/20221205-205303-ladsgroup.json
[20:53:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[20:53:07] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[20:53:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[20:53:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42289 and previous config saved to /var/cache/conftool/dbconfig/20221205-205324-ladsgroup.json
[20:55:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42290 and previous config saved to /var/cache/conftool/dbconfig/20221205-205537-ladsgroup.json
[20:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T322618)', diff saved to https://phabricator.wikimedia.org/P42291 and previous config saved to /var/cache/conftool/dbconfig/20221205-205547-ladsgroup.json
[20:55:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:56:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:56:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42292 and previous config saved to /var/cache/conftool/dbconfig/20221205-205610-ladsgroup.json
[20:56:42] <MatmaRex>	 jouncebot: next
[20:56:42] <jouncebot>	 In 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2100)
[20:56:49] <MatmaRex>	 i will be 10 minutes late. sorry
[20:57:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42293 and previous config saved to /var/cache/conftool/dbconfig/20221205-205735-ladsgroup.json
[20:58:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42294 and previous config saved to /var/cache/conftool/dbconfig/20221205-205859-ladsgroup.json
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2100). Please do the needful.
[21:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:01:13] <TheresNoTime>	 I can deploy in about 5m
[21:02:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42295 and previous config saved to /var/cache/conftool/dbconfig/20221205-210220-ladsgroup.json
[21:02:24] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[21:03:23] <MatmaRex>	 i'm here now
[21:05:19] <TheresNoTime>	 ack, here too
[21:05:38] <TheresNoTime>	 starting with 856552
[21:05:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) (owner: Bartosz Dziewoński)
[21:06:51] <wikibugs>	 (Merged) jenkins-bot: Use new DiscussionTools heading markup on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) (owner: Bartosz Dziewoński)
[21:07:07] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]]
[21:07:11] <stashbot>	 T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714
[21:07:59] <wikibugs>	 (CR) Samtar: [C: +2] "deploy" [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: Bartosz Dziewoński)
[21:08:50] <logmsgbot>	 !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:08:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42296 and previous config saved to /var/cache/conftool/dbconfig/20221205-210855-ladsgroup.json
[21:09:00] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[21:09:15] <TheresNoTime>	 MatmaRex: live on mwdebug2002, can you test?
[21:09:29] <MatmaRex>	 looking
[21:10:10] <wikibugs>	 (PS3) Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212)
[21:10:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42297 and previous config saved to /var/cache/conftool/dbconfig/20221205-211045-ladsgroup.json
[21:11:01] <MatmaRex>	 TheresNoTime: looks good
[21:11:08] <TheresNoTime>	 syncin'
[21:11:30] <wikibugs>	 (PS4) Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212)
[21:12:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42298 and previous config saved to /var/cache/conftool/dbconfig/20221205-211242-ladsgroup.json
[21:14:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T312984)', diff saved to https://phabricator.wikimedia.org/P42299 and previous config saved to /var/cache/conftool/dbconfig/20221205-211405-ladsgroup.json
[21:14:09] <stashbot>	 T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[21:14:27] <wikibugs>	 (PS6) Dzahn: O:phabricator: move common settings to role hiera [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:17:03] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:856552|Use new DiscussionTools heading markup on group0 wikis (T314714)]] (duration: 09m 55s)
[21:17:06] <stashbot>	 T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714
[21:17:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) Resolved→Open It's back :(  ` cr2-esams> show system alarms  1 alarms currently active Alarm time               Class  Description 2022-12-05 18:15:58 UTC  Minor  FPC 0 M...
[21:17:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42300 and previous config saved to /var/cache/conftool/dbconfig/20221205-211727-ladsgroup.json
[21:17:31] <TheresNoTime>	 MatmaRex: that should be live now, just waiting on 864724 to merge
[21:17:42] <MatmaRex>	 cool, thanks
[21:19:32] <wikibugs>	 (CR) Ahmon Dancy: "Thanks Dzahn!" [puppet] - https://gerrit.wikimedia.org/r/864834 (owner: Ahmon Dancy)
[21:21:55] <wikibugs>	 (Merged) jenkins-bot: Adjust to changes to redlink behavior from parsoid [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: Bartosz Dziewoński)
[21:22:12] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/864724 (https://phabricator.wikimedia.org/T324352) (owner: Bartosz Dziewoński)
[21:22:26] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]]
[21:22:29] <stashbot>	 T324352: Red links marked as uneditable in visual editor - https://phabricator.wikimedia.org/T324352
[21:23:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[21:23:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1003.mgmt.eqiad.wmnet with reboot policy FORCED
[21:24:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42301 and previous config saved to /var/cache/conftool/dbconfig/20221205-212402-ladsgroup.json
[21:24:07] <logmsgbot>	 !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:24:11] <TheresNoTime>	 MatmaRex: live on mwdebug
[21:25:33] <MatmaRex>	 TheresNoTime: thanks, looks good as well
[21:25:39] <TheresNoTime>	 syncin'
[21:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42302 and previous config saved to /var/cache/conftool/dbconfig/20221205-212552-ladsgroup.json
[21:25:57] <wikibugs>	 (PS7) Dzahn: O:phabricator: move host based settings to role hiera per DC [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:26:55] <MatmaRex>	 TheresNoTime: while i have your attention, can you check on the status of this maintenance script run for me? https://phabricator.wikimedia.org/T315510#8392683
[21:27:06] <TheresNoTime>	 looking..
[21:27:17] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (WMDE-leszek) I approve this request, thank you.
[21:27:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42303 and previous config saved to /var/cache/conftool/dbconfig/20221205-212748-ladsgroup.json
[21:27:51] <wikibugs>	 (PS8) Dzahn: O:phabricator: move host based settings to role hiera per DC [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:29:12] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:32] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:30:52] <TheresNoTime>	 MatmaRex: I'm unsure how to check that, I see no tmux sessions running
[21:30:58] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan)
[21:31:28] <TheresNoTime>	 (I would say that's more a "me problem" than an indication that one isn't running though, cc taavi)
[21:31:32] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:864724|Adjust to changes to redlink behavior from parsoid (T324352)]] (duration: 09m 05s)
[21:31:35] <stashbot>	 T324352: Red links marked as uneditable in visual editor - https://phabricator.wikimedia.org/T324352
[21:31:44] <wikibugs>	 (PS9) Dzahn: O:phabricator: move host based settings to role hiere [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:31:47] <taavi>	 hello
[21:32:00] <taavi>	 what do you need from me?
[21:32:19] <TheresNoTime>	 taavi: message from Matma/Rex in scrollback regarding checking progress of https://phabricator.wikimedia.org/T315510#8392683 
[21:32:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42304 and previous config saved to /var/cache/conftool/dbconfig/20221205-213233-ladsgroup.json
[21:32:36] <TheresNoTime>	 I'm unsure how to do that, I see no running tmux sessions
[21:32:44] <wikibugs>	 (Abandoned) Effie Mouzeli: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: Effie Mouzeli)
[21:32:48] <taavi>	 MatmaRex: it's halfway through incubatorwiki
[21:32:58] <taavi>	 MatmaRex: just to confirm, did you see https://phabricator.wikimedia.org/T315510#8427310?
[21:33:16] <MatmaRex>	 thanks
[21:33:21] <MatmaRex>	 yes
[21:33:54] <TheresNoTime>	 !log close UTC late backport window
[21:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:24] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:44] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:55] <TheresNoTime>	 (taavi: for reference, how should I have checked that?)
[21:35:38] <taavi>	 TheresNoTime: `w` on mwmaint1002?
[21:36:59] <TheresNoTime>	 d'oh :)
[21:37:09] <TheresNoTime>	 (ty)
[21:38:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:39:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42305 and previous config saved to /var/cache/conftool/dbconfig/20221205-213908-ladsgroup.json
[21:39:19] <wikibugs>	 (PS1) Dzahn: phabricator: set enable_vcs to false in main profile [puppet] - https://gerrit.wikimedia.org/r/864852
[21:40:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T322618)', diff saved to https://phabricator.wikimedia.org/P42306 and previous config saved to /var/cache/conftool/dbconfig/20221205-214058-ladsgroup.json
[21:41:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[21:41:02] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[21:41:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[21:41:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42307 and previous config saved to /var/cache/conftool/dbconfig/20221205-214120-ladsgroup.json
[21:42:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:42:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1003.mgmt.eqiad.wmnet with reboot policy FORCED
[21:42:54] <wikibugs>	 (PS30) Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - https://gerrit.wikimedia.org/r/860102
[21:42:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T322618)', diff saved to https://phabricator.wikimedia.org/P42308 and previous config saved to /var/cache/conftool/dbconfig/20221205-214255-ladsgroup.json
[21:42:56] <wikibugs>	 (PS1) Effie Mouzeli: P:mediawiki::mcrouter_wancache: add gutter pools for /*/mw-wan keys [puppet] - https://gerrit.wikimedia.org/r/864853 (https://phabricator.wikimedia.org/T258779)
[21:42:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance
[21:42:58] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "no change in compiler https://puppet-compiler.wmflabs.org/output/824412/38584/" [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:43:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:43:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2128.codfw.wmnet with reason: Maintenance
[21:43:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[21:43:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[21:43:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1005.mgmt.eqiad.wmnet with reboot policy FORCED
[21:43:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42309 and previous config saved to /var/cache/conftool/dbconfig/20221205-214332-ladsgroup.json
[21:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42310 and previous config saved to /var/cache/conftool/dbconfig/20221205-214333-ladsgroup.json
[21:43:53] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] O:phabricator: move host based settings to role hiere [puppet] - https://gerrit.wikimedia.org/r/824412 (https://phabricator.wikimedia.org/T280597) (owner: Jbond)
[21:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42311 and previous config saved to /var/cache/conftool/dbconfig/20221205-214558-ladsgroup.json
[21:47:18] <wikibugs>	 (CR) Dzahn: [C: +1] "@marostegui phab1001 has been decom'ed and is shut down permanently. the grants for this IP can be revoked" [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[21:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T323907)', diff saved to https://phabricator.wikimedia.org/P42312 and previous config saved to /var/cache/conftool/dbconfig/20221205-214740-ladsgroup.json
[21:47:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[21:47:44] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[21:47:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[21:48:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T323907)', diff saved to https://phabricator.wikimedia.org/P42313 and previous config saved to /var/cache/conftool/dbconfig/20221205-214801-ladsgroup.json
[21:50:37] <wikibugs>	 ops-eqiad, Phabricator, decommission-hardware, serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Dzahn)
[21:51:09] <wikibugs>	 ops-eqiad, Phabricator, decommission-hardware, serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Dzahn)
[21:51:51] <wikibugs>	 ops-eqiad, Phabricator, decommission-hardware, serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Dzahn) a:Dzahn→Jclark-ctr https://netbox.wikimedia.org/dcim/devices/1557/ has been permanently shut down
[21:54:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42314 and previous config saved to /var/cache/conftool/dbconfig/20221205-215415-ladsgroup.json
[21:54:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:54:19] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[21:54:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[21:54:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42315 and previous config saved to /var/cache/conftool/dbconfig/20221205-215436-ladsgroup.json
[21:55:14] <mutante>	 !log deleting special DNS entries for "phab1001-vcs.eqiad.wmnet", IPv4 and IPv6 (Role: VIP), from netbox - T280597
[21:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:17] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[21:55:50] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[21:58:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42316 and previous config saved to /var/cache/conftool/dbconfig/20221205-215839-ladsgroup.json
[21:58:56] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: deleted phab1001-vcs.eqiad.wmnet IPs - dzahn@cumin2002"
[21:59:32] <mutante>	 !log deleting special DNS entries for "phab1001-vcs.eqiad.wmnet", IPv4 and IPv6 (Role: VIP), from netbox and syncing netbox data - T296022
[21:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:35] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[21:59:58] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: deleted phab1001-vcs.eqiad.wmnet IPs - dzahn@cumin2002"
[21:59:58] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:00:04] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221205T2200).
[22:01:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42317 and previous config saved to /var/cache/conftool/dbconfig/20221205-220105-ladsgroup.json
[22:05:24] <wikibugs>	 (Abandoned) Dzahn: phabricator: move some more settings from host file to common [puppet] - https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[22:05:28] <wikibugs>	 (CR) Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/864853/38588/" [puppet] - https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: Effie Mouzeli)
[22:06:09] <wikibugs>	 (CR) Effie Mouzeli: "https://puppet-compiler.wmflabs.org/output/864853/38588/" [puppet] - https://gerrit.wikimedia.org/r/864853 (https://phabricator.wikimedia.org/T258779) (owner: Effie Mouzeli)
[22:08:46] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:13:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42319 and previous config saved to /var/cache/conftool/dbconfig/20221205-221346-ladsgroup.json
[22:16:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42320 and previous config saved to /var/cache/conftool/dbconfig/20221205-221612-ladsgroup.json
[22:20:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:20:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1005.mgmt.eqiad.wmnet with reboot policy FORCED
[22:21:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1001.mgmt.eqiad.wmnet with reboot policy FORCED
[22:24:50] <tzatziki>	 !log removing 1 file for legal compliance
[22:24:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T322618)', diff saved to https://phabricator.wikimedia.org/P42321 and previous config saved to /var/cache/conftool/dbconfig/20221205-222852-ladsgroup.json
[22:28:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[22:28:56] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[22:28:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[22:29:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42322 and previous config saved to /var/cache/conftool/dbconfig/20221205-222903-ladsgroup.json
[22:29:38] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:30:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42323 and previous config saved to /var/cache/conftool/dbconfig/20221205-223015-ladsgroup.json
[22:30:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323907)', diff saved to https://phabricator.wikimedia.org/P42324 and previous config saved to /var/cache/conftool/dbconfig/20221205-223049-ladsgroup.json
[22:30:53] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[22:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T322618)', diff saved to https://phabricator.wikimedia.org/P42325 and previous config saved to /var/cache/conftool/dbconfig/20221205-223119-ladsgroup.json
[22:31:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance
[22:31:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance
[22:31:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42326 and previous config saved to /var/cache/conftool/dbconfig/20221205-223140-ladsgroup.json
[22:32:32] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:34:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42328 and previous config saved to /var/cache/conftool/dbconfig/20221205-223406-ladsgroup.json
[22:34:10] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[22:39:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42329 and previous config saved to /var/cache/conftool/dbconfig/20221205-223912-ladsgroup.json
[22:39:16] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[22:40:33] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1001.mgmt.eqiad.wmnet with reboot policy FORCED
[22:45:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42330 and previous config saved to /var/cache/conftool/dbconfig/20221205-224522-ladsgroup.json
[22:45:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42331 and previous config saved to /var/cache/conftool/dbconfig/20221205-224555-ladsgroup.json
[22:49:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42332 and previous config saved to /var/cache/conftool/dbconfig/20221205-224913-ladsgroup.json
[22:54:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42333 and previous config saved to /var/cache/conftool/dbconfig/20221205-225419-ladsgroup.json
[22:56:36] <wikibugs>	 SRE, LDAP-Access-Requests, User-vaughnwalters, User-zeljkofilipin: Request for wmf group access for user: vwalters - https://phabricator.wikimedia.org/T324515 (Jrbranaa) Approved
[23:00:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42334 and previous config saved to /var/cache/conftool/dbconfig/20221205-230028-ladsgroup.json
[23:01:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42335 and previous config saved to /var/cache/conftool/dbconfig/20221205-230102-ladsgroup.json
[23:04:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42336 and previous config saved to /var/cache/conftool/dbconfig/20221205-230419-ladsgroup.json
[23:09:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42337 and previous config saved to /var/cache/conftool/dbconfig/20221205-230925-ladsgroup.json
[23:15:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42338 and previous config saved to /var/cache/conftool/dbconfig/20221205-231535-ladsgroup.json
[23:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[23:15:39] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[23:15:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[23:15:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42339 and previous config saved to /var/cache/conftool/dbconfig/20221205-231556-ladsgroup.json
[23:16:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T323907)', diff saved to https://phabricator.wikimedia.org/P42340 and previous config saved to /var/cache/conftool/dbconfig/20221205-231608-ladsgroup.json
[23:16:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[23:16:12] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[23:16:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance
[23:18:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42341 and previous config saved to /var/cache/conftool/dbconfig/20221205-231809-ladsgroup.json
[23:19:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42342 and previous config saved to /var/cache/conftool/dbconfig/20221205-231926-ladsgroup.json
[23:19:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance
[23:19:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance
[23:19:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T322618)', diff saved to https://phabricator.wikimedia.org/P42343 and previous config saved to /var/cache/conftool/dbconfig/20221205-231948-ladsgroup.json
[23:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T322618)', diff saved to https://phabricator.wikimedia.org/P42344 and previous config saved to /var/cache/conftool/dbconfig/20221205-232113-ladsgroup.json
[23:21:17] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[23:24:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42345 and previous config saved to /var/cache/conftool/dbconfig/20221205-232432-ladsgroup.json
[23:24:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[23:24:36] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[23:24:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[23:24:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T323907)', diff saved to https://phabricator.wikimedia.org/P42346 and previous config saved to /var/cache/conftool/dbconfig/20221205-232453-ladsgroup.json
[23:33:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42347 and previous config saved to /var/cache/conftool/dbconfig/20221205-233316-ladsgroup.json
[23:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42348 and previous config saved to /var/cache/conftool/dbconfig/20221205-233620-ladsgroup.json
[23:41:54] <tzatziki>	 !log removing 5 files for legal compliance
[23:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323907)', diff saved to https://phabricator.wikimedia.org/P42349 and previous config saved to /var/cache/conftool/dbconfig/20221205-234425-ladsgroup.json
[23:44:29] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[23:44:59] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@1d3ba41]: import_cirrus: Update doc cleaning to match cirrus updates
[23:47:30] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@1d3ba41]: import_cirrus: Update doc cleaning to match cirrus updates (duration: 02m 30s)
[23:48:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42350 and previous config saved to /var/cache/conftool/dbconfig/20221205-234822-ladsgroup.json
[23:51:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42351 and previous config saved to /var/cache/conftool/dbconfig/20221205-235126-ladsgroup.json
[23:52:33] <wikibugs>	 (PS3) Jberkel: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343)
[23:55:42] <wikibugs>	 (PS4) Jberkel: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343)
[23:56:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[23:57:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[23:57:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[23:57:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[23:57:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T323907)', diff saved to https://phabricator.wikimedia.org/P42352 and previous config saved to /var/cache/conftool/dbconfig/20221205-235724-ladsgroup.json
[23:57:27] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[23:57:45] <tzatziki>	 !log removing 2 files for legal compliance
[23:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42353 and previous config saved to /var/cache/conftool/dbconfig/20221205-235932-ladsgroup.json