[00:22:53] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734435 [00:27:37] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734436 [00:29:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Effeietsanders) 05Open→03Resolved Thanks @Dzahn it looks like making the user explicit for bast did the trick, I'm in. [00:32:23] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734437 [00:34:53] (03PS1) 10PipelineBot: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734438 [00:38:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Dzahn) Great! 
Thanks for confirming and handling the ticket :) [00:59:21] (03CR) 10Krinkle: Renaming $wmfEtcdLastModifiedIndex to $wmgEtcdLastModifiedIndex (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734431 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [01:10:05] (03CR) 10Krinkle: [C: 03+2] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:10:15] (03CR) 10jerkins-bot: [V: 04-1] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:12:54] (03PS2) 10Krinkle: logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) [01:13:01] (03CR) 10Krinkle: [C: 03+2] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:13:03] (03CR) 10jerkins-bot: [V: 04-1] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:13:11] (03CR) 10jerkins-bot: [V: 04-1] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:14:56] (03PS3) 10Krinkle: logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) [01:15:00] (03CR) 10Krinkle: [C: 03+2] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:15:36] (03CR) 10jerkins-bot: [V: 04-1] logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) 
[01:16:09] (03PS4) 10Krinkle: logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) [01:16:21] (03CR) 10Krinkle: [C: 03+2] "He, phpcs rules changed, ok, now it makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:17:04] (03Merged) 10jenkins-bot: logging: Remove host 'ip' field [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:18:28] * Krinkle staging on mwdebug1002 [01:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:23:41] (03CR) 10Krinkle: "Verified mwdebug1001 verbose logs contain the field, staged on x002 does not. No errors or warnings without verbose mode." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/730905 (https://phabricator.wikimedia.org/T114700) (owner: 10Krinkle) [01:24:04] !log krinkle@deploy1002 Synchronized wmf-config/logging.php: I0211e1c77 (duration: 00m 55s) [01:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:09] 10SRE, 10Performance-Team, 10serviceops, 10MW-1.36-notes, and 3 others: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) a:05Krinkle→03aaron Signing over to Aaron for removing some bits and pieces from WANObjectCache, which I'll rev... 
[01:43:26] 10SRE, 10Traffic-Icebox, 10MW-1.35-notes (1.35.0-wmf.40; 2020-07-07), 10Patch-For-Review, and 2 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [01:48:48] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734439 [01:52:28] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734440 [01:55:52] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734441 [02:00:04] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T0200) [02:03:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.6 [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734444 [02:06:52] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.6 [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734444 (owner: 10TrainBranchBot) [02:14:58] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [02:16:56] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [02:25:13] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.6 [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734444 (owner: 10TrainBranchBot) [02:31:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:05] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Kanban): db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) - https://phabricator.wikimedia.org/T294295 (10Marostegui) Once the table check is completed, let's do a data check for some of the tables of the biggest wikis [04:16:26] (03PS1) 10Marostegui: db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/734448 (https://phabricator.wikimedia.org/T294295) [04:17:08] (03PS2) 10DLynch: Add event stream config for discussiontools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/731854 (https://phabricator.wikimedia.org/T286076) [04:17:15] (03CR) 10Marostegui: [C: 03+2] db1112: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/734448 (https://phabricator.wikimedia.org/T294295) (owner: 10Marostegui) [04:31:46] (03PS1) 10Marostegui: wmnet: Decrease TTL for m5-master [dns] - 10https://gerrit.wikimedia.org/r/734449 (https://phabricator.wikimedia.org/T288093) [04:33:14] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 211 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:39:24] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:59:21] (03CR) 10Jforrester: "recheck" [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734444 (owner: 10TrainBranchBot) [05:00:01] (03CR) 10Jforrester: "check php" [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734444 (owner: 10TrainBranchBot) [05:03:47] (03CR) 10Jforrester: [C: 03+1] Add php 7.4 on buster images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/732642 (https://phabricator.wikimedia.org/T293996) (owner: 10Giuseppe Lavagetto) [05:08:23] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, and 2 others: db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) - https://phabricator.wikimedia.org/T294295 (10Marostegui) So the logs show lots of errors from previous days: ` Oct 19 06:45:09 db1112 kernel: [13398165.820227] mce: [Hardware Error]... 
[05:17:25] (03PS1) 10Marostegui: Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/734319 [05:21:03] (03CR) 10Marostegui: [C: 03+2] Revert "db1109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/734319 (owner: 10Marostegui) [05:21:09] (03PS1) 104nn1l2: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734450 [05:21:11] (03PS1) 104nn1l2: Temporarily change the votewiki lang to fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734451 (https://phabricator.wikimedia.org/T292685) [05:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17601 and previous config saved to /var/cache/conftool/dbconfig/20211026-052129-root.json [05:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:14] (03PS1) 10Effie Mouzeli: mwdebug: reduce number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/734452 [05:25:41] (03CR) 10Effie Mouzeli: "No need to keep 12 replicas around, merge it anytime" [deployment-charts] - 10https://gerrit.wikimedia.org/r/734452 (owner: 10Effie Mouzeli) [05:25:58] (03CR) 104nn1l2: "Hello," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734451 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [05:36:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17602 and previous config saved to /var/cache/conftool/dbconfig/20211026-053633-root.json [05:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17603 and previous config saved to /var/cache/conftool/dbconfig/20211026-055136-root.json 
[05:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17604 and previous config saved to /var/cache/conftool/dbconfig/20211026-060640-root.json [06:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17605 and previous config saved to /var/cache/conftool/dbconfig/20211026-062144-root.json [06:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1109 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P17606 and previous config saved to /var/cache/conftool/dbconfig/20211026-063647-root.json [06:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:00] (03CR) 10ArielGlenn: [C: 03+2] dumps: Improve enterprise index a bit more [puppet] - 10https://gerrit.wikimedia.org/r/732019 (owner: 10Legoktm) [06:45:45] (03Abandoned) 10ArielGlenn: Fix comment about file path being maintained by puppet [puppet] - 10https://gerrit.wikimedia.org/r/732020 (owner: 10Reedy) [06:54:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [07:05:43] !log pool wtp1026.eqiad.wmnet [07:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:16] !log pool mw1319 and mw1312 [07:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:12] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): 
Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [07:09:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [07:12:49] (03PS1) 10Elukey: helmfile.d: test another STORAGE_URI for revscoring-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/734559 (https://phabricator.wikimedia.org/T294141) [07:19:15] (03CR) 10Elukey: [C: 03+1] Revert the active hive server to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/734342 (owner: 10Btullis) [07:19:32] (03CR) 10Elukey: [C: 03+2] helmfile.d: test another STORAGE_URI for revscoring-articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/734559 (https://phabricator.wikimedia.org/T294141) (owner: 10Elukey) [07:21:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[07:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:16] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:35:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10MGerlach) @Dzahn thank you for your help. I realized we also forgot to ask for access to LDAP ([[ https://ldap.toolforge.org/group/nda | nda-group ]... 
[07:51:46] (03PS1) 10Urbanecm: Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734321 (https://phabricator.wikimedia.org/T280307) [07:52:05] (03PS1) 10Urbanecm: Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734322 (https://phabricator.wikimedia.org/T280307) [07:52:45] jouncebot: nowandnext [07:52:45] No deployments scheduled for the next 3 hour(s) and 7 minute(s) [07:52:46] In 3 hour(s) and 7 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1100) [07:52:55] (03CR) 10Urbanecm: [C: 03+2] Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734321 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [07:53:06] (03CR) 10Urbanecm: [C: 03+2] Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734322 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:09:16] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 155, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:10:28] (03CR) 10jerkins-bot: [V: 04-1] Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734321 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:10:41] :( [08:10:56] * urbanecm hopes for gate-and-submit to do better [08:11:21] and https://integration.wikimedia.org/zuul/ doesn't make me happy [08:13:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31897/console" [puppet] - 10https://gerrit.wikimedia.org/r/734408 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [08:13:39] apparently...mwext-php74-phan-docker fails with 
suggestions requiring php version 7.4 or newer, while our very own servers still use php7.2? [08:14:13] An email went out [08:14:24] Spookreeeno: do you have a link or more details? [08:14:27] (or a phab ticket, ideally) [08:14:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "Neat!" [puppet] - 10https://gerrit.wikimedia.org/r/734401 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [08:14:50] ...and browser tests fail with PHP Notice: Undefined index: rc_logid [08:15:10] urbanecm: https://phabricator.wikimedia.org/T293947#7457548 [08:15:34] well, let's do what James_F wants us to do there then :) [08:16:14] I think you'll need Zuul access to reload it [08:16:40] sure [08:17:31] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/734408 (https://phabricator.wikimedia.org/T294080) (owner: 10Dzahn) [08:17:33] urbanecm: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Deploy_configuration [08:18:12] (03CR) 10jerkins-bot: [V: 04-1] Add purgeExpiredMentorStatus.php [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734322 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:18:31] 10SRE, 10Observability-Metrics, 10Patch-For-Review: Occasional rsync race while syncing /var/lib/grafana - https://phabricator.wikimedia.org/T294080 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete thanks to the help from @Dzahn ! [08:18:31] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "tests are throwing a tantrum. This is adding a simple maint script, with zero chance of something breaking automagically. Forcemerging." [extensions/GrowthExperiments] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734321 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:19:22] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "tests are throwing a tantrum. This is adding a simple maint script, with zero chance of something breaking automagically. 
Forcemerging" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734322 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:19:34] let's do this in the meantime [08:20:17] this reminds me of https://bash.toolforge.org/quip/AU7VVvvk6snAnmqnK_z3 a lot [08:21:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:30] * urbanecm testing his backports [08:28:18] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10fgiunchedi) a:05fgiunchedi→03Papaul [08:30:19] works, syncing [08:31:29] (03PS1) 10Filippo Giunchedi: install_server: simplify custom prometheus.cfg [puppet] - 10https://gerrit.wikimedia.org/r/734564 (https://phabricator.wikimedia.org/T294302) [08:31:47] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/GrowthExperiments/maintenance: 91316ed5714c4426a29fefded5c4db08dbba48bb: Add purgeExpiredMentorStatus.php (T280307) (duration: 00m 56s) [08:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:54] T280307: Mentor dashboard: M2 mentor tools/settings - https://phabricator.wikimedia.org/T280307 [08:31:56] * urbanecm done [08:32:09] (03CR) 10Filippo Giunchedi: "Note I haven't tried it out yet, we will when the racking task is executed" [puppet] - 10https://gerrit.wikimedia.org/r/734564 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi) [08:33:04] !log upload varnish_6.0.8-1wm2 to component/varnish6 on apt.wm.org T293879 [08:33:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:11] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [08:33:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:55] (03PS1) 10Urbanecm: growthexperiments.pp: Remove absented job [puppet] - 10https://gerrit.wikimedia.org/r/734565 (https://phabricator.wikimedia.org/T278103) [08:36:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:37:10] (03CR) 10Kormat: [C: 03+1] wmnet: Decrease TTL for m5-master [dns] - 10https://gerrit.wikimedia.org/r/734449 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [08:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:08] (03PS1) 10Urbanecm: growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) [08:40:51] (03CR) 10jerkins-bot: [V: 04-1] growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:40:53] (03PS2) 10Urbanecm: growthexperiments.pp: Remove absented job [puppet] - 10https://gerrit.wikimedia.org/r/734565 (https://phabricator.wikimedia.org/T278103) [08:41:05] (03PS2) 10Urbanecm: growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) [08:41:11] (03PS3) 10Urbanecm: growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) [08:42:08] (03CR) 10jerkins-bot: [V: 04-1] 
growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) (owner: 10Urbanecm) [08:42:36] (03PS4) 10Urbanecm: growthexperiments.pp: Run purgeExpiredMentorStatus.php twice a month [puppet] - 10https://gerrit.wikimedia.org/r/734568 (https://phabricator.wikimedia.org/T280307) [08:42:57] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] graphite: set CLUSTER_SERVERS empty with no remote servers [puppet] - 10https://gerrit.wikimedia.org/r/734225 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [08:43:41] jouncebot: next [08:43:41] In 2 hour(s) and 16 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1100) [08:44:22] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:46:26] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:52:38] (03PS2) 10Btullis: Revert the active hive server to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/734342 [08:56:15] 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 (10MatthewVernon) 05Open→03Resolved [08:58:26] (03CR) 10Btullis: [C: 03+2] Revert the active hive server to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/734342 (owner: 10Btullis) [09:02:22] (03CR) 10Jbond: [C: 03+1] growthexperiments.pp: Remove absented job [puppet] - 10https://gerrit.wikimedia.org/r/734565 (https://phabricator.wikimedia.org/T278103) (owner: 10Urbanecm) [09:04:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/734391 (https://phabricator.wikimedia.org/T294166) (owner: 10Herron) [09:05:25] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10Aklapper) Hi, please file a separate ticket under https://phabricator.wikimedia.org/tag/ldap-access-requests/ (different basket). Thanks! [09:07:07] (03CR) 10Btullis: [C: 03+1] "Looks good. +1" [puppet] - 10https://gerrit.wikimedia.org/r/734368 (https://phabricator.wikimedia.org/T291664) (owner: 10Ottomata) [09:11:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] "thanks" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) (owner: 10Legoktm) [09:15:26] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10cloud-services-team (Kanban): db1112 - DIMM replacement (was: db1112 (s3 contribs/rc replica) is down) - https://phabricator.wikimedia.org/T294295 (10Kormat) `mysqlcheck --all-databases` completed successfully. Started replication again. Will run `db-compare` agains... 
[09:17:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Effeietsanders - https://phabricator.wikimedia.org/T294038 (10MGerlach) @Aklapper thanks for the pointer. created separate ticket for ldap access T294328. [09:25:00] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "discovery: move read traffic to graphite2003" [dns] - 10https://gerrit.wikimedia.org/r/734277 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [09:25:03] (03PS2) 10Filippo Giunchedi: Revert "discovery: move read traffic to graphite2003" [dns] - 10https://gerrit.wikimedia.org/r/734277 (https://phabricator.wikimedia.org/T247963) [09:25:18] ACKNOWLEDGEMENT - MariaDB Replica Lag: s3 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 39370.70 seconds Kormat db1112 is catching up on replication T294295 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:27:12] !log move read traffic back to graphite1004 - T247963 [09:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:20] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 [09:27:53] (03PS2) 10Filippo Giunchedi: Revert "monitoring: check graphite2003 metrics" [puppet] - 10https://gerrit.wikimedia.org/r/734279 (https://phabricator.wikimedia.org/T247963) [09:29:49] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "monitoring: check graphite2003 metrics" [puppet] - 10https://gerrit.wikimedia.org/r/734279 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [09:35:50] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:36:00] RECOVERY - MariaDB Replica Lag: s3 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.22 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:36:24] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:38:56] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "statsd: failover writes to graphite2003" [puppet] - 10https://gerrit.wikimedia.org/r/734278 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [09:39:15] (03PS2) 10Filippo Giunchedi: Revert "wmnet: move writes to graphite2003" [dns] - 10https://gerrit.wikimedia.org/r/734280 (https://phabricator.wikimedia.org/T247963) [09:39:44] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "wmnet: move writes to graphite2003" [dns] - 10https://gerrit.wikimedia.org/r/734280 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [09:40:19] !log flip back write traffic to graphite1004 (all but mediawiki) - T247963 [09:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:26] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 [09:42:33] (03PS1) 10David Caro: sre.hosts.reimage: handle switches without virtual chassis [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) [09:44:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10dcaro) Sent a patch to fix the issue (the new script was expecting the switch to have a virtual chassis). 
[09:47:13] !log bounce navtiming on webperf1001 to pick up statsd changes - T247963 [09:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 [09:49:16] !log bounce superset on an-tool1010 to pick up statsd changes - T247963 [09:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:24] !log bounce superset on an-tool1005 to pick up statsd changes - T247963 [09:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:49] dpifke btullis ^ FYI restarts of navtiming and superset for statsd/graphite failover [09:51:19] (03PS1) 10Zabe: Start writing to some wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) [09:58:37] jbond: thanks for reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/734565/, I'd appreciate a +2 though -- not a root, so can't do that myself :)) [09:59:26] jouncebot: next [09:59:26] In 1 hour(s) and 0 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1100) [10:00:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734281 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [10:00:58] (03PS2) 10Majavah: toolforge: Update to ingress-nginx v1.0 [puppet] - 10https://gerrit.wikimedia.org/r/734294 (https://phabricator.wikimedia.org/T292771) [10:01:10] (03Merged) 10jenkins-bot: Revert "ProductionServices: use graphite2003 for statsd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734281 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [10:01:32] <_joe_> that was fast :P [10:02:38] indeed, good job jenkins [10:03:51] (03PS1) 10Zabe: Migrate calls of wmf* constants to 
wmg* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734573 (https://phabricator.wikimedia.org/T45956) [10:05:19] <_joe_> I just maligned its performance with jelto, that's why it wanted to stick it up to me [10:05:26] <_joe_> scapping [10:06:27] !log oblivian@deploy1002 Synchronized wmf-config/ProductionServices.php: Switching back graphite to eqiad (duration: 01m 04s) [10:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:10] <_joe_> godog: in a few seconds you should see the firehose switching sites [10:07:20] _joe_: thank you [10:07:22] "can't wait" [10:07:26] (not) [10:07:47] !log oblivian@deploy1002 Synchronized tests/WmfConfigServicesTest.php: Switching back graphite to eqiad (duration: 00m 55s) [10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:09] for folks following at home, statsd traffic goes from 4MB/s to 60MB/s with the firehose on [10:08:15] that's not a typo, bytes not bits [10:08:20] <_joe_> lol [10:08:24] (03PS1) 10Zabe: Migrate wmfHostnames to wmgHostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) [10:08:45] <_joe_> ok all done [10:08:54] thanks _joe_, appreciate it [10:09:40] FWIW this failover was much smoother than the last, "only" three daemons to restart since they don't pick up dns changes [10:09:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:09:46] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-12), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10dom_walden) > [*] Beta cluster > [] QAed on the beta cluster > [] Prod deployment > [] Docker imag... [10:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] (03CR) 10Elukey: [C: 03+2] tlsproxy::localssl: acme_chief should notify nginx [puppet] - 10https://gerrit.wikimedia.org/r/732611 (https://phabricator.wikimedia.org/T293826) (owner: 10Elukey) [10:14:35] (03CR) 10Volans: "LGTM, small nit inline. I'd like Arzhel to also have a look to check if this will work also for the new network setup in drmrs and the eqi" [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [10:17:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:18:12] (03PS1) 10Jbond: changelog: update changelog [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734575 [10:18:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] changelog: update changelog [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734575 (owner: 10Jbond) [10:19:30] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:22:38] (03PS2) 10David Caro: sre.hosts.reimage: handle switches without virtual chassis [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) [10:22:49] (03CR) 10David Caro: sre.hosts.reimage: handle switches without virtual chassis (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [10:23:09] 10SRE, 10Traffic, 10observability, 10Discovery-Search (Current work): flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (10elukey) The nginx problems should be fixed with https://gerrit.wikimedia.org/r/732611 in theory (so elastic/cloudelastic/relfo... [10:29:21] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [10:30:12] (03Abandoned) 10Zabe: test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734428 (owner: 10Zabe) [10:38:09] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-12), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [10:38:16] godog: with T288458 and T294001 both dealt with and replication / dispersion back, going to repool codfw swift and swift-ro [10:38:16] T294001: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T294001 [10:38:17] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 [10:39:29] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=swift-ro [10:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:04] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=swift [10:40:08] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:01] (03PS1) 10Zabe: Consistently write to $wmgRealm the same value as to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) [10:46:12] !log upload cas_6.4.2-1+wmf10u2_amd64.deb [10:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:56] jouncebot: next [10:48:56] In 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1100) [10:49:34] (03CR) 10Urbanecm: "Definitely secure a +1 before requesting deployment here, please. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [10:51:13] (03CR) 10Urbanecm: [C: 04-1] "looks good mostly, the 15% increase is weird though. Can we split it to a separate patch, if the reason is not clear to you immediately?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) (owner: 10Zabe) [10:51:54] (03CR) 10Urbanecm: "I'd appreciate a CR+1 beforehand here -- thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734582 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [10:51:57] zabe: ^^ [10:54:28] (03PS5) 10Zabe: Fix HD logo size in some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) [10:55:43] (03CR) 10Zabe: "I removed the slwikiquote logo from this patch for now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) (owner: 10Zabe) [10:56:20] (03PS1) 10Btullis: Add a temporary firewall rule to support cassandra3 migration [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1100). [11:00:04] zabe: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] o/ [11:00:19] i can deploy today [11:00:21] pls wait [11:00:23] ok [11:00:26] sure [11:00:27] (to both) [11:01:01] (03PS6) 10Zabe: Fix HD logo size in some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) [11:01:17] now I'm ready :) [11:01:34] let's do it then [11:01:39] (03CR) 10Urbanecm: [C: 03+2] Fix HD logo size in some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) (owner: 10Zabe) [11:02:14] zabe: I'm going to deploy only the logos patch. i'm not used to the other parts you change, so I'd appreciate a +1 before shipping. [11:02:22] (03Merged) 10jenkins-bot: Fix HD logo size in some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734429 (https://phabricator.wikimedia.org/T250731) (owner: 10Zabe) [11:02:31] that was quick [11:03:03] urbanecm: understandable, I already removed from calendar. 
[11:03:10] thanks [11:03:18] zabe: patch is at mwdebug1001 [11:04:02] looking [11:05:39] (03PS1) 10Btullis: Purge any unmanaged files from /etc/security/keytabs [puppet] - 10https://gerrit.wikimedia.org/r/734612 (https://phabricator.wikimedia.org/T294124) [11:07:54] 10SRE-swift-storage: Monitoring (?+alerting) for Swift capacity - https://phabricator.wikimedia.org/T294019 (10LSobanski) [11:08:14] 10SRE-swift-storage: Swift-recon -d overstates disk capacity and usage - https://phabricator.wikimedia.org/T294016 (10LSobanski) [11:08:54] urbanecm: looks good to me, they obviously get a bit smaller, but that is expected [11:09:02] thanks, let's sync [11:09:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:15] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 575a6a66b279c3d2d8974ffcc4911cc5b927be47: Fix HD logo size in some wikis (T250731; 1/2) (duration: 00m 57s) [11:12:11] !log urbanecm@deploy1002 Synchronized logos/config.yaml: 575a6a66b279c3d2d8974ffcc4911cc5b927be47: Fix HD logo size in some wikis (T250731; 2/2) (duration: 00m 55s) [11:12:19] zabe: done. lemme purge the cache now [11:12:21] anything else? [11:12:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:13:06] (03PS1) 10ArielGlenn: puppetize directories for Wikimedia Enterprise dumps [puppet] - 10https://gerrit.wikimedia.org/r/734613 (https://phabricator.wikimedia.org/T273585) [11:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:11] T250731: Change HD logos with incorrect size to match expectations - https://phabricator.wikimedia.org/T250731 [11:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:19] (03PS1) 10Zabe: Fix HD logo size at slwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734614 (https://phabricator.wikimedia.org/T250731) [11:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:40] no [11:13:48] (03CR) 10jerkins-bot: [V: 04-1] puppetize directories for Wikimedia Enterprise dumps [puppet] - 10https://gerrit.wikimedia.org/r/734613 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:14:03] I don't really know why the slwikiquote one gets bigger, maybe the current one is just poor quality? [11:14:15] I'm not sure either [11:14:35] if it was like a 1%, I wouldn't care [11:14:39] but 15% is quite a lot [11:14:44] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) Parsoid testing, original images can be found at [[https://people.wikimedia.org/~jiji/benchmarks-parsoid/ | https://people.wik... 
[11:14:54] agree [11:20:20] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:21:18] (03PS2) 10ArielGlenn: puppetize directories for Wikimedia Enterprise dumps [puppet] - 10https://gerrit.wikimedia.org/r/734613 (https://phabricator.wikimedia.org/T273585) [11:22:26] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:24:10] (03PS1) 10Urbanecm: Add namespace translations for [ami] Amis and [pwn] Paiwan [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734587 (https://phabricator.wikimedia.org/T292414) [11:24:27] (03PS1) 10Urbanecm: Add namespace translations for [ami] Amis and [pwn] Paiwan [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734588 (https://phabricator.wikimedia.org/T292414) [11:24:42] (03CR) 10Urbanecm: [C: 03+2] "needed in prod, as amiwiki/pwnwiki are getting live soon" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734588 (https://phabricator.wikimedia.org/T292414) (owner: 10Urbanecm) [11:24:48] (03CR) 10Urbanecm: [C: 03+2] "needed in prod, as amiwiki/pwnwiki are getting live soon" [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734587 (https://phabricator.wikimedia.org/T292414) (owner: 10Urbanecm) [11:33:14] (03CR) 10Elukey: Add a temporary firewall rule to support cassandra3 migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [11:34:29] (03PS3) 10Majavah: toolforge: Update to ingress-nginx v1.0 [puppet] - 10https://gerrit.wikimedia.org/r/734294 (https://phabricator.wikimedia.org/T292771) [11:35:26] (03CR) 10ArielGlenn: [C: 03+2] puppetize directories for Wikimedia 
Enterprise dumps [puppet] - 10https://gerrit.wikimedia.org/r/734613 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:42:52] Emperor: ack, thanks! [11:44:38] (03CR) 10jerkins-bot: [V: 04-1] Add namespace translations for [ami] Amis and [pwn] Paiwan [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734588 (https://phabricator.wikimedia.org/T292414) (owner: 10Urbanecm) [11:45:40] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10fgiunchedi) [11:46:05] (03PS1) 10Urbanecm: RecentChangeFactory: Add missing 'rc_logid' value [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734589 (https://phabricator.wikimedia.org/T293885) [11:46:32] (03CR) 10Urbanecm: [C: 03+2] RecentChangeFactory: Add missing 'rc_logid' value [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734589 (https://phabricator.wikimedia.org/T293885) (owner: 10Urbanecm) [11:47:08] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "T293885" [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734588 (https://phabricator.wikimedia.org/T292414) (owner: 10Urbanecm) [11:49:22] PROBLEM - snapshot of x1 in codfw on alert1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2021-10-23 11:24:40 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:49:27] !log urbanecm@deploy1002 Started scap: c131f32e5e0804c8f5c2ec768b334c81a1b35151: Add namespace translations for [ami] Amis and [pwn] Paiwan (T292414, T292415) [11:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:35] T292415: Create Wikipedia Paiwan - https://phabricator.wikimedia.org/T292415 [11:49:35] T292414: Create Wikipedia Amis - https://phabricator.wikimedia.org/T292414 [11:51:52] !log urbanecm@deploy1002 Finished scap: c131f32e5e0804c8f5c2ec768b334c81a1b35151: Add namespace translations for [ami] Amis and [pwn] Paiwan (T292414, 
T292415) (duration: 02m 25s) [11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:07] that was...very quick [11:52:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:40] (03PS2) 10Filippo Giunchedi: install_server: simplify custom prometheus.cfg [puppet] - 10https://gerrit.wikimedia.org/r/734564 (https://phabricator.wikimedia.org/T294302) [11:55:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "LGTM, but the inline comment is important." [puppet] - 10https://gerrit.wikimedia.org/r/734294 (https://phabricator.wikimedia.org/T292771) (owner: 10Majavah) [12:03:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:04:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:48] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:06:50] (03Abandoned) 10Urbanecm: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734450 (owner: 104nn1l2) [12:06:55] (03PS2) 10Urbanecm: Temporarily change the votewiki lang to fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734451 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [12:07:08] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734451 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [12:07:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:30] (03PS1) 10Jbond: build.gradle: drop old netty libraries [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734621 [12:10:00] (03CR) 10Jbond: [V: 03+2 C: 03+2] build.gradle: drop old netty libraries [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734621 (owner: 10Jbond) [12:10:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: remove duplicate listen 80 [puppet] - 10https://gerrit.wikimedia.org/r/734227 (owner: 10Majavah) [12:13:46] (03PS1) 10ArielGlenn: add the Wikimedia Enterprise content downloader script [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) [12:15:15] (03CR) 10jerkins-bot: [V: 04-1] add the Wikimedia Enterprise content downloader script [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [12:16:08] !log upload cas_6.4.2-1+wmf10u3_amd64 [12:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:49] (03PS2) 10Marostegui: wmnet: Decrease TTL for m5-master [dns] - 
10https://gerrit.wikimedia.org/r/734449 (https://phabricator.wikimedia.org/T288093) [12:19:49] (03CR) 10Marostegui: [C: 03+2] wmnet: Decrease TTL for m5-master [dns] - 10https://gerrit.wikimedia.org/r/734449 (https://phabricator.wikimedia.org/T288093) (owner: 10Marostegui) [12:20:42] (03PS1) 10Jbond: idp: move live service back to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/734625 [12:21:27] (03PS2) 10Jbond: idp: move live service back to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/734625 [12:21:44] (03CR) 10Jbond: [C: 03+2] idp: move live service back to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/734625 (owner: 10Jbond) [12:23:18] (03PS2) 10Btullis: Add a temporary firewall rule to support cassandra3 migration [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) [12:24:43] (03CR) 10Btullis: Add a temporary firewall rule to support cassandra3 migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [12:25:35] (03PS2) 10Btullis: Add three more HDFS related checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399) [12:25:40] (03CR) 10Elukey: [C: 03+1] Add a temporary firewall rule to support cassandra3 migration [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [12:25:59] (03CR) 10Volans: add the Wikimedia Enterprise content downloader script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [12:26:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:28:00] (03CR) 10Btullis: Remove all remaining references to alluxio 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:28:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:28:46] (03PS2) 10ArielGlenn: add the Wikimedia Enterprise content downloader script [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) [12:29:03] I am going to sync mediawiki to prepare 1.38.0-wmf.6 deployment which will happen tonight [12:29:30] (03Merged) 10jenkins-bot: RecentChangeFactory: Add missing 'rc_logid' value [extensions/Wikibase] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734589 (https://phabricator.wikimedia.org/T293885) (owner: 10Urbanecm) [12:30:44] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31899/console" [puppet] - 10https://gerrit.wikimedia.org/r/734612 (https://phabricator.wikimedia.org/T294124) (owner: 10Btullis) [12:31:08] !log scap prep 1.38.0-wmf.6 # T293947 [12:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:15] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [12:32:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] !log Applied security patches to 1.38.0-wmf.6 # T293947 [12:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:58] (03CR) 10Btullis: Add a temporary firewall rule to support cassandra3 migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [12:34:01] (03CR) 10Btullis: [C: 03+2] Add a temporary firewall rule to support cassandra3 migration [puppet] - 10https://gerrit.wikimedia.org/r/734609 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [12:35:00] !log scap clean --delete 1.38.0-wmf.4 # T293947 [12:35:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:45] (03PS1) 10Jbond: admin.yaml: drop speed and function users [puppet] - 10https://gerrit.wikimedia.org/r/734626 [12:38:21] (03PS2) 10Jbond: puppetboard: add puppetboard as an active/active service [dns] - 10https://gerrit.wikimedia.org/r/734262 [12:40:11] (03PS3) 10ArielGlenn: add the Wikimedia Enterprise content downloader script [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) [12:41:12] (03CR) 10Jbond: [C: 03+1] Remove all remaining references to alluxio [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:42:20] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:46] Doing cxserver deployment. Anything on deploy1002 going on? 
[12:42:50] (03CR) 10Wolfgang Kandek: [V: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/734626 (owner: 10Jbond) [12:42:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/734612 (https://phabricator.wikimedia.org/T294124) (owner: 10Btullis) [12:44:31] (03CR) 10Jbond: "this has been deployed now, thanks 😊" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/734350 (https://phabricator.wikimedia.org/T199812) (owner: 10Legoktm) [12:44:51] (03CR) 10Btullis: [V: 03+1 C: 03+2] Purge any unmanaged files from /etc/security/keytabs [puppet] - 10https://gerrit.wikimedia.org/r/734612 (https://phabricator.wikimedia.org/T294124) (owner: 10Btullis) [12:45:21] (03CR) 10Btullis: Remove all remaining references to alluxio (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:45:26] (03CR) 10Btullis: [C: 03+2] Remove all remaining references to alluxio [puppet] - 10https://gerrit.wikimedia.org/r/732719 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:51:42] (03CR) 10David Caro: toolforge::cronrunner: disable cron on non-active hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: 10Majavah) [12:51:54] (03CR) 10Arturo Borrero Gonzalez: "hey folks, what are current plans to merge this patch?" [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [12:52:14] OK. Going ahead.. [12:53:16] (03PS2) 10KartikMistry: Update cxserver to 2021-10-25-123807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734288 (https://phabricator.wikimedia.org/T217747) [12:53:58] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Update to CAS 6.4 - https://phabricator.wikimedia.org/T293186 (10jbond) 05Open→03Resolved p:05Triage→03Medium a:03jbond CAS has now been upgraded to 6.4.2. 
[12:55:11] (03PS1) 10Filippo Giunchedi: hieradata: disable profile::base::production in Pontoon [puppet] - 10https://gerrit.wikimedia.org/r/734632 [12:55:34] jbond dcaro ^ [12:56:18] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/734632 (owner: 10Filippo Giunchedi) [12:56:25] godog: done :) [12:56:56] cheers [12:57:02] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: disable profile::base::production in Pontoon [puppet] - 10https://gerrit.wikimedia.org/r/734632 (owner: 10Filippo Giunchedi) [12:58:10] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2021-10-25-123807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734288 (https://phabricator.wikimedia.org/T217747) (owner: 10KartikMistry) [12:59:32] (03PS2) 10Jbond: admin.yaml: drop speed and function users [puppet] - 10https://gerrit.wikimedia.org/r/734626 [12:59:40] (03CR) 10Jbond: [C: 03+2] admin.yaml: drop speed and function users [puppet] - 10https://gerrit.wikimedia.org/r/734626 (owner: 10Jbond) [13:01:54] godog: happy for me to merge yours? 
[13:02:13] jbond: yes please, sorry I forgot [13:02:18] np merged [13:02:57] (03Merged) 10jenkins-bot: Update cxserver to 2021-10-25-123807-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/734288 (https://phabricator.wikimedia.org/T217747) (owner: 10KartikMistry) [13:03:02] (03PS3) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) [13:03:13] (03CR) 10Majavah: toolforge::cronrunner: disable cron on non-active hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/732986 (https://phabricator.wikimedia.org/T284767) (owner: 10Majavah) [13:05:06] !log hashar@deploy1002 Pruned MediaWiki: 1.38.0-wmf.4 (duration: 31m 07s) [13:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:34] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [13:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:59] (03PS6) 10Jbond: P:puppetmaster::common: Add back logstash support [puppet] - 10https://gerrit.wikimedia.org/r/719372 (https://phabricator.wikimedia.org/T222826) [13:06:54] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Effeietsanders - https://phabricator.wikimedia.org/T294328 (10Ottomata) Also approved. [13:13:32] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:08] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[13:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:29] 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Kormat) [13:21:21] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) [13:21:53] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) Moved the dc-ops request to a subtask {T294345} to simplify tracking for them. [13:24:09] !log Updated cxserver to 2021-10-25-123807-production (T217747, T218217, T292421) [13:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:18] T292421: Post-creation work for amiwiki - https://phabricator.wikimedia.org/T292421 [13:24:19] T217747: cxserver's swagger spec fails to validate - https://phabricator.wikimedia.org/T217747 [13:24:19] T218217: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 [13:27:08] 10SRE, 10CX-cxserver, 10Citoid, 10Math, and 9 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10KartikMistry) [13:30:03] (03Abandoned) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734359 (https://phabricator.wikimedia.org/T294189) (owner: 10Odder) [13:31:46] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Effeietsanders - https://phabricator.wikimedia.org/T294328 (10ssingh) 05Open→03Resolved a:03ssingh @Effeietsanders: you have been added to the `nda` group. Please let me know if you have any questions or if it doesn't work, thank you! 
[13:35:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM for the alerts itself, can't comment effectively on the semantics" [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [13:38:05] (03PS3) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 (https://phabricator.wikimedia.org/T294189) [13:40:17] !log ran "Capirca Host Definition" script on netbox-next to get up-to-date aqs_group host definition - result https://netbox-next.wikimedia.org/extras/scripts/results/894348/ [13:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:27] jouncebot: nowandnext [13:43:27] No deployments scheduled for the next 2 hour(s) and 16 minute(s) [13:43:27] In 2 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1600) [13:45:05] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.5/extensions/Wikibase: 7723cf724df9ede49129443e43336e93efcd7a41: RecentChangeFactory: Add missing rc_logid value (T293885) (duration: 01m 02s) [13:45:09] * urbanecm done [13:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:12] T293885: wmf-quibble-selenium-php72-docker jobs failing repeatedly: Undefined index: rc_logid - https://phabricator.wikimedia.org/T293885 [13:49:26] PROBLEM - SSH on puppetmaster1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:50:31] !log ran "Capirca Host Definition" script on netbox - output https://netbox.wikimedia.org/extras/scripts/results/1787315/ [13:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:42] (03PS4) 10Odder: Add mobile wordmark for Meetei (Manipuri) Wikipedia to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734361 
(https://phabricator.wikimedia.org/T294189) [14:00:48] (03PS8) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) [14:01:37] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [14:04:14] (03PS9) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) [14:05:14] (03PS1) 10Btullis: Add access to port 7000 on aqs_group temporarily [homer/public] - 10https://gerrit.wikimedia.org/r/734643 (https://phabricator.wikimedia.org/T291472) [14:05:53] (03CR) 10Jbond: "Thanks for all the responses and sorry for the delay in refreshing. the latest commit i believe includes all previous comments and ready " [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [14:09:29] (03PS1) 10Jbond: hiera - cloud: add defaults for P:puppetboard::ng [puppet] - 10https://gerrit.wikimedia.org/r/734646 [14:10:01] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add defaults for P:puppetboard::ng [puppet] - 10https://gerrit.wikimedia.org/r/734646 (owner: 10Jbond) [14:11:46] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:18:56] (03PS1) 10Volans: setup.py: include type hints for dependencies [software/homer] - 10https://gerrit.wikimedia.org/r/734648 [14:18:58] (03PS1) 10Volans: pylint: fixed newly reported issues [software/homer] - 10https://gerrit.wikimedia.org/r/734649 [14:19:00] (03PS1) 10Volans: transports: catch connection error [software/homer] - 10https://gerrit.wikimedia.org/r/734650 [14:20:13] (03CR) 10Cathal Mooney: [C: 03+1] "Seems fairly 
straightforward yep. LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/734643 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [14:20:50] (03CR) 10Btullis: [C: 03+2] Add access to port 7000 on aqs_group temporarily [homer/public] - 10https://gerrit.wikimedia.org/r/734643 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [14:21:59] (03Merged) 10jenkins-bot: Add access to port 7000 on aqs_group temporarily [homer/public] - 10https://gerrit.wikimedia.org/r/734643 (https://phabricator.wikimedia.org/T291472) (owner: 10Btullis) [14:23:33] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) >>! In T288851#7456964, @Ottomata wrote: > As we make these decisions, I'd love if we could keep {T291645} in mind. > >> What topic shoul... [14:23:49] (03CR) 10jerkins-bot: [V: 04-1] setup.py: include type hints for dependencies [software/homer] - 10https://gerrit.wikimedia.org/r/734648 (owner: 10Volans) [14:25:10] (03CR) 10Btullis: [C: 03+2] Add three more HDFS related checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [14:26:28] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:27:24] (03Merged) 10jenkins-bot: Add three more HDFS related checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/732993 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [14:29:20] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31900/console" [puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [14:29:26] 
10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) @colewhite my logs will (primarily) come from kubernetes; I don't see any `kubernetes.*` in the ECS docs, but I do need to add those tags l... [14:29:34] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] thumbor: Remove conditionalization for stretch [puppet] - 10https://gerrit.wikimedia.org/r/733033 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [14:31:57] (03CR) 10Jbond: "this is also currently running on pdev-ppc.puppet-dev.eqiad1.wikimedia.cloud let me know if you want access to take a look" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [14:33:27] (03CR) 10Volans: "CI failure is expected and fixed in the next CR, if I invert them the other one would fail for mypy and I thought it would be easier to re" [software/homer] - 10https://gerrit.wikimedia.org/r/734648 (owner: 10Volans) [14:34:06] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) 05Open→03Stalled >>! In T294295#7457805, @Kormat wrote: > `mysqlcheck --all-databases` completed successfully. Started replication again. Will run `db-compare` agai... [14:41:20] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Marostegui) +1 I would even leave it running till Monday and issue a mariadb restart on Monday issuing this first: ` stop slave; SET GLOBAL innodb_buffer_pool_dump_at_shutdown... 
[14:45:19] (03CR) 10Alexandros Kosiaris: service::docker: enhance volume support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605343 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:47:12] (03PS2) 10Addshore: Increase concurrency for EntityChangeNotification job [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [14:47:52] (03PS1) 10Cathal Mooney: Enabling OSPF in home config data for temp GRE tunnel from cr3-esams to asw1-b12-drmrs. This is work-around until transport link from Telxius is working. [homer/public] - 10https://gerrit.wikimedia.org/r/734655 (https://phabricator.wikimedia.org/T278394) [14:48:25] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) ` Hi Papaul, Please let me know if we can have someone re-seating the PEM 1. Let me know if re-seat clears the alarm. If this does not clear the alarm, then please provide requeste... [14:50:07] (03CR) 10Ayounsi: [C: 03+1] Enabling OSPF in home config data for temp GRE tunnel from cr3-esams to asw1-b12-drmrs. This is work-around until transport link from Telxi [homer/public] - 10https://gerrit.wikimedia.org/r/734655 (https://phabricator.wikimedia.org/T278394) (owner: 10Cathal Mooney) [14:50:18] (03CR) 10Andrew Bogott: [C: 03+2] Openstack haproxy: Revise keystone internal port [puppet] - 10https://gerrit.wikimedia.org/r/732087 (owner: 10Andrew Bogott) [14:50:20] RECOVERY - SSH on puppetmaster1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:52:09] (03CR) 10Cathal Mooney: [C: 03+2] Enabling OSPF in home config data for temp GRE tunnel from cr3-esams to asw1-b12-drmrs. 
This is work-around until transport link from Telxi [homer/public] - 10https://gerrit.wikimedia.org/r/734655 (https://phabricator.wikimedia.org/T278394) (owner: 10Cathal Mooney) [14:52:44] (03Merged) 10jenkins-bot: Enabling OSPF in home config data for temp GRE tunnel from cr3-esams to asw1-b12-drmrs. This is work-around until transport link from Telxius is working. [homer/public] - 10https://gerrit.wikimedia.org/r/734655 (https://phabricator.wikimedia.org/T278394) (owner: 10Cathal Mooney) [14:55:32] !log Adding static route on cr3-esams to asw1-b12-drmrs Telia link IP to allow GRE to be built. [14:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:10] (03Abandoned) 10Andrew Bogott: codfw1dev.wikimediacloud.org: Add new hostnames for tls openstack endpoints [dns] - 10https://gerrit.wikimedia.org/r/730879 (https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [15:02:41] !log cdanis@cumin1001 START - Cookbook sre.network.cf [15:02:44] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [15:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:35] (03CR) 10Andrew Bogott: [C: 03+1] wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - 10https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:05:39] (03CR) 10Andrew Bogott: [C: 03+1] "This lgtm. 
As far as I can tell this will be the first time we have multiple prefixes sharing the same classifiers so we'll need to keep a" [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:07:54] !log Running homer against cr3-esams to create new temp GRE tunnel to asw1-b12-drmrs [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:10] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:17:22] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/719368 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [15:23:00] (03PS1) 10Cwhite: opensearch roles: apply profile::base classes according to realm [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) [15:23:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "I’d like to try deploying this later (to get some experience with k8s deployments), but first a +1 from a WMF person would be nice :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [15:24:46] (03CR) 10Cwhite: "These are new roles defined prior to removing profile::standard. 
Is this the way you would expect them to be configured using the new pro" [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [15:27:34] !log cdanis@cumin1001 START - Cookbook sre.network.cf [15:27:37] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [15:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:11] (03CR) 10Ppchelko: [C: 03+1] "WMF person +1s ;)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [15:28:50] jouncebot: nowandnext [15:28:50] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [15:28:50] In 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1600) [15:29:10] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: handle switches without virtual chassis [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [15:29:51] (03CR) 10Volans: [C: 03+2] "Thanks for the patch!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [15:29:55] (03PS1) 10Ottomata: airflow - Allow access to webserver port [puppet] - 10https://gerrit.wikimedia.org/r/734661 [15:30:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Increase concurrency for EntityChangeNotification job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [15:31:16] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31903/console" [puppet] - 10https://gerrit.wikimedia.org/r/734661 (owner: 10Ottomata) [15:32:09] (03CR) 10Ottomata: airflow - Allow access to webserver port [puppet] - 10https://gerrit.wikimedia.org/r/734661 (owner: 10Ottomata) [15:32:51] (03PS1) 10Btullis: Remove three more HDFS checks from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/734662 (https://phabricator.wikimedia.org/T293399) [15:32:55] (03CR) 10AOkoth: "Going to abandon this change and create a new one." 
[puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [15:33:11] (03Merged) 10jenkins-bot: sre.hosts.reimage: handle switches without virtual chassis [cookbooks] - 10https://gerrit.wikimedia.org/r/734571 (https://phabricator.wikimedia.org/T284471) (owner: 10David Caro) [15:33:15] (03PS2) 10Ottomata: airflow - Allow access to webserver port [puppet] - 10https://gerrit.wikimedia.org/r/734661 [15:33:26] (03Abandoned) 10AOkoth: gitlab: disable puppet and rename files [puppet] - 10https://gerrit.wikimedia.org/r/734339 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [15:35:30] (03PS3) 10Ottomata: airflow - Allow access to webserver port [puppet] - 10https://gerrit.wikimedia.org/r/734661 [15:36:17] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31905/console" [puppet] - 10https://gerrit.wikimedia.org/r/734661 (owner: 10Ottomata) [15:36:23] (03Merged) 10jenkins-bot: Increase concurrency for EntityChangeNotification job [deployment-charts] - 10https://gerrit.wikimedia.org/r/731098 (owner: 10Michael Große) [15:36:29] (03PS1) 10AOkoth: gitlab: rename config & secrets backup file [puppet] - 10https://gerrit.wikimedia.org/r/734664 (https://phabricator.wikimedia.org/T283076) [15:38:02] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow - Allow access to webserver port [puppet] - 10https://gerrit.wikimedia.org/r/734661 (owner: 10Ottomata) [15:38:55] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [15:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:31] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[15:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:11] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:56] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:49:51] topranks: ^ [15:49:53] I assume you [15:51:18] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Ottomata) > Just to clarify, this is not a Mediawiki-generated trace, but rather something we obtain from php-fpm. I was thinking of using somet... [15:54:38] Spookreeeno: Yes my bad I'll fix up now. [15:54:41] thanks! [15:55:44] topranks: I assumed you were just distracted [15:57:37] well yeah lots of things moving around alright :) [15:59:09] (03PS1) 10David Caro: ceph::osd: add cinder backup hosts to ferm [puppet] - 10https://gerrit.wikimedia.org/r/734690 (https://phabricator.wikimedia.org/T292546) [15:59:36] (03CR) 10Jbond: "see comment and feel free to ping on irc" [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1600). [16:00:05] MichaelG_WMDE and Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:10] o/ [16:00:22] 👋 [16:00:22] oops, we forgot to take out the first change [16:00:28] * Lucas_WMDE edits [16:03:04] 10SRE, 10Analytics, 10Event-Platform, 10Observability-Logging, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10Ottomata) @colewhite, in https://phabricator.wikimedia.org/T288851#7456931 you said: > topics prefixed by rsyslog- will be automatically picked... [16:04:04] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:13] (03PS1) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) [16:07:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:14] (03PS1) 10Cwhite: base: amend notify_maintainers to decode ldap member and email ldap responses [puppet] - 10https://gerrit.wikimedia.org/r/734693 [16:12:27] (03CR) 10Cwhite: "Ran into this using notify_maintainers.py:" [puppet] - 10https://gerrit.wikimedia.org/r/734693 (owner: 10Cwhite) [16:12:30] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) The solution I adopted for the stacktrace is to register the head `file:line:function` triplet in `error.message` so it's easy to aggregate... [16:14:01] (03PS1) 10Ahmon Dancy: threedtopng::deploy: Only install nodejs-legacy on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/734694 (https://phabricator.wikimedia.org/T294148) [16:14:55] do we have someone for the puppet window? [16:18:48] guess they are not here but I might be able to help [16:19:23] looking at it. 
though I think regular review process might be better [16:20:30] looks at https://doc.wikimedia.org/Wikibase/master/php/ResubmitChanges_8php.html [16:20:41] (03PS2) 10Ahmon Dancy: threedtopng::deploy: Only install nodejs-legacy on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/734694 (https://phabricator.wikimedia.org/T294148) [16:21:26] (03CR) 10Dzahn: [C: 03+2] Regularly resubmit changes that might be stuck in wb_changes [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [16:22:08] mutante: Can you have a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/734694 while you're doing puppet stuff? [16:22:39] no, sorry, I can do one thing at a time. I can do the next thing after a meeting [16:23:07] 👍🏾 I can wait. Thanks! [16:23:08] deploying the new maintenance job [16:23:13] thanks! [16:23:48] !log mwmaint1002 - running puppet, created new mw periodic job from gerrit:732972 (T294031) [16:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:54] T294031: Run ResubmitChanges.php maint script regularly on Wikidata - https://phabricator.wikimedia.org/T294031 [16:24:45] !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_wikidata_resubmit_changes_for_dispatch [16:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:04] Lucas_WMDE: Main PID: 8128 (code=exited, status=0/SUCCESS) [16:25:11] :) looks good [16:25:24] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:25:30] !log cdanis@cumin1001 START - Cookbook sre.network.cf [16:25:31] if it would fail in the future it would create systemd alert in icinga [16:25:33] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [16:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:45] dancy: thanks, i got it in a little while [16:25:50] gotta run for now, bbiaw [16:25:56] See ya [16:26:23] mutante: ok, thanks! [16:26:25] (03CR) 10Dzahn: "16:24 < mutante> !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_wikidata_resubmit_changes_for_dispatch" [puppet] - 10https://gerrit.wikimedia.org/r/732972 (https://phabricator.wikimedia.org/T294031) (owner: 10Michael Große) [16:45:10] (03PS1) 10Zabe: Disable Education Program namespaces in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734697 (https://phabricator.wikimedia.org/T294365) [16:50:45] (03PS1) 10Cwhite: add stack.head field for aggregating events by stack head [software/ecs] - 10https://gerrit.wikimedia.org/r/734698 (https://phabricator.wikimedia.org/T288851) [17:00:05] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1700). 
[17:01:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31908/" [puppet] - 10https://gerrit.wikimedia.org/r/734690 (https://phabricator.wikimedia.org/T292546) (owner: 10David Caro) [17:05:15] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS (duration: 1100m 51s) [17:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:36] RECOVERY - Blazegraph process -wcqs-blazegraph- on wcqs1001 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:06:00] RECOVERY - Blazegraph Port for wcqs-blazegraph on wcqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:06:08] (03PS1) 10Arturo Borrero Gonzalez: cumin: aliases: add cloud ceph codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/734699 [17:06:09] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10colewhite) >>! In T288851#7458556, @Joe wrote: > @colewhite my logs will (primarily) come from kubernetes; I don't see any `kubernetes.*` in the... 
[17:06:45] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS [17:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:13] (03CR) 10Volans: [C: 03+1] "LGTM syntax wise" [puppet] - 10https://gerrit.wikimedia.org/r/734699 (owner: 10Arturo Borrero Gonzalez) [17:09:12] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:23] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS (duration: 02m 37s) [17:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:12] mutante: ah thanks! I checked on the deployment calendar a few minutes before the window but I guess I was too early :) sorry Lucas_WMDE [17:15:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks @volans for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/734699 (owner: 10Arturo Borrero Gonzalez) [17:16:06] (03CR) 10Cwhite: opensearch roles: apply profile::base classes according to realm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:16:11] (03Abandoned) 10Cwhite: opensearch roles: apply profile::base classes according to realm [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [17:16:14] (03CR) 10RLazarus: [C: 03+2] threedtopng::deploy: Only install nodejs-legacy on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/734694 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [17:16:36] Thanks Reuven! [17:18:35] no worries! [17:18:53] How long does it take for the change to arrive on the puppet masters? 
[17:19:12] I just merged it at the puppetmaster -- it'll roll out to individual hosts gradually over the next 30 minutes [17:19:21] thx [17:19:21] but I can force it to any individual machine sooner, if you like [17:20:21] Sure. `deployment-imagescaler04.deployment-prep.eqiad1.wikimedia.cloud` I have root there and just tried `puppet agent -t` but it still behaves the same as before. [17:25:12] dancy: the cronjob on wmcs puppetmasters to update ops/puppet runs every 10 minutes [17:25:26] aha! [17:25:41] I will wait 5 more minutes [17:26:53] fun trivia: it's like every minute for the shared puppetmaster, but every 10 mins for the per-project ones [17:28:12] (03PS1) 10Ebernhardson: query_service: jvm defines are provided with -D [puppet] - 10https://gerrit.wikimedia.org/r/734702 (https://phabricator.wikimedia.org/T280006) [17:28:18] dancy: deployment-puppetmaster seems to be up to date [17:28:31] (03CR) 10jerkins-bot: [V: 04-1] query_service: jvm defines are provided with -D [puppet] - 10https://gerrit.wikimedia.org/r/734702 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [17:28:32] thx.. Making another attempt. [17:29:51] Ok. One issue down.. One to go. [17:34:00] (03PS2) 10Ryan Kemper: query_service: jvm defines are provided with -D [puppet] - 10https://gerrit.wikimedia.org/r/734702 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [17:36:45] 10ops-eqiad, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10RobH) [17:39:17] 10ops-eqiad, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10RobH) a:03hnowlan Please note some of the racking details (hostnames, assuming same networking as other restbase, default of bullseye... 
[17:39:34] 10ops-eqiad, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10RobH) [17:40:01] (03CR) 10Ryan Kemper: [C: 03+2] query_service: jvm defines are provided with -D [puppet] - 10https://gerrit.wikimedia.org/r/734702 (https://phabricator.wikimedia.org/T280006) (owner: 10Ebernhardson) [17:41:41] (03PS1) 10Dduvall: hiera: Add hostname based lookup to secret hierarchy under labs [puppet] - 10https://gerrit.wikimedia.org/r/734703 [17:47:25] (03PS3) 10Legoktm: mediawiki::packages::fonts: replace fonts-liberation with fonts-liberation2 [puppet] - 10https://gerrit.wikimedia.org/r/728568 (https://phabricator.wikimedia.org/T253600) (owner: 10AntiCompositeNumber) [17:47:30] rzl: no problem [17:49:26] (03CR) 10Legoktm: [C: 03+2] mediawiki::packages::fonts: replace fonts-liberation with fonts-liberation2 [puppet] - 10https://gerrit.wikimedia.org/r/728568 (https://phabricator.wikimedia.org/T253600) (owner: 10AntiCompositeNumber) [17:50:49] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS [17:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:43] 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10RobH) [17:51:56] (03PS1) 10Volans: style: adopt f-strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/734707 [17:51:58] (03PS1) 10Volans: Adopt pathlib.Path everywhere [software/pywmflib] - 10https://gerrit.wikimedia.org/r/734708 [17:52:03] 10SRE, 10ops-codfw, 10Patch-For-Review: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Dzahn) a:05Dzahn→03None Chatted about this a bit. While it's low prio we would like the server back eventually. Whether it's through buying a mainboard or just replacing... 
[17:52:18] 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10RobH) a:03hnowlan Please note some of the racking details (hostnames, assuming same networking as other restbase, default of bullseye... [17:52:23] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@e908052] (wcqs): Deploy 0.3.90 to WCQS (duration: 01m 34s) [17:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:43] 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10RobH) [17:53:03] 10SRE, 10Performance-Team, 10Thumbor, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Jdforrester-WMF) a:05cmassaro→03None [17:53:37] (03Abandoned) 10Dzahn: conftool-data: remove mw2280.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/732756 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [17:53:58] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Patch-For-Review, 10User-notice: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (10Legoktm) 05Open→03Resolved a:03AntiCompositeNumber Done! [17:54:29] legoktm, ty! [17:54:40] yw, thanks for the patch and the research :) [17:57:17] (03CR) 10Dduvall: "According to the Puppet docs, there's also a trusted fact available in v5 called `trusted.certname`. 
Let me know if this would be a better" [puppet] - 10https://gerrit.wikimedia.org/r/734703 (owner: 10Dduvall) [17:57:58] 10SRE, 10serviceops: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Legoktm) [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1800) [18:02:29] (03CR) 10Nskaggs: "Just some minor typos to point out" [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [18:03:59] (03CR) 10Dzahn: [C: 03+1] "Yes, lgtm. This appears to match the comments from Jelto on https://phabricator.wikimedia.org/T283076#7454868" [puppet] - 10https://gerrit.wikimedia.org/r/734664 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:04:01] (03CR) 10Nskaggs: add the Wikimedia Enterprise content downloader script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734622 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [18:04:24] (03CR) 10Dzahn: [C: 03+2] gitlab: rename config & secrets backup file [puppet] - 10https://gerrit.wikimedia.org/r/734664 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:09:01] (03CR) 10Dzahn: "this just affects the restore-from-backup-script on gitlab-replica" [puppet] - 10https://gerrit.wikimedia.org/r/734664 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:09:42] (03CR) 10Dzahn: "@AOkoth deployed on gitlab2001, feel free to test" [puppet] - 10https://gerrit.wikimedia.org/r/734664 (https://phabricator.wikimedia.org/T283076) (owner: 10AOkoth) [18:10:13] (03PS1) 10Ssingh: dnsdist: update configuration template for dnsdist 1.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/734711 [18:10:49] dancy: I was about to get at it but rzl solved it already :) thanks rzl [18:11:10] Thanks for the consideration mutante. 
[18:11:53] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31910/console" [puppet] - 10https://gerrit.wikimedia.org/r/734711 (owner: 10Ssingh) [18:13:49] (03CR) 10Ssingh: [V: 03+1] "[Please don't merge yet.]" [puppet] - 10https://gerrit.wikimedia.org/r/734711 (owner: 10Ssingh) [18:15:08] (03PS2) 10Dduvall: hiera: Add hostname based lookup to secret hierarchy under labs [puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) [18:24:36] (03PS1) 10Ahmon Dancy: Thumbor: Choose python-logstash package based on distro [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) [18:25:07] (03CR) 10jerkins-bot: [V: 04-1] Thumbor: Choose python-logstash package based on distro [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [18:25:34] (03PS2) 10Ahmon Dancy: Thumbor: Choose python-logstash package based on distro [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) [18:29:24] (03PS3) 10Ahmon Dancy: Thumbor: Choose python-logstash package based on distro [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) [18:29:54] 10SRE-Access-Requests: (WIP) Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10RKemper) [18:30:37] 10SRE-Access-Requests: (WIP) Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10RKemper) [18:31:08] (03PS4) 10Ahmon Dancy: Thumbor: Choose python-logstash package based on distro [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) [18:32:24] 10SRE-Access-Requests: (WIP) Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10ssingh) a:03ssingh [18:40:14] PROBLEM - Prometheus jobs reduced availability on 
alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:41:22] 10SRE-Access-Requests: (WIP) Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10RKemper) I can upload some initial patches for the membership changes in a couple hours No action required by the SRE on clinic duty yet (until I remove the WIP label) [18:43:15] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) [18:44:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:44:37] (03PS1) 10Ssingh: dnsrecursor: prepare pdns-recursor for the 4.5.5 release [puppet] - 10https://gerrit.wikimedia.org/r/734714 [18:46:27] 10SRE, 10SRE-Access-Requests: (WIP) Requesting access to production for ejoseph - https://phabricator.wikimedia.org/T294379 (10Gehel) As Emmanuel's manager: approved We probably want approval from @odimitrijevic for the analytics access [18:47:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31913/console" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [18:47:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [18:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye 
[18:49:02] (03CR) 10Ssingh: "No changes for existing hosts as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/734714 (owner: 10Ssingh) [18:49:04] (03CR) 10Ahmon Dancy: [C: 03+1] "Ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [18:54:58] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Cmjohnson) This is the DIMM information. I am not sure if we have one in a spare server but I will spot a check a few. @wiki_willy we may want to purchase a DIMM. emory Device Array Handle: 0x1... [18:55:07] (03PS1) 10Andrew Bogott: nova first_boot vendor data: set PUPPET_LOCK inside puppet_is_running [puppet] - 10https://gerrit.wikimedia.org/r/734716 [18:56:51] (03PS1) 10Mbch331: Add language codes agq and mcn to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734717 (https://phabricator.wikimedia.org/T288335) [18:57:12] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10wiki_willy) a:03Cmjohnson Let me know if you're able to find a spare @Cmjohnson. If not, we can order one with @RobH. Thanks, Willy [18:59:57] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) @Jclark-ctr I noticed this moved out of D6, can you update task and netbox when you get a chance [19:00:04] twentyafterfour and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1900). 
[19:00:24] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:29] (03CR) 10Andrew Bogott: [C: 03+2] nova first_boot vendor data: set PUPPET_LOCK inside puppet_is_running [puppet] - 10https://gerrit.wikimedia.org/r/734716 (owner: 10Andrew Bogott) [19:09:08] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:09:33] uhm, I was going to deploy the train but maybe not without logstash? [19:09:54] hmm well it's working despite the alert [19:11:25] twentyafterfour: ignore that, it's just the mgmt interface of it and the alerts are flapping because it needs a firmware upgrade of the DRAC [19:11:40] as long as it has the .mgmt at the end dont worry [19:11:46] thanks mutante, I figured it was no big deal [19:11:49] yep [19:11:53] Cool [19:12:10] I just got paged for db1112 again... [19:12:10] (03PS1) 1020after4: testwikis wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734719 [19:12:12] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734719 (owner: 1020after4) [19:12:16] 👋 just got paged -- is that the downtime expiring? [19:12:54] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734719 (owner: 1020after4) [19:12:59] !log twentyafterfour@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.6 refs T293947 [19:13:00] hello [19:13:01] I don't see it in icinga though [19:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:05] rzl: I guess so? 
I don't see any new recent alerts other than logstash [19:13:06] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [19:13:08] Yeah, that host is depooled and also notifications are disabled [19:13:14] I wonder why it paged [19:13:26] ACKNOWLEDGEMENT - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:13:28] ah yeah, VO says "acknowledgement expired" [19:13:31] ACKNOWLEDGEMENT - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T283582 [19:13:36] so it must have been the ack on the VO side that ran out [19:13:44] Ah could be [19:14:02] (03CR) 10Jbond: setup.py: include type hints for dependencies (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/734648 (owner: 10Volans) [19:14:06] Notifications on icinga are definitely disabled - just checked [19:14:15] I marked it as explicitly resolved [19:14:18] yeah, explains why there was nothing in IRC [19:14:22] legoktm: thanks [19:14:38] i do not understand the VO model [19:14:53] icinga is all green with disabled notifications. ACK, now it's not enough anymore if we clean icinga/puppet though I guess. a small drawback of the external paging [19:15:17] I think it would have been fine if we had downtimed on icinga and *resolved* on VO [19:15:28] *nod* makes sense [19:15:34] but downtiming on icinga doesn't actually resolve the VO incident, I guess [19:15:42] I didn't expect that but it explains the facts [19:15:43] (03CR) 10Jbond: opensearch roles: apply profile::base classes according to realm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/734658 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [19:15:53] more like "ACKing on VO is not the same as resolved on VO", right? 
[19:15:57] right exactly [19:16:11] acknowledging in VO means it'll page again if the incident is still going, from VO's perspective [19:16:19] Ah I see... [19:16:32] Anyways, I am going back to the sofa. Thanks all! [19:16:34] similar to ACKing in Icinga means "stop talking about it but only until next time things change" [19:16:37] maybe someone from o11y can confirm though, I'm just reverse-engineering from the loud noises my phone makes :) [19:16:42] so we should always just resolve? [19:16:57] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [19:17:01] I think so kormat, yes, or we should use both, first ACK and later resolve when it's closed for real [19:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:02] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... [19:17:03] mutante: yeah, but it had notifications disabled on icinga since the EU morning when I disabled them [19:17:30] acking will be an important step When We Have A Real Oncall Rotation[tm], because it's what tells VO "don't escalate this to the next tier, I'm looking at it" [19:17:31] marostegui: yes, in Icinga all is as it should be. I was merely comparing what ACK means in VO compared to Icinga [19:17:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
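A minimal sketch of the ACK-versus-resolve behaviour described in the messages above, assuming a toy incident model: an acknowledged incident pages again once its acknowledgement expires, while a resolved one stays quiet. The class, the method names and the `ack_timeout` value are hypothetical illustrations, not VO's or Icinga's actual API or settings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class State(Enum):
    TRIGGERED = "triggered"
    ACKED = "acked"
    RESOLVED = "resolved"


@dataclass
class Incident:
    """Toy incident record (hypothetical; not the real VO data model)."""
    name: str
    state: State = State.TRIGGERED
    acked_at: Optional[int] = None

    def ack(self, now: int) -> None:
        # Acknowledging only silences the incident for a while.
        if self.state is not State.RESOLVED:
            self.state = State.ACKED
            self.acked_at = now

    def resolve(self) -> None:
        # Resolving closes the incident for good.
        self.state = State.RESOLVED
        self.acked_at = None

    def tick(self, now: int, ack_timeout: int) -> bool:
        """Advance the clock; return True if this tick pages someone."""
        if self.state is State.ACKED and now - self.acked_at >= ack_timeout:
            # The acknowledgement expired while the incident was still open.
            self.state = State.TRIGGERED
            self.acked_at = None
            return True
        return False


def pages_over_time(incident: Incident, ack_timeout: int,
                    ticks: Iterable[int]) -> List[int]:
    """Return the tick values at which the incident paged."""
    return [t for t in ticks if incident.tick(t, ack_timeout)]


# Acked but never resolved: pages again when the (assumed) timeout runs out.
acked_only = Incident("db1112")
acked_only.ack(now=0)
acked_pages = pages_over_time(acked_only, ack_timeout=10, ticks=range(1, 31))

# Acked and later resolved: no further pages.
resolved = Incident("db1112")
resolved.ack(now=0)
resolved.resolve()
resolved_pages = pages_over_time(resolved, ack_timeout=10, ticks=range(1, 31))

print(acked_pages)     # [10]
print(resolved_pages)  # []
```

In this model, downtiming or disabling notifications on the monitoring side changes nothing about the incident's state, which matches the observation above that only an explicit resolve stopped the repeat page.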
[19:17:36] we just don't care about that yet because there's no next tier [19:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:45] typically icinga will auto-resolve incidents in VO, it's just in this case where we downtimed/disabled the alert rather than having it be resolved [19:18:15] marostegui: well, except a tiny difference that additionally acking them also makes them "handled" and not "unhandled" but doesnt influence paging part, just web UI [19:18:30] also while people are looking, I quickly wrote up https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-25_s3_db_recentchanges_replica [19:20:37] seems good to me as a short summary [19:20:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:51] * mutante looks at Icinga logs for that one one more time [19:21:06] impact is missing that it caused wikireplica lag until fixed, otherwise lgtm [19:21:27] something about affecting cloud users [19:22:25] while it's right that Icinga did not seem to have noticed the HOST down, it did this: [19:22:37] Service Ok[2021-10-26 09:36:04] SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;OK;HARD;10;OK slave_sql_lag Replication lag: 0.17 seconds [19:22:41] Service Critical[2021-10-26 09:17:46] SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;CRITICAL;HARD;5;4.998e+04 ge 2 [19:23:03] the interesting part here is that you can see "#page" in the first line but not the second [19:23:14] and that replica lag means paging [19:23:24] "one host being down" isn't supposed to page all by itself [19:23:29] but this is [19:23:54] so that's a little different from "why did Icinga not page for host down". 
that's as configured [19:24:19] right, the replica lag check should have fired when the host went down [19:24:20] and of course "why does OK page us but CRIT does not" [19:24:25] this is the question too [19:24:28] let me reword [19:24:30] misconfig? [19:25:15] it did fire, it did not page though [19:25:22] the recovery then did page [19:25:36] that's like it should be the other way around [19:26:04] ah, you know. it's cause it never got to HARD state [19:26:30] it does like 3 checks and if they fail its in SOFT state.. only a couple times later it becomes HARD state and that triggers paging [19:27:02] https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation%2F2021-10-25_s3_db_recentchanges_replica&type=revision&diff=1930407&oldid=1930405 [19:27:15] so what happened: replica check does trigger but only "soft", so no paging yet. then before it gets to repeat the check 3 times with a waiting period of X in between, [19:27:37] huh [19:27:39] the thing recovers and the state change from CRIT to OK also triggers paging. regardless of the previous state [19:27:53] well, it did page as a problem [19:27:59] [19:11:11] PROBLEM - MariaDB Replica SQL: s3 #_page on db1112 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:28:03] this is per https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=db1112 [19:28:52] I did not see a line that has it all, #page and HARD and CRIT [19:29:09] but the recovery line does [19:29:11] that line is just out of the IRC logs [19:29:35] but didnt you say this also is what happened in reality? 
we got paged only for recovery [19:29:46] no, we got paged for the problem [19:29:50] so it's not a matter of missing log lines [19:29:51] ok [19:29:54] but the page only happened after the host was rebooted [19:30:38] so host went down, fired host down alert (no page), you rebooted the host, it came back up, replica alert fires "could not connect" triggering page [19:30:47] also I think there are two different alerts [19:30:48] but not because the reboot made the "HOST DOWN" check recover, those never page. it was the replication lag check that paged us [19:31:19] 1) MariaDB Replica Lag: s3 #_page (what ended up paging) 2) "MariaDB sustained replica lag on s3" [19:31:30] right [19:32:00] yea, so it just takes some time for the lag to build up from the moment it goes down [19:32:17] so I rephrased it as "db1112 being down did not trigger any alert that paged until the host was brought back up" [19:32:27] and that was simply the time the lag became too large [19:34:55] ACK, (so.. I think now if the replica check gets an actual "slave_sql_state could not connect" that makes it page right away but that only happens for a moment during reboot .. unlike when it tried while the host was crashed and it just timed out..maybe ) [19:35:10] anyways :) [19:37:30] to be honest I think DBs should page on HOST DOWN [19:38:12] active DC that is [19:38:27] !log twentyafterfour@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.6 refs T293947 (duration: 25m 28s) [19:38:32] don't know how difficult that is with our current puppet code [19:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:33] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [19:40:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:37] marostegui: medium difficult but possible.. 
i would say, heh [19:42:49] (03PS1) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T273627) [19:45:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:58] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [19:46:40] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10wiki_willy) Equinix Ticket #1-213247924142 submitted to reseat the power supply. Thanks, Willy [19:47:13] (03PS2) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [19:47:26] (03CR) 10Volans: "Reply inline" [software/homer] - 10https://gerrit.wikimedia.org/r/734648 (owner: 10Volans) [19:48:34] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:49:15] (03PS1) 1020after4: group0 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734725 [19:49:17] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734725 (owner: 1020after4) [19:50:17] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.6 refs T293947 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734725 (owner: 1020after4) [19:50:36] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:51:26] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.6 refs T293947 [19:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:33] T293947: 1.38.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T293947 [19:54:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:55:23] jouncebot: nowandnext [19:55:23] For the next 1 hour(s) and 4 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T1900) [19:55:23] In 3 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T2300) [19:55:35] oh, new feature? [19:55:43] Kinda :) [19:55:48] I got tired of doing it seperately [19:55:48] :) [19:56:12] (03PS1) 10Reedy: ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734681 (https://phabricator.wikimedia.org/T293783) [19:56:35] (03PS1) 10Reedy: ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734682 (https://phabricator.wikimedia.org/T293783) [19:56:38] jouncebot should respond to "what's up?" [19:56:56] so should icinga-wm [19:57:05] +1 [19:57:11] and if the topic has "Status: Up" it needs to respond with "Wikipedia" [19:57:18] heheh [19:57:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:57:45] "the sky" "birds" "satellites" [19:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:58] jouncebot: where do you live? (gives me git pull link) [19:59:04] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:00:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:37] mutante: https://github.com/wikimedia/wikimedia-bots-jouncebot [20:04:54] urbanecm: heh, thanks [20:05:05] realized it might not be an actual question seconds later :D [20:05:16] kind of was [20:05:27] but normally I would try to search wikitech [20:05:47] yeah, you should find https://wikitech.wikimedia.org/wiki/Tool:Jouncebot there [20:08:19] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Volans) From the logs the cookbook was not able to find a reboot after the host was up in the Debian Installer environment. That usually means that the host got s... [20:09:19] "Problems encountered installing commit-msg hook [20:09:22] :p [20:15:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:34] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/31914/" [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [20:18:13] (03CR) 10Dzahn: "noop on thumbor1001 in prod" [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [20:20:37] legoktm: added "(we get paged for replication lag but not for host down, Marostegui said for DB hosts we should start paging on HOST down which we normally don't do. This would require a puppet change." I dunno, it's long now but that is the actionable :) [20:21:05] :thumbsup: [20:23:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:26] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:35:12] (03CR) 10Reedy: [C: 03+2] ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734681 (https://phabricator.wikimedia.org/T293783) (owner: 10Reedy) [20:35:14] (03CR) 10Reedy: [C: 03+2] ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734682 (https://phabricator.wikimedia.org/T293783) (owner: 10Reedy) [20:36:32] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:43:29] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734735 [20:47:53] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734737 [20:51:17] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734738 [20:53:53] (03CR) 10Ahmon Dancy: "Thanks Dzahn!" [puppet] - 10https://gerrit.wikimedia.org/r/734712 (https://phabricator.wikimedia.org/T294148) (owner: 10Ahmon Dancy) [20:54:35] (03Merged) 10jenkins-bot: ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.6) - 10https://gerrit.wikimedia.org/r/734681 (https://phabricator.wikimedia.org/T293783) (owner: 10Reedy) [20:54:41] (03Merged) 10jenkins-bot: ApiQueryImageInfo: don't show empty comments as deleted [core] (wmf/1.38.0-wmf.5) - 10https://gerrit.wikimedia.org/r/734682 (https://phabricator.wikimedia.org/T293783) (owner: 10Reedy) [21:00:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:00:35] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.5/includes/api/ApiQueryImageInfo.php: T293783 (duration: 01m 03s) [21:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:46] T293783: ImageInfo iiprop=comment query returns empty comment as hidden - https://phabricator.wikimedia.org/T293783 [21:01:45] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.6/includes/api/ApiQueryImageInfo.php: T293783 (duration: 01m 03s) [21:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:41] (03PS3) 10Dduvall: hiera: Add hostname/certname based lookup to secret hierarchy under labs [puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) [21:03:26] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.6/tests/phpunit/includes/api/query/ApiQueryImageInfoTest.php: T293783 (duration: 01m 02s) [21:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:03:41] testing in production, srs bizness [21:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:38] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.5/tests/phpunit/includes/api/query/ApiQueryImageInfoTest.php: T293783 (duration: 01m 02s) [21:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:20] (03CR) 10Dduvall: "I tested the `%{::trusted.certname}` based lookup on gitlab-runners-puppetmaster-01 and it worked well, so I changed the implementation to" [puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) (owner: 10Dduvall) [21:10:54] (03CR) 10Dzahn: "would it make sense to do "labtest" before "lab"? not sure, just the name seems to imply it." 
[puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) (owner: 10Dduvall) [21:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1021.eqiad.wmnet with OS bullseye [21:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:24] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye [21:13:48] (03CR) 10Andrew Bogott: "I like the idea of using trusted_certname here. I think the project-specific lookup is harmless but unnecessary, since any project-local p" [puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) (owner: 10Dduvall) [21:14:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) thanks, @volans for reminding me that I had to redo the raid configuration with the new controller. [21:15:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[21:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:17] (03CR) 10Dduvall: hiera: Add hostname/certname based lookup to secret hierarchy under labs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/734703 (https://phabricator.wikimedia.org/T294050) (owner: 10Dduvall) [21:23:30] (03PS1) 10AOkoth: gitlab: disable puppet in script and change timer [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) [21:29:54] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1021.eqiad.wmnet with OS bullseye [21:29:58] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1021.eqiad.wmnet with OS bullseye executed with error... [21:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:09] (03PS2) 10AOkoth: gitlab: disable puppet in script and change timer [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) [21:47:40] 10SRE: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10dancy) [21:48:11] 10SRE: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10dancy) Hi @jijiki. I'm pinging you first since you usually take care of Scap debs for us. [21:48:28] 10SRE: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10dancy) [21:49:07] (03CR) 10Dzahn: "out of curiosity, why are you making the change to the timer (and add it here)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [21:49:46] (03CR) 10Dzahn: "nitpick: add a little more detail to commit message why you are doing these things (I know, but others won't)" [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [21:56:06] (03PS3) 10AOkoth: gitlab: disable puppet in script and change timer [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) [22:05:29] (03CR) 10Dzahn: [C: 03+2] "thanks! merging" [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [22:07:22] (03CR) 10Dzahn: "deployed. no change on gitlab1001 as expected. gitlab2001 has changes. you can test again" [puppet] - 10https://gerrit.wikimedia.org/r/734741 (https://phabricator.wikimedia.org/T285867) (owner: 10AOkoth) [22:23:34] 10SRE, 10serviceops: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10dancy) [22:24:32] 10SRE, 10serviceops: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10dancy) I've been informed that @jijiki is unavailable for a while so looking for others. [22:37:44] 10SRE, 10serviceops: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10Legoktm) I can do the repackage for you, but if the goal is to get thumbor to run on buster, that's a much more complicated task AFAIK, I'll comment on the parent task. [22:56:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10Dzahn) a:03ssingh @ssingh This is like a follow-up to T293455 and T294231 so you probably don't have to go through everything again like a n... 
[22:59:05] !log uploaded python-logstash to buster-wikimedia for T294393 [22:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:12] T294393: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 [22:59:52] 10SRE, 10serviceops: Build and publish python-logstash deb for Buster - https://phabricator.wikimedia.org/T294393 (10Legoktm) 05Open→03Resolved a:03Legoktm [22:59:58] dancy: should https://gerrit.wikimedia.org/r/c/operations/puppet/+/734712/ be reverted? [23:00:05] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211026T2300). [23:00:05] No Gerrit patches in the queue for this window AFAICS. [23:00:11] legoktm: Yes please [23:00:34] (03PS1) 10Legoktm: Revert "Thumbor: Choose python-logstash package based on distro" [puppet] - 10https://gerrit.wikimedia.org/r/734747 [23:00:56] oh [23:01:16] I see, you already uploaded the package, wow [23:01:30] (03CR) 10Legoktm: [C: 03+2] Revert "Thumbor: Choose python-logstash package based on distro" [puppet] - 10https://gerrit.wikimedia.org/r/734747 (owner: 10Legoktm) [23:01:57] yeah, it was a pretty simple no change rebuild [23:02:20] nice [23:04:46] no-op in prod, as expected [23:20:32] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1244.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:23:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Production Shell Groups & JupyterHub for echetty - https://phabricator.wikimedia.org/T294229 (10ssingh) >>! In T294229#7460173, @Dzahn wrote: > @ssingh This is like a follow-up to T293455 and T294231 so you probably don't have to go throu... [23:25:02] ^ db2141 - NOT host down, "just a slave", QPS: 1 in dbtree ... 
so not like we have to call DBAs, right [23:30:34] also its in codfw [23:31:39] I did go through the suggested check whether a disk is about to fail. no "critical disks" or "failed disks" https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Replication_lag [23:32:26] also megacli -PDList -aALL | grep "Firmware state" = Firmware state: Online, Spun Up (10 times) [23:34:00] can't be sure per "sadly, disks fail in a very creative way" [23:37:14] speaking of DB monitoring, if we want to start paging for HOST down alerts like was suggested.. then while at it we should also somehow add "but not if it's a slave in codfw" or however we separate very important from just important servers. and then it could be a different level of notification of course [23:37:41] gotta go to an appointment. ouch, already late. afk [23:44:20] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734780 [23:47:59] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734781 [23:51:21] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734782 [23:53:46] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at reading authorization packet, system error: 104 Connection reset by peer https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:54:08] (03CR) 10Legoktm: Rename main cluster to services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/725003 (owner: 10Alexandros Kosiaris) [23:55:02] (03CR) 10Cwhite: [C: 03+1] rsyslog: centralize remote_syslog_tls lookups into single location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/734401 
(https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [23:55:52] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:56:42] (03CR) 10Cwhite: "LGTM, but others with more partman experience should have a look too." [puppet] - 10https://gerrit.wikimedia.org/r/734564 (https://phabricator.wikimedia.org/T294302) (owner: 10Filippo Giunchedi)