[00:05:38] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:24] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:28] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.151 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:10:34] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 8.283e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:11:16] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[02:18:35] SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (Platonides) It's still sending the announcement-only mail, but the cron is working now. :-) :-) The code was already almost working. I simply changed a debugging...
[03:13:26] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 7.738e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:17:04] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 7.734e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[04:48:49] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:50:49] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui) @Bstorm clouddb1019 and clouddb1020 are on this rack. @razzi dbstore1007 is on this rack
[04:52:32] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:53:28] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[04:54:48] SRE, Analytics, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) @Bstorm I assume we are ok with having a glitch on clouddb1017 and 1018? @razzi dbstore1005 is on this rack.
[04:55:11] SRE, Analytics, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[04:57:23] SRE, Analytics, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[05:00:51] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[06:43:08] ACKNOWLEDGEMENT - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service Muehlenhoff Used for Buster migration https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:15] (PS1) QChris: Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175
[06:44:17] (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175 (owner: QChris)
[06:47:44] <_joe_> !log restart wdqs-updater on wdqs1007
[06:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:45] !log start rasdaemon on sretest1001, didn't start after last reboot from a week ago
[06:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:28] (PS2) QChris: Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175
[06:49:47] (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703175 (owner: QChris)
[06:50:16] (PS1) QChris: Import done. Revoke import grants [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703176
[06:50:20] (CR) QChris: [V: +2 C: +2] Import done. Revoke import grants [software/elasticsearch/madvise] (refs/meta/config) - https://gerrit.wikimedia.org/r/703176 (owner: QChris)
[06:50:32] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:35] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/701512 (https://phabricator.wikimedia.org/T244840) (owner: Muehlenhoff)
[07:00:04] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210705T0700)
[07:03:54] <_joe_> uhm I think I need to completely restart blazegraph on wdqs1007
[07:04:27] <_joe_> !log restarting blazegraph, then restarting the updater again
[07:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:02] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 9.114e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:08:04] <_joe_> ok that's good news, it started processing again
[07:17:52] SRE, docker-pkg, serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Joe) Open→Resolved I don't think the changelog-creating part should really be part of this task. What we've done now should be enough for the original intended scope of the task....
[07:29:30] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff) Looking at Ganeti VMs, they broadly fall under three/four categories: SPOF, will need a maintenance window decl...
[07:35:08] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Marostegui)
[07:35:49] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[07:36:21] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[07:38:02] SRE, DBA, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (Marostegui) p:Triage→Medium
[07:55:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on releases2002.codfw.wmnet with reason: bump CPU count
[07:55:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases2002.codfw.wmnet with reason: bump CPU count
[07:55:35] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (ops-monitoring-bot) Icinga downtime set by jmm@cumin2002 for 0:30:00 1 host(s) and their services with reason: bump CPU count ` releases2002.codfw.wmnet `
[07:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on releases1002.eqiad.wmnet with reason: bump CPU count
[08:03:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases1002.eqiad.wmnet with reason: bump CPU count
[08:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:41] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (ops-monitoring-bot) Icinga downtime set by jmm@cumin2002 for 0:30:00 1 host(s) and their services with reason: bump CPU count ` releases1002.eqiad.wmnet `
[08:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:48] SRE: Request for more CPU and RAM for releases1002/2002 - https://phabricator.wikimedia.org/T284772 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @dancy: Ack, I've bumped the CPU count to 8. If there's still performance bottlenecks going forward, please reopen the task ` jmm@cumin2002:~$...
[08:15:21] !log rolling out debmonitor-client 0.3.0
[08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:16] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:45:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:10:22] (PS1) Kosta Harlan: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481)
[09:10:50] (CR) Kosta Harlan: [C: -2] "Do not merge until we've gotten approval in T284481." [mediawiki-config] - https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: Kosta Harlan)
[09:11:54] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:24:30] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff) Looking at Ganeti VMs, they fall under three/four categories: SPOF, will need a maintenance window declared: * otrs1001 * an-tool1008 * an-tool10...
[09:30:38] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff) Looking at Ganeti VMs, they fall under three/four categories: SPOF, will need a maintenance window declared: * an-tool1005 * an-to...
[09:35:32] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[09:36:39] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[09:40:10] SRE, Analytics, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[09:41:44] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[09:45:51] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[09:55:31] (CR) Jbond: [C: +1] "> Patch Set 9:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[10:00:04] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/702883 (owner: Volans)
[10:04:03] (CR) Jbond: [C: +1] "lgtm optional comment" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/702884 (owner: Volans)
[10:05:27] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/702885 (owner: Volans)
[10:10:41] (PS1) Majavah: Read running tools from grid-webservices tool [software/tools-manifest] - https://gerrit.wikimedia.org/r/703189 (https://phabricator.wikimedia.org/T284564)
[10:10:49] (CR) Jbond: [C: +1] "LGTM," [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[10:11:07] (PS2) Jbond: Use IcingaHosts instead of Icinga (generic) [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[10:11:08] (CR) Majavah: [C: -2] "until `webservice` changes are deployed" [software/tools-manifest] - https://gerrit.wikimedia.org/r/703189 (https://phabricator.wikimedia.org/T284564) (owner: Majavah)
[10:12:12] SRE, User-MoritzMuehlenhoff, User-jbond: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (jbond)
[10:12:35] SRE, User-MoritzMuehlenhoff, User-jbond: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (jbond) I think this would potentially be a good candidate to port to the concat module
[10:13:09] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[10:20:23] (CR) H.krishna123: "Thanks Jaime for review. Good point, I will look into running pylint locally as well. Most of the errors seem to be on the SQL query, so I" [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: H.krishna123)
[10:22:50] (CR) H.krishna123: "> Patch Set 2:" (3 comments) [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: H.krishna123)
[10:24:09] SRE, MW-on-K8s, serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (Joe) >>! In T261277#7161333, @Legoktm wrote: > @joe is everything in this ticket now covered by Shellbox? No, this ticket is about adding an Ingress controller i...
[10:27:00] !log disable puppet fleet wide to perform puppetdb change
[10:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:28] (CR) Jbond: [V: +1 C: +2] puppetdb::app: Use seperate user for the read databse [puppet] - https://gerrit.wikimedia.org/r/701936 (https://phabricator.wikimedia.org/T285666) (owner: Jbond)
[10:29:31] !log Optimize ruwiki.logging on s6 eqiad with replication T286102
[10:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:39] T286102: Please optimize logging table in ruwiki - https://phabricator.wikimedia.org/T286102
[10:37:44] !log enable puppet fleet wide to post puppetdb change
[10:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:54] !log upgrading PHP on miscweb*
[10:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:38] !log installing tiff security updates on stretch
[11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:27] !log installing openexr security updates on stretch
[11:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:48] (PS1) Marostegui: db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703204
[11:45:49] (CR) Marostegui: [C: +2] db1122: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703204 (owner: Marostegui)
[12:14:32] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:15:16] (PS1) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[12:48:08] (PS1) Marostegui: mariadb: Move db1125 to m1 [puppet] - https://gerrit.wikimedia.org/r/703207 (https://phabricator.wikimedia.org/T286042)
[12:48:51] (CR) Marostegui: [C: +2] mariadb: Move db1125 to m1 [puppet] - https://gerrit.wikimedia.org/r/703207 (https://phabricator.wikimedia.org/T286042) (owner: Marostegui)
[12:50:21] !log Stop MySQL on db1117:3321 to clone db1125 T286042
[12:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:31] T286042: Move db1124 and db1125 to misc services temporarily - https://phabricator.wikimedia.org/T286042
[12:55:20] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[12:55:33] ^ me
[12:56:16] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[12:56:25] ^ me
[12:59:07] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe)
[13:01:24] SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Joe)
[13:02:34] SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Joe)
[13:20:00] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[13:20:56] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[13:49:01] (PS1) Marostegui: db1141: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/703210
[13:49:58] (CR) Marostegui: [C: +2] db1141: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/703210 (owner: Marostegui)
[13:53:05] SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) >>! In T280497#7194633, @wkandek wrote: > Is this the dashboard? https://grafana.wikimedia.org/d/U7JT--knk/joe-k8s-mwdebug?viewPanel=70&orgId=1&from=1625227688488&to=1625246654342...
[13:54:17] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) a:Joe **Architecture:** Contour works decoupling the management layer (contour itself) from the proxying one (using envoy): the first is deployed as a deployment in kubernetes, and needs t...
[14:01:48] !log uploaded nginx 1.13.9-1+wmf3 for stretch-wikimedia
[14:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:33] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) **Docker images needed**: We get to pick two directions: * If we use the helm chart, we would need a recent version of envoy (easy) and one image for contour. * If we want to use the operator...
[14:15:31] SRE, MW-on-K8s, serviceops: Evaluate contour as an ingress - https://phabricator.wikimedia.org/T286196 (Joe) **Deployment** There are two options, similar to istio: * use contour-operator, which also allows to control the runtime status of the cluster (for instance allowing zero-downtime envoy upgrad...
[14:17:25] SRE, Infrastructure-Foundations: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (MoritzMuehlenhoff)
[14:18:56] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:25] (PS3) Muehlenhoff: Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456)
[14:22:27] (PS1) Muehlenhoff: Add separate role for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/703213 (https://phabricator.wikimedia.org/T286206)
[14:22:44] (PS2) Muehlenhoff: Add separate role for Ganeti test cluster [puppet] - https://gerrit.wikimedia.org/r/703213 (https://phabricator.wikimedia.org/T286206)
[14:22:55] (CR) jerkins-bot: [V: -1] Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456) (owner: Muehlenhoff)
[14:38:46] SRE, vm-requests: eqiad/codfw: 1 of VMs requested for MX - https://phabricator.wikimedia.org/T286208 (MoritzMuehlenhoff) p:Triage→Medium
[14:45:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:46] (PS2) Martaannaj: Add config for updated PropertySuggester beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/703205 (https://phabricator.wikimedia.org/T285098)
[15:09:12] SRE, DBA, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (wkandek)
[15:24:46] <_joe_> !log leaving wdqs1007 depooled so that the updater can recover faster, now at 16.5 hours of lag
[15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:34] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:12] SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (Legoktm) a:Legoktm
[17:23:04] SRE, MW-on-K8s, serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (bd808) @aborrero may be able to provide some information from his past work to setup ingress-nginx for Toolforge.
[17:29:29] (PS1) Legoktm: nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212)
[17:37:02] (PS2) Legoktm: nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212)
[17:37:10] (CR) Legoktm: [V: +2 C: +2] nodejs10-devel/stretch: Pin apt so nodejs is installed from nodesource [docker-images/production-images] - https://gerrit.wikimedia.org/r/703222 (https://phabricator.wikimedia.org/T286212) (owner: Legoktm)
[17:40:42] !log published fixed docker-registry.discovery.wmnet/nodejs10-devel:0.0.4 image (T286212)
[17:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:52] T286212: docker-registry.wikimedia.org/nodejs10-devel container after 0.0.3 does not include `npm` - https://phabricator.wikimedia.org/T286212
[17:55:27] SRE, Wikimedia-Mailing-lists, Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (Legoktm)
[17:56:22] SRE, Wikimedia-Mailing-lists: Redirect https://lists.wikimedia.org/pipermail/foobar/ to https://lists.wikimedia.org/hyperkitty/list/foobar@lists.wikimedia.org/ - https://phabricator.wikimedia.org/T285949 (Legoktm) Open→Resolved
[18:00:04] SRE, Wikimedia-Mailing-lists, Upstream: lists-next: bad name in “welcome” email - https://phabricator.wikimedia.org/T278433 (Legoktm) Open→Resolved a:Legoktm This was fixed when we upgraded Postorius.
[18:02:27] SRE, Wikimedia-Mailing-lists: lists-next: no clickable link in “confirm” email - https://phabricator.wikimedia.org/T278432 (Legoktm) Open→Resolved a:Legoktm I fixed this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/683555/
[18:04:30] SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Legoktm)
[18:06:08] SRE, Wikimedia-Mailing-lists, Upstream: In Mailman3, users cannot change their display name from the web - https://phabricator.wikimedia.org/T283128 (Legoktm)
[18:20:04] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:20:58] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:21:19] (PS1) Legoktm: mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218)
[22:34:55] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:40:14] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:29:49] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (Legoktm) I couldn't figure out how to spoof bounces either, so I subscribed `doesnotexist@wikimedia.org` and lowered the bounce threshold on test4, and manipulated the dates in...
[23:39:54] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Enable verp probes in mailman3 - https://phabricator.wikimedia.org/T285361 (Legoktm) >>! In T285361#7198288, @Legoktm wrote: > But per https://polymorphic.lists.wmcloud.org/postorius/lists/test4.polymorphic.lists.wmcloud.org/members/member/ it doesn't l...
[23:40:48] (CR) Legoktm: [C: +1] "Appears to mostly work in VPS. If what I wrote on T285361#7198289 sounds good to you then I think we're ready to enable this." [puppet] - https://gerrit.wikimedia.org/r/701658 (https://phabricator.wikimedia.org/T285361) (owner: Ladsgroup)
[23:47:59] (PS2) Legoktm: mailman3: Discard all mails with a X-Spam-Score >= 6 [puppet] - https://gerrit.wikimedia.org/r/703252 (https://phabricator.wikimedia.org/T286218)