| author | Craig Jennings <c@cjennings.net> | 2026-01-31 16:23:00 -0600 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2026-01-31 16:23:00 -0600 |
| commit | cad8146f1bfe6224ad476f33e3087b2e2074c717 (patch) | |
| tree | 9264d3294d96a380f8a5ec4e852565d78d4fbf2c /docs | |
| parent | 8b2a1ffce5cbd3c2be2498a7a86e02469787e68b (diff) | |
docs: add new workflows and AMD GPU workaround
- Add email workflow (msmtp direct sending)
- Add assemble-email workflow (document gathering for manual send)
- Add retrospective workflow
- Add AMD GPU suspend workaround notes
Diffstat (limited to 'docs')
| -rw-r--r-- | docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org | 217 | ||||
| -rw-r--r-- | docs/workflows/assemble-email.org | 181 | ||||
| -rw-r--r-- | docs/workflows/email.org | 198 | ||||
| -rw-r--r-- | docs/workflows/retrospective-workflow.org | 90 |
4 files changed, 686 insertions, 0 deletions
diff --git a/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org new file mode 100644 index 0000000..46e403d --- /dev/null +++ b/docs/2026-01-27-ratio-amd-gpu-suspend-workaround.org @@ -0,0 +1,217 @@ +#+TITLE: Ratio AMD GPU Suspend Freeze - Workaround & Fix Tracking +#+DATE: 2026-01-27 + +* Summary + +Ratio (Framework Desktop, AMD Ryzen AI Max / Strix Halo) freezes hard on +resume from suspend due to a VPE power gating race condition in the amdgpu +driver. The freeze requires a hard power cycle, which causes journal +corruption and can leave the btrfs filesystem read-only. + +As of 2026-01-27, the proper kernel fix exists (merged in 6.18) but is +unusable due to separate CWSR bugs in 6.18+. Ratio runs kernel 6.12 LTS, +which does not have the fix and will not receive a backport. + +A systemd suspend mask is applied as a workaround to prevent the system from +ever entering the suspend/resume path. + +* The Bug + +** What Happens + +~8% of suspend/resume cycles on Strix Halo result in a hard system freeze +approximately 1 second after the screen turns on during resume. + +** Root Cause: VPE Power Gating Race Condition + +The freeze is caused by a race condition in the amdgpu driver's VPE (Video +Processing Engine) power management during resume: + +1. System resumes from suspend. +2. amdgpu schedules =amdgpu_device_delayed_init_work_handler= (2s delay) to + run self-tests, including =vpe_ring_test_ib= which briefly powers on VPE. +3. The ring buffer test is very short. VPE goes idle. +4. After 1 second of idle, =vpe_idle_work_handler= fires and tells the SMU + (System Management Unit) to power gate (shut down) VPE. +5. *But VPE is still at a high DPM level.* Newer VPE firmware only drops DPM + back to the lowest level (DPM0) after a workload has run for 2+ seconds. + The ring buffer test was too short to trigger that drop. +6. The SMU tries to power gate VPE while it's at a high DPM level. 
On Strix + Halo, this hangs the SMU. +7. The SMU hang cascades -- VCN, JPEG, and other GPU IPs can't be managed. + Half the GPU is frozen. +8. The thread that issued the SMU command is stuck. System is locked up. + No further logging is possible. + +It only triggers on resume because that's when the driver runs the ring +buffer self-test. During normal operation, VPE either isn't used or has had +enough time to settle its DPM level before power gating. + +** Error Messages (if visible before freeze) + +#+begin_example +SMU: I'm not done with your previous command +Failed to power gate VPE! +Dpm disable vpe failed, ret = -62 +Failed to power gate JPEG +Failed to power gate VCN instance 0 +Dpm disable uvd failed +#+end_example + +** References + +- [[https://lkml.org/lkml/2025/8/24/139][Original VPE_IDLE_TIMEOUT patch (LKML, Aug 2025)]] +- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130657.html][VPE DPM0 fix v5 (amd-gfx, Oct 2025)]] +- [[https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg130804.html][Follow-up: missing return statement fix]] +- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop bug #4615]] +- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community: Critical 6.18/6.19 CWSR bugs]] + +* Kernel Fix Status + +** The Proper Fix + +Mario Limonciello (AMD) wrote =drm/amd: Check that VPE has reached DPM0 in +idle handler= -- makes the idle handler check that VPE has actually reached +DPM0 before attempting the power gate. Targets VPE 6.1.1 (Strix Halo) with +firmware versions below =0x0a640500=. + +Merged into Linux 6.18 during the RC phase (drm-fixes-6.18, Oct 29, 2025). +Closes freedesktop bug #4615. + +** Why We Can't Use 6.18 + +Kernel 6.18.x and 6.19.x have critical CWSR (Compute Wavefront Save/Restore) +bugs that cause hard GPU hangs on RDNA3/RDNA4 during compute workloads. 
The +Framework Community recommends staying on 6.15-6.17 for Strix Halo until +AMD resolves both VPE and CWSR issues in the same kernel. + +** Backport Status + +The fix was tagged =Cc: stable@vger.kernel.org= for backport but has NOT +appeared in any 6.12 LTS release as of 6.12.67. It likely won't be +backported to 6.12 due to infrastructure differences. + +** When to Check Again + +Monitor these for resolution: +- Arch =linux-lts= package updates (=pacman -Si linux-lts=) +- [[https://cdn.kernel.org/pub/linux/kernel/v6.x/][Kernel.org changelogs]] for 6.12.x stable releases +- [[https://community.frame.work/t/attn-critical-bugs-in-amdgpu-driver-included-with-kernel-6-18-x-6-19-x/79221][Framework Community thread]] for CWSR resolution status +- [[https://gitlab.freedesktop.org/drm/amd/-/issues/4615][Freedesktop #4615]] for any further developments + +* What We Applied (2026-01-27) + +** Workaround: Disable Suspend via systemd + +Prevents the system from entering the suspend/resume path entirely. +The GPU bug is still present but never triggered. + +#+begin_src bash +# Applied 2026-01-27: +sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target +#+end_src + +Effects: +- hypridle can no longer suspend the system +- Screen stays on at idle (active power draw) +- No more freeze → hard reboot → filesystem corruption cycle + +** Kernel Parameters NOT Applied + +The following parameters were identified as fixes but caused boot failures +on ratio when previously attempted (twice): + +#+begin_example +amdgpu.pg_mask=0 # Disables all GPU power gating +amdgpu.cwsr_enable=0 # Disables Compute Wavefront Save/Restore +#+end_example + +It is unclear whether the boot failures were caused by the parameters +themselves or by a corrupted initramfs from running mkinitcpio while the +GPU was in a bad state. Testing via the GRUB =e= key (temporary, no +permanent change) is planned but deferred. 
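Before retesting, it can help to confirm whether either parameter is actually present on the running kernel's command line. The following is a minimal sketch, not part of the commit: the helper names `parse_cmdline`, `active_workaround_params`, and `read_cmdline` are hypothetical, and it assumes a Linux system where the booted command line is exposed at `/proc/cmdline`.

```python
import shlex
from typing import Dict, Optional

# Parameters discussed above; adjust the tuple if other options are tested.
WATCHED = ("amdgpu.pg_mask", "amdgpu.cwsr_enable")


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in shlex.split(cmdline):
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workaround_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Return only the watched amdgpu parameters that appear in cmdline."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WATCHED if name in params}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling `active_workaround_params(read_cmdline())` on the machine returns an empty dict when neither parameter is set, which is the expected result for the command line recorded in the next section.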
+ +** Current Kernel Command Line (for reference) + +#+begin_example +BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=5b9f7f7f-2477-488f-8fb1-52b5c7d90e98 +rw rootflags=subvol=@ console=tty0 console=ttyS0,115200 rw loglevel=2 +rd.systemd.show_status=auto rd.udev.log_level=2 nvme.noacpi=1 +mem_sleep_default=deep nowatchdog random.trust_cpu=off quiet splash +#+end_example + +* How to Undo When a Fixed Kernel Arrives + +** Step 1: Verify the Fix is in the New Kernel + +Check that the VPE DPM0 fix is present: + +#+begin_src bash +# Check kernel version +uname -r + +# Search for the fix in the changelog +# Look for "VPE" or "DPM0" or "vpe_idle" in the relevant changelog: +# https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-<version> + +# Or check the source directly: +grep -r "vpe_need_dpm0_at_power_down\|vpe_get_dpm_level" /usr/src/linux/drivers/gpu/drm/amd/ 2>/dev/null +#+end_src + +Also verify that CWSR bugs are resolved (check Framework Community thread). + +** Step 2: Unmask Suspend Targets + +#+begin_src bash +sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target +#+end_src + +** Step 3: Test Suspend/Resume + +#+begin_src bash +# Test a single suspend/resume cycle +sudo systemctl suspend + +# If system resumes cleanly, test a few more times +# The original bug had ~8% failure rate, so test at least 20 cycles +#+end_src + +** Step 4: If Kernel Parameters Were Applied + +If =amdgpu.pg_mask=0= and =amdgpu.cwsr_enable=0= were added to GRUB, remove +them once the kernel fix is confirmed working: + +#+begin_src bash +# Edit GRUB config +sudo vim /etc/default/grub +# Remove amdgpu.pg_mask=0 and amdgpu.cwsr_enable=0 from GRUB_CMDLINE_LINUX_DEFAULT + +# Rebuild GRUB config +sudo grub-mkconfig -o /boot/grub/grub.cfg + +# Reboot and test suspend +#+end_src + +* Log Evidence (2026-01-27 Investigation) + +** System Info + +- Machine: Framework Desktop (AMD Ryzen AI Max 300 Series) +- Hostname: ratio +- Kernel: 6.12.67-1-lts +- Filesystem: 
btrfs RAID1 on 2x NVMe (nvme0n1p2 + nvme1n1p2) +- GPU: AMD Strix Halo (RDNA 3.5) + +** Findings + +- 13 boots between Jan 25-27, most ending in suspend then hard freeze +- Journal corruption on boots -5, -3, and -7 (unclean shutdown) +- =mc= (Midnight Commander) stuck in D state (uninterruptible I/O) during + failed freeze attempts, in =io_schedule → folio_wait_bit_common → + filemap_read= path +- Suspend freeze pattern: =PM: suspend entry (deep)= → =PM: suspend exit= → + =PM: suspend entry (s2idle)= → no more logs → hard reboot required +- =mu= database corruption (error 121) from repeated unclean shutdowns +- btrfs device stats: zero errors on both NVMe drives +- No explicit BTRFS read-only event logged (freeze kills logging before it + can be recorded) diff --git a/docs/workflows/assemble-email.org b/docs/workflows/assemble-email.org new file mode 100644 index 0000000..bae647f --- /dev/null +++ b/docs/workflows/assemble-email.org @@ -0,0 +1,181 @@ +#+TITLE: Email Assembly Workflow +#+AUTHOR: Craig Jennings & Claude +#+DATE: 2026-01-29 + +* Overview + +This workflow assembles documents for an email that will be sent via Craig's email client (Proton Mail). It creates a temporary workspace, gathers relevant documents, drafts the email, and cleans up after sending. + +Use this workflow when Craig needs to send an email with multiple attachments that require gathering from various locations in the project. + +* When to Use This Workflow + +When Craig says: +- "assemble an email" or "email assembly workflow" +- "gather documents for an email" +- "I need to send [person] some documents" + +* The Workflow + +** Step 1: Create Temporary Workspace + +Create a temporary folder at the project root: + +#+begin_src bash +mkdir -p ./tmp +#+end_src + +This folder will hold: +- Copies of all attachments +- The draft email text + +** Step 2: Identify Required Documents + +Discuss with Craig what documents are needed. 
Common categories: +- Legal documents (deeds, certificates, agreements) +- Financial documents (statements, invoices) +- Correspondence (prior emails, letters) +- Identity documents (death certificates, ID copies) + +For each document: +1. Locate it in the project +2. Confirm with Craig it's the right one +3. Open it in zathura for Craig to verify if needed + +** Step 3: Copy Documents to Workspace + +**IMPORTANT: Always COPY, never MOVE documents.** + +#+begin_src bash +cp /path/to/original/document.pdf ./tmp/ +#+end_src + +After copying, list the workspace contents to confirm: + +#+begin_src bash +ls -lh ./tmp/ +#+end_src + +** Step 4: Draft the Email + +Create a draft email file in the workspace: + +#+begin_src bash +./tmp/email-draft.txt +#+end_src + +Include: +- To: (recipient email) +- Subject: (clear, descriptive subject line) +- Body: (context, list of attachments, contact info) + +The body should: +- Provide context for why documents are being sent +- List all attachments with brief descriptions +- Include Craig's contact information + +** Step 5: Open Draft in Emacs + +Open the draft for Craig to review and edit: + +#+begin_src bash +emacsclient -n ./tmp/email-draft.txt +#+end_src + +Wait for Craig to finish editing before proceeding. + +** Step 6: Craig Sends Email + +Craig will: +1. Open his email client (Proton Mail) +2. Create a new email using the draft text +3. Attach documents from the tmp folder +4. Send the email + +** Step 7: Process Sent Email + +Once Craig confirms the email was sent: + +1. Craig saves the sent email to the inbox +2. Use the extraction script to process it: + +#+begin_src bash +python3 docs/scripts/extract_attachments.py "./inbox/[email-file].eml" +#+end_src + +3. Read the extracted content to verify +4. Rename and refile the email appropriately: + +#+begin_src bash +mv "./inbox/[email-file].eml" ./[appropriate-folder]/YYYY-MM-DD-email-to-[recipient]-[topic].eml +#+end_src + +5. 
Delete any duplicate extracted attachments from inbox + +** Step 8: Clean Up Workspace + +Delete the temporary folder: + +#+begin_src bash +rm -rf ./tmp/ +#+end_src + +* Best Practices + +** Document Verification + +Before copying documents: +- Open each one in zathura for Craig to verify +- Confirm it's the correct version +- Check that sensitive information is appropriate to send + +** Email Draft Structure + +A good email draft includes: + +#+begin_example +To: recipient@example.com +Subject: [Clear Topic] - [Property/Case Reference] + +Hi [Name], + +[Opening - context for why you're sending this] + +[Middle - explanation of what's attached and why] + +Attached are the following documents: + +1. [Document name] - [brief description] +2. [Document name] - [brief description] +3. [Document name] - [brief description] + +[Closing - next steps, request for confirmation, offer to provide more] + +Thank you, + +Craig Jennings +510-316-9357 +c@cjennings.net +#+end_example + +** Filing Conventions + +When refiling sent emails: +- Use format: YYYY-MM-DD-email-to-[recipient]-[topic].eml +- File in the most relevant project folder. +- Remove duplicate attachments extracted to inbox + +* Example Usage + +Craig: "I need to send Seabreeze the documents for the HOA refund" + +Claude: +1. Creates ./tmp/ folder +2. Discusses needed documents (death certificate, closing docs, purchase agreement) +3. Locates and opens each document for verification +4. Copies verified documents to ./tmp/ +5. Drafts email and opens in emacsclient +6. Craig edits, then sends via Proton Mail +7. Craig saves sent email to inbox +8. Claude extracts, reads, renames, and refiles email +9. 
Claude deletes ./tmp/ folder diff --git a/docs/workflows/email.org b/docs/workflows/email.org new file mode 100644 index 0000000..cfd7adf --- /dev/null +++ b/docs/workflows/email.org @@ -0,0 +1,198 @@ +#+TITLE: Email Workflow +#+AUTHOR: Craig Jennings & Claude +#+DATE: 2026-01-26 + +* Overview + +This workflow sends emails with optional attachments via msmtp using the cmail account (c@cjennings.net via Proton Bridge). + +* When to Use This Workflow + +When Craig says: +- "email workflow" or "send an email" +- "email [person] about [topic]" +- "send [file] to [person]" + +* Required Information + +Before sending, gather and confirm: + +1. **To:** (required) - recipient email address(es) +2. **CC:** (optional) - carbon copy recipients +3. **BCC:** (optional) - blind carbon copy recipients +4. **Subject:** (required) - email subject line +5. **Body:** (required) - email body text +6. **Attachments:** (optional) - file path(s) to attach + +* The Workflow + +** Step 1: Gather Missing Information + +If any required fields are missing, prompt Craig: + +#+begin_example +To send this email, I need: +- To: [who should receive this?] +- Subject: [what's the subject line?] +- Body: [what should the email say?] +- Attachments: [any files to attach?] +- CC/BCC: [anyone to copy?] +#+end_example + +** Step 2: Validate Email Addresses + +Look up all recipient names/emails in the contacts file: + +#+begin_src bash +grep -i "[name or email]" ~/sync/org/contacts.org +#+end_src + +**Note:** If contacts.org is empty, check for sync-conflict files: +#+begin_src bash +ls ~/sync/org/contacts*.org +#+end_src + +For each recipient: +1. Search contacts by name or email +2. Confirm the email address matches +3. If name not found, ask Craig to confirm the email is correct +4. 
If multiple emails for a contact, ask which one to use + +** Step 3: Confirm Before Sending + +Display the complete email for review: + +#+begin_example +Ready to send: + +From: c@cjennings.net +To: [validated email(s)] +CC: [if any] +BCC: [if any] +Subject: [subject] + +[body text] + +Attachments: [list files if any] + +Send this email? [Y/n] +#+end_example + +** Step 4: Send the Email + +Use Python to construct MIME message and pipe to msmtp: + +#+begin_src python +python3 << 'EOF' | msmtp -a cmail [recipient] +import sys +from email.mime.multipart import MIMEMultipart +from email.mime.text import MIMEText +from email.mime.application import MIMEApplication +from email.utils import formatdate +import os + +msg = MIMEMultipart() +msg['From'] = 'c@cjennings.net' +msg['To'] = '[to_address]' +# msg['Cc'] = '[cc_address]' # if applicable +# msg['Bcc'] = '[bcc_address]' # if applicable +msg['Subject'] = '[subject]' +msg['Date'] = formatdate(localtime=True) + +body = """[body text]""" +msg.attach(MIMEText(body, 'plain')) + +# For each attachment: +# pdf_path = '/path/to/file.pdf' +# with open(pdf_path, 'rb') as f: +# attachment = MIMEApplication(f.read(), _subtype='pdf') +# attachment.add_header('Content-Disposition', 'attachment', filename='filename.pdf') +# msg.attach(attachment) + +print(msg.as_string()) +EOF +#+end_src + +**Important:** When there are CC or BCC recipients, pass ALL recipients to msmtp: +#+begin_src bash +python3 << 'EOF' | msmtp -a cmail to@example.com cc@example.com bcc@example.com +#+end_src + +** Step 5: Verify Delivery + +Check the msmtp log for confirmation: + +#+begin_src bash +tail -3 ~/.msmtp.cmail.log +#+end_src + +Look for: ~smtpstatus=250~ and ~exitcode=EX_OK~ + +** Step 6: Sync to Sent Folder (Optional) + +If Craig wants the email in his Sent folder: + +#+begin_src bash +mbsync cmail +#+end_src + +* msmtp Configuration + +The cmail account should be configured in ~/.msmtprc: + +#+begin_example +account cmail +tls_certcheck off +auth on 
+host 127.0.0.1 +port 1025 +protocol smtp +from c@cjennings.net +user c@cjennings.net +passwordeval "cat ~/.config/.cmailpass" +tls on +tls_starttls on +logfile ~/.msmtp.cmail.log +#+end_example + +**Note:** ~tls_certcheck off~ is used because Proton Bridge uses self-signed certificates on localhost. + +* Attachment Handling + +** Supported Types + +Common MIME subtypes: +- PDF: ~_subtype='pdf'~ +- Images: ~_subtype='png'~, ~_subtype='jpeg'~ +- Text: ~_subtype='plain'~ +- Generic: ~_subtype='octet-stream'~ + +** Multiple Attachments + +Add multiple attachment blocks before ~print(msg.as_string())~ + +* Troubleshooting + +** Password File Missing +Ensure ~/.config/.cmailpass exists with the Proton Bridge SMTP password. + +** TLS Certificate Errors +Use ~tls_certcheck off~ in msmtprc for Proton Bridge (localhost only). + +** Proton Bridge Not Running +Start Proton Bridge before sending. Check if port 1025 is listening: +#+begin_src bash +ss -tlnp | grep 1025 +#+end_src + +* Example Usage + +Craig: "email workflow - send the November 3rd SOV to Christine" + +Claude: +1. Searches contacts for "Christine" -> finds cciarmello@gmail.com +2. Asks for subject and body if not provided +3. Locates the SOV file in assets/ +4. Shows confirmation +5. Sends via msmtp +6. Verifies delivery in log diff --git a/docs/workflows/retrospective-workflow.org b/docs/workflows/retrospective-workflow.org new file mode 100644 index 0000000..440c14e --- /dev/null +++ b/docs/workflows/retrospective-workflow.org @@ -0,0 +1,90 @@ +#+TITLE: Retrospective Workflow +#+DESCRIPTION: How to run a retrospective after major problem-solving sessions + +* When to Run a Retrospective + +Run after: +- Major debugging/troubleshooting sessions +- Complex multi-step implementations +- Any session where significant friction occurred +- Sessions lasting more than an hour with trial-and-error + +* The Process + +** 1. 
Trigger the Retrospective + +Either party can say: "Let's do a retrospective" or "Retrospective time" + +** 2. Answer These Questions (Both Parties) + +*** What went well? +Identify patterns worth reinforcing. Be specific. + +*** What didn't go well? +Identify friction points, mistakes, wasted time. No blame, just facts. + +*** What behavioral changes should we make? +Focus on *how we work*, not technical facts. +- Good: "Confirm before rebooting" +- Not behavioral: "AMD needs firmware 20260110" + +*** What would we do differently next time? +Specific scenarios and better approaches. + +*** Any new principles to add? +Distill lessons into short, actionable principles for retrospectives/PRINCIPLES.org. + +** 3. Copy and Update retrospectives/PRINCIPLES.org + +Copy the template retrospectives/PRINCIPLES.org. + +Using the copied template, add new behavioral principles learned. Keep them: +- Short and actionable +- Focused on behavior, not facts +- Easy to remember and apply + +** 4. Create Retrospective Record + +Save to =docs/retrospectives/YYYY-MM-DD-topic.org= with: +- Summary of what happened +- Answers to the questions above +- Link to the detailed session doc, if one exists + +** 5. Commit Changes + +Commit PRINCIPLES.org updates and retrospective record. 
+ +* PRINCIPLES.org Structure + +#+BEGIN_SRC org +,* How We Work Together +,** Principle Name +- Bullet points explaining the principle +- When it applies +- Why it matters + +,* Checklists +,** Checklist Name +- [ ] Step 1 +- [ ] Step 2 +#+END_SRC + +* Integration with Session Startup + +Add to project's protocols.org or session startup: +- Check if PRINCIPLES.org was updated since last session +- Review any new principles before starting work + +* Example Principles (Starters) + +** Sync Before Action +- Confirm before destructive or irreversible actions +- State what you're about to do and wait for go-ahead + +** Verify Assumptions +- When something "should work" but doesn't, question the assumption +- Test one variable at a time + +** Clean Up After Yourself +- Reset temporary changes before finishing +- Verify system is in expected state |
