<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>Yoichiro Hasebe</title>
  <link>https://yohasebe.com/</link>
  <atom:link href="https://yohasebe.com/feed.xml" rel="self" type="application/rss+xml" />
  <description>Posts by Yoichiro Hasebe</description>
  <item>
  <title>Music Analyst: Measuring and Evaluating a Recording</title>
  <link>https://yohasebe.com/posts/2026-06-13-music-analyst-guitar-solo/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-06-13-music-analyst-guitar-solo/</guid>
  <pubDate>Sat, 13 Jun 2026 00:00:00 +0000</pubDate>
  <description>A look at Music Analyst, a Monadic Chat app that runs two tools over a recording: a quantitative analysis (librosa + madmom) for measured</description>
  <content:encoded><![CDATA[<p>I recently added an app to <a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a> called <strong>Music Analyst</strong>. It runs two tools over an uploaded recording and keeps their results apart: a quantitative analysis and a qualitative evaluation.</p>

<ul>
  <li><strong>Quantitative analysis</strong> (<code>analyze_audio_features</code>): a Python pipeline using <a href="https://librosa.org/">librosa</a> and <a href="https://github.com/CPJKU/madmom">madmom</a> measures duration, tempo, key, time signature, chord progression, and sections. These are numbers, computed from the signal.</li>
  <li><strong>Qualitative evaluation</strong> (<code>critique_audio</code>): the same audio goes to Gemini's audio-understanding endpoint for a critique covering character and mood, genre, instrumentation, and performance qualities such as expression, phrasing, dynamics, and timing.</li>
</ul>

<p>The qualitative side rests on a documented capability: Gemini's <a href="https://ai.google.dev/gemini-api/docs/audio">audio documentation</a> lists "Detect emotion in speech and music" among its use cases. The recording I use to show it is my own playing, a solo from <a href="../2026-03-15-cover-session-spring/index.html">a cover session</a>, included so the feature comes with audio you can actually hear. Please excuse the rough playing.</p>

<h3 id="how-it-fits-together">How it fits together</h3>

<div style="max-width:400px;margin:0 auto">
<img src="mermaid-1.svg" alt="Music Analyst architecture: an orchestration agent dispatching to the quantitative and qualitative tools, then assembling the reply" style="width:100%" />
</div>

<p>A chat model, the orchestration agent, receives the upload and, for a normal analysis, runs both tools. The two are different in kind: <code>analyze_audio_features</code> is a deterministic Python step, not a language model, while <code>critique_audio</code> is a separate Gemini call that receives only the audio. It does not see the measured numbers, so its impressions come from the sound rather than from the quantitative output. Both results return to the orchestration agent, which assembles the reply.</p>

<p>The qualitative prompt is kept within what the model can actually support: it does not judge audio fidelity, mix, or stereo image (per the same documentation, Gemini folds multi-channel audio down to a single channel and downsamples it), it leaves exact tempo and key to the quantitative tool, and it names only instruments it can hear, treating genre as a tentative guess rather than a firm label.</p>

<h3 id="trying-it-on-a-recording">Trying it on a recording</h3>

<p>I ran it with almost no context attached. No song title, no artist, no genre:</p>

<blockquote>
  <p>Please analyze solo-sample.mp3. It is a short electric-guitar solo I recorded in a session with friends. Run both of your analyses on it: first the quantitative, measured features (tempo, key, time signature, chord progression), then a qualitative evaluation of the performance, focusing on the lead guitar's expression, phrasing, and timing/groove. Please identify the genre/style and instrumentation you hear as part of the evaluation, and present the quantitative features and the qualitative evaluation as clearly separated sections.</p>
</blockquote>

<p>The recording:</p>

<audio controls="" src="solo-sample.mp3" style="width:100%"></audio>

<h3 id="quantitative-features">Quantitative features</h3>

<ul>
  <li><strong>Duration:</strong> 0:36</li>
  <li><strong>Tempo:</strong> 73.8 BPM</li>
  <li><strong>Key:</strong> E minor</li>
  <li><strong>Time signature:</strong> 4/4</li>
  <li><strong>Chord progression:</strong> centered on E minor (Em, C, G, Am, with a passing D# and A)</li>
</ul>

<p>The tempo and key match the recording. Because this is signal processing rather than judgement, the same file yields the same numbers on every run.</p>

<h3 id="qualitative-evaluation">Qualitative evaluation</h3>

<p>This is Gemini's qualitative evaluation, formed from the audio. Here are excerpts from the report:</p>

<p><strong>Genre, style, and instrumentation</strong></p>

<blockquote>
  <p>The critique hears the track leaning strongly toward a high-energy style of hard rock, perhaps drawing heavily from late-1960s or 70s psychedelic rock, blues-rock, or early heavy metal. The instrumentation appears to be a classic power-trio lineup: an overdriven, highly sustaining electric lead guitar as the primary voice; a driving bass guitar providing low-end propulsion; a steady, fast-paced drum kit keeping a relentless, straight-ahead rock beat.</p>
</blockquote>

<p><strong>Expression</strong></p>

<blockquote>
  <p>The guitarist's expressive vocabulary is highly physical and deeply vocal. The critique notes a rich use of wide, aggressive vibrato, dramatic pitch bends, and slides that give the guitar a singing, almost human quality. The sharp, dig-in pick attack drives the amplifier into a rich saturation, creating a raw, tactile connection that makes every note feel urgent and deeply felt.</p>
</blockquote>

<p><strong>Phrasing</strong></p>

<blockquote>
  <p>The phrasing operates on a tension-and-release dynamic using familiar blues-scale and pentatonic frameworks. The performance contrasts long, soaring sustained notes with rapid, explosive flurries of ornamentation. It has a spontaneous, improvisational feel, though the critique notes that the phrasing sometimes leans heavily on continuous, busy activity, leaving little room for silence or breath.</p>
</blockquote>

<p><strong>Timing and groove</strong></p>

<blockquote>
  <p>Set against a driving rhythm section, the lead guitar rides on top of the beat with a highly pushy, forward-leaning attitude. While this adds frantic, exciting energy, it also presents a challenge: during the faster, subdivided runs, the guitarist has a tendency to rush, momentarily crowding the pocket and pulling slightly ahead of the drummer's steady groove.</p>
</blockquote>

<h3 id="how-far-to-trust-it">How far to trust it</h3>

<p>That the quantitative analysis is deterministic does not make it correct: it returns the same values for the same input, but it can still misread the signal; chord detection in particular sometimes mistakes the harmony. The qualitative evaluation is shakier: it runs only on Gemini for now, and depending on how the prompt is built it can name an instrument that is not in the recording. In earlier iterations, before I added the prompt constraints described above, the qualitative tool sometimes mentioned a rhythm guitar that was not actually on the track. The quality of the evaluation may not be quite sufficient yet. Still, as better models arrive, having AI evaluate recorded music like this should become a realistic and genuinely useful practice.</p>
]]></content:encoded>
</item>
<item>
  <title>Reading Anthropic on Recursive Self-Improvement</title>
  <link>https://yohasebe.com/posts/2026-06-06-reading-anthropic-rsi/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-06-06-reading-anthropic-rsi/</guid>
  <pubDate>Sat, 06 Jun 2026 00:00:00 +0000</pubDate>
  <description>Notes after reading Anthropic's piece on AI building AI, on why I keep Monadic Chat multi-provider, and on what is left of the human side</description>
  <content:encoded><![CDATA[<p>Anthropic recently posted a piece called <a href="https://www.anthropic.com/institute/recursive-self-improvement">"When AI builds itself"</a> on their institute site. It is about recursive self-improvement: the point at which an AI system can autonomously design and develop its own successor. They are explicit that we are not there yet, and that this is not inevitable. The piece also lays out evidence that the direction is real. As of May 2026 they report that more than 80% of the code merged into Anthropic's codebase is written by Claude, and the length of tasks AI can reliably complete is doubling roughly every four months. Research taste and alignment, they note, are the parts that remain hardest for AI.</p>

<p>On policy, Anthropic say they would support a coordinated international pause under certain conditions, specifically if verification systems existed so frontier developers could confirm others had actually stopped. A unilateral pause by one lab, they argue, would mainly just change who the front-runner is.</p>

<p>I work on <a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a>, a locally hosted AI chat platform. I have kept it multi-provider deliberately. Today it routes to seven hosted providers and one local runtime:</p>

<ul>
  <li>OpenAI (US)</li>
  <li>Google (US)</li>
  <li>xAI (US)</li>
  <li>Anthropic (US)</li>
  <li>Cohere (Canada)</li>
  <li>Mistral (France)</li>
  <li>DeepSeek (China)</li>
  <li>Ollama (local, not tied to any jurisdiction)</li>
</ul>

<p><img src="monadic-chat-providers.png" alt="Monadic Chat's System Settings panel: the Base App dropdown lists the supported providers (OpenAI, Anthropic, xAI, Google, Cohere, Mistral, DeepSeek, Ollama). To the right is an agent flow diagram: User Input goes through System Prompt to gpt-5.4, with Tools (Library Search, File Operations), Message History, and Features (Web Search) panels feeding in and out" /></p>

<p>The reason is less a principle than a practical hedge. Without an international coordination framework in place, regulation, if and when it comes, is likely to take different shapes across providers, jurisdictions, and timelines. If a user has only one route to the AI work they want to do, that work can stall when the rules governing that route change. More routes mean more flexibility as things change.</p>

<p>At the same time, what Anthropic are calling for seems reasonable to me. Paying attention to the trajectory and setting up structures that would let humans intervene later is hardly an overreaction.</p>

<p>Over the last few years I have been watching the layers AI handles in development grow. Snippet suggestions, then function coding, then codebase design, technology choice, execution, testing, debugging. I built <a href="https://github.com/yohasebe/tmux-ccm">ccm for tmux</a> on the working belief that the last thing left on the human side is attention and judgment. The Anthropic piece argues that even that layer is starting to move.</p>

<p>There is not much an individual developer can do at this scale. Most of what I build either reinvents something that already exists or gets superseded by a larger project. I keep building anyway. Honestly, the AI built on top of large language models is becoming a more and more complete black box for ordinary users. Before it fully gets there, I want to grasp what is happening, as far as I can, for myself. I cannot enter the model development race. But building my own tools is itself a way of keeping some control on my side, however small. At least I can see what is happening firsthand.</p>
]]></content:encoded>
</item>
<item>
  <title>Looking for a Multi-Account Email Solution</title>
  <link>https://yohasebe.com/posts/2026-06-05-email-aggregation/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-06-05-email-aggregation/</guid>
  <pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate>
  <description>Multi-account email aggregation used to be a normal thing. It quietly stopped being normal. Notes on what changed, and on reviving a small</description>
  <content:encoded><![CDATA[<p>There was a time when running several email accounts through one inbox was the obvious thing to do. Gmail would pull mail from your Yahoo account, or your school or work address, over POP, and you could send from those addresses too. One inbox, many addresses.</p>

<p>That was a few years ago. Gmail has been winding down its "Check mail from other accounts" feature and has signaled that further changes are coming. Microsoft has tightened authentication, and the old way of pulling Outlook addresses into Gmail no longer works. Each provider has retreated into its own walled garden.</p>

<p>For someone with email spread across several services, there are few good options now. Cloud-side aggregators have mostly vanished, and local clients come with their own trade-offs, most notably being tied to a specific device.</p>

<p>Back in 2022 I worked through this and moved to <a href="https://fastmail.com">Fastmail</a>. It is paid, but it handles the multi-address case cleanly. It is independent of the big platforms, lets you host your own domain, and sticks to open standards. I also built a small Chrome extension on the side, <a href="https://chromewebstore.google.com/detail/fastmail-plus/ibgnnkojbkconppocnmdobeodcaijmfm">Fastmail Plus</a>, to smooth over a few UI quirks that bothered me. It had sat untouched for a long stretch. I brought it up to date with the current UI this week.</p>

<p><img src="fastmail-plus.png" alt="Fastmail Plus browser extension banner: large &quot;Fastmail Plus&quot; logo on a blue background, with a small screenshot of the Fastmail web UI showing the extension's panel listing features like Quick search mode switch, Cursor key navigation, Extra shortcut keys, Non-clutter viewing mode, and Reading pane control buttons. The tagline reads &quot;A browser extension to make Fastmail web UI more usable and productive&quot;" /></p>

<p>Authentication standards have been tightening across the industry, and POP-based bridges are being phased out. Email itself is not going anywhere, but the way to handle several addresses now is to pick a single host and live there.</p>
]]></content:encoded>
</item>
<item>
  <title>Image Generation as a Classroom Visual Aid</title>
  <link>https://yohasebe.com/posts/2026-05-29-classroom-image-generation/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-05-29-classroom-image-generation/</guid>
  <pubDate>Fri, 29 May 2026 00:00:00 +0000</pubDate>
  <description>Using Monadic Chat's Image Generator to make grid visuals on the fly for student talks, when topics are picked individually and cannot be</description>
  <content:encoded><![CDATA[<p>At my institution I teach several first-year English communication courses for students from a range of faculties. In several of them, <a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a> has come to play something like an assistant-teacher role. Voice conversations with students, on-the-fly vocabulary or idiom examples, cultural commentary on items that come up in reading materials. The teacher's moderation remains essential, but recent high-end LLMs have the capability needed for this kind of use.</p>

<p>One format I have been running lately, with groups of around 15 or 16 students, is short one-minute English talks on a given theme. "A foreign city I would like to visit." "The greatest invention or discovery in human history." Themes broad enough that each student can choose something personal, specific enough that they have something to say.</p>

<p>Some kind of visual aid usually makes these sessions more engaging. But the talks happen one at a time, students pick their own topics, and I do not know what all of them are until they speak. There is no preparing a slide deck in advance.</p>

<p>Monadic Chat's Image Generator, backed by OpenAI's <code>gpt-image-2</code>, solves this. A prompt like:</p>

<blockquote>
  <p>Create a matrix-style image of the following N items, with numbered labels.</p>
</blockquote>

<p>together with the list of student topics produces a single grid image on the spot, which I project on the classroom screen as the talks proceed.</p>

<p><img src="inventions-grid.png" alt="A grid image of seventeen inventions and discoveries chosen by students, with numbered cells: Electricity, Alarm Clock, Instant Noodle, Emoji, Medicine, Railway, Lens, Paper, Plastic, Language, Camera, Airplane, Ship, Air Conditioner, Cup Ramen, Writing System, Internet" /></p>

<p class="caption">Theme: "The greatest invention or discovery in human history."</p>

<p><img src="cities-grid.png" alt="A grid image of sixteen foreign cities chosen by students, with numbered cells: Orlando, Helsinki, Stockholm, Venice, Stroud, Banff, Lucerne, Seattle, Ho Chi Minh, Prince Edward Island, Seoul, Vancouver, Reykjavik, Paris, Cologne (Köln), Rovaniemi" /></p>

<p class="caption">Theme: "A foreign city I would like to visit."</p>

<p>ChatGPT or Gemini can do this just as well, and plenty of teachers already work this way. The shape of classroom teaching is certainly shifting.</p>
]]></content:encoded>
</item>
<item>
  <title>Conversation Memory That Stays on Your Laptop</title>
  <link>https://yohasebe.com/posts/2026-05-25-monadic-chat-local-memory/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-05-25-monadic-chat-local-memory/</guid>
  <pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate>
  <description>Monadic Chat now uses a local multilingual embedding model and a local vector DB. Indexing and retrieval of past chats happen entirely on</description>
  <content:encoded><![CDATA[<p>You remember talking with the AI assistant about something months back, like a config you sorted out or a paper you summarized. But finding it again means figuring out which past chat to even open. Most LLM clients don't search across your past chats at all, and the ones that do tend to ship your messages to a cloud embedding API just to make them searchable.</p>

<p>The latest <a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a>, an AI chat platform I work on, moves this end to end onto your machine.</p>

<p>The earlier setup used OpenAI's <code>text-embedding-3-large</code> for the help system and PDF knowledge base, with PGVector for storage. It worked for static documents, but extending the same path to conversations would have meant every chat passing through a cloud embedding endpoint just to be findable later. I wanted to avoid that. The new embedding pipeline is fully local:</p>

<ul>
  <li>An embeddings container running <a href="https://huggingface.co/intfloat/multilingual-e5-base"><code>multilingual-e5-base</code></a> (sentence-transformers)</li>
  <li>A <a href="https://qdrant.tech/">Qdrant</a> container holding the vectors</li>
  <li>A two-level embedding scheme: a summary per conversation, plus the individual turns inside it</li>
</ul>

<p>Qdrant is an open-source database designed for embeddings: it stores them and searches by similarity in meaning rather than exact wording.</p>

<p>The same store also holds PDF and document content if you've imported any, so search works uniformly across whatever you've put into the knowledge base.</p>

<p>Search works as a cascade: a query first hits conversation summaries, then drills into the matching conversations turn by turn. The retrieval is wired up as a RAG tool too, so past content can be pulled into the current chat as context when the model decides it's useful.</p>
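
<p>As a rough illustration of the two-stage idea, here is a minimal sketch in plain Python. It is not Monadic Chat's code: the real pipeline uses <code>multilingual-e5-base</code> vectors stored in Qdrant, whereas this toy version keeps a few hand-written three-dimensional vectors in memory. The names <code>cascade_search</code>, <code>summary_vec</code>, and <code>CONVERSATIONS</code> are hypothetical.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cascade_search(query_vec, conversations, top_conversations=2, top_turns=3):
    """Two-stage search: summaries first, then turns inside the best matches."""
    # Stage 1: rank whole conversations by how well their summary matches.
    ranked = sorted(
        conversations,
        key=lambda conv: cosine(query_vec, conv["summary_vec"]),
        reverse=True,
    )
    shortlisted = ranked[:top_conversations]

    # Stage 2: score individual turns, but only in the shortlisted conversations.
    scored_turns = []
    for conv in shortlisted:
        for turn in conv["turns"]:
            score = cosine(query_vec, turn["vec"])
            scored_turns.append((score, conv["id"], turn["text"]))
    scored_turns.sort(key=lambda item: item[0], reverse=True)
    return scored_turns[:top_turns]


# Toy data (hypothetical): real embeddings would have hundreds of dimensions.
CONVERSATIONS = [
    {
        "id": "nginx-config",
        "summary_vec": [1.0, 0.0, 0.0],
        "turns": [
            {"text": "How do I set up a reverse proxy?", "vec": [0.9, 0.1, 0.0]},
            {"text": "Thanks, that worked.", "vec": [0.1, 0.1, 0.9]},
        ],
    },
    {
        "id": "paper-summary",
        "summary_vec": [0.0, 1.0, 0.0],
        "turns": [
            {"text": "Summarize this paper on embeddings.", "vec": [0.0, 1.0, 0.1]},
        ],
    },
]

if __name__ == "__main__":
    for score, conv_id, text in cascade_search([1.0, 0.0, 0.0], CONVERSATIONS):
        print(f"{score:.2f}  {conv_id}: {text}")
```

<p>The point of the first stage is that turn-level comparison only happens inside conversations whose summary already looks relevant, which keeps the second stage small.</p>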

<p>And if you pick an Ollama-backed model for the chat itself, it runs locally too.</p>
]]></content:encoded>
</item>
<item>
  <title>The Cloud Sees PERSON, You See Your Name</title>
  <link>https://yohasebe.com/posts/2026-05-18-monadic-chat-privacy-filter/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-05-18-monadic-chat-privacy-filter/</guid>
  <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
  <description>Monadic Chat detects personal info locally and replaces it with typed placeholders before sending to cloud LLMs. You still see your</description>
  <content:encoded><![CDATA[<p>You have something personal to ask the AI assistant. A draft email with names in it. A summary of a doc with email addresses or phone numbers. You want the LLM's help but pasting it straight into a cloud provider stops you. So you redact by hand, paste the masked version, mentally translate back when reading the response. Tedious, and the LLM doesn't know what kind of thing you redacted.</p>

<p>The latest <a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a>, an AI chat platform I work on, does this for you, invisibly. You type the original. Before sending, a local detector replaces PII (personally identifiable information) with typed placeholders. The LLM gets <code>&lt;&lt;PERSON_1&gt;&gt;</code>; you keep seeing the real name on screen. When the response comes back, any placeholders it preserved are restored to the original values for display.</p>

<p><img src="privacy-filter.png" alt="Monadic Chat Chat Plus conversation: a user message asking for a draft email to &quot;Kevin Park&quot; with office number &quot;555-123-4567&quot;, and the assistant's email reply containing the same values. Both occurrences of the name and phone number are visibly highlighted, indicating the privacy filter detected them as PII" /></p>

<p><img src="privacy-dashboard.png" alt="Privacy Registry modal listing two mappings: PERSON_1 to Kevin Park (PERSON), PHONE_NUMBER_1 to 555-123-4567 (PHONE_NUMBER). A note above the table reads &quot;This list is held in memory only and is cleared when the conversation is closed. Copy is intentionally disabled to avoid accidental disclosure.&quot;" /></p>

<p>The detector runs in its own Docker container with <a href="https://microsoft.github.io/presidio/">Microsoft Presidio</a> + <a href="https://spacy.io/">spaCy</a>. Presidio handles pattern-based categories (email, phone, credit card, SSN, IP, URL, and others); a local spaCy NER model handles context-dependent ones (people, organizations). Detection is multilingual, supporting 9 languages.</p>

<p>A few properties that matter:</p>

<ul>
  <li><strong>Per-session</strong>: each chat gets its own placeholder registry, in memory. Nothing accumulates as a growing on-disk record of who you've talked about. If you export an encrypted session and import it again later, the registry is restored alongside the conversation, so masking keeps working on follow-up turns.</li>
  <li><strong>Per-app</strong>: the toggle shows up only on apps where masking makes sense (Chat Plus, Mail Composer, Translate, Second Opinion). Apps focused on code generation or media output don't expose it, since their inputs rarely carry personal information.</li>
  <li><strong>Fail-closed</strong>: in the unlikely event the privacy container can't run, the main chat still works, but background calls that would send extra text to the cloud (title suggestion, second opinion) are skipped rather than sent without masking.</li>
  <li><strong>Detection is probabilistic</strong>: pattern-based categories (emails, phones, etc.) are caught reliably, but the NER paths for names and organizations will sometimes miss novel forms or edge cases. The filter substantially reduces what leaves the machine; it doesn't guarantee zero leakage.</li>
</ul>

<p>The LLM only sees the type label (<code>&lt;&lt;PERSON_1&gt;&gt;</code>, <code>&lt;&lt;EMAIL_1&gt;&gt;</code>), not the actual value. It knows what kind of thing is hidden, but not which specific one. For rewriting, summarizing, translating, drafting, that level of knowledge is enough. For tasks where the specific value matters, you can just turn the filter off for that session.</p>
]]></content:encoded>
</item>
<item>
  <title>ccm: An Attention Manager for Parallel Claude Code Sessions</title>
  <link>https://yohasebe.com/posts/2026-05-04-ccm-attention-manager/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-05-04-ccm-attention-manager/</guid>
  <pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate>
  <description>When you run Claude Code across several projects in parallel, the bottleneck stops being compute and starts being attention. ccm is a tmux</description>
  <content:encoded><![CDATA[<p>When you run Claude Code for one project at a time, switching contexts feels manageable. You wait for it to finish, you respond, you move on. Push that up to four or five projects in parallel and the picture changes. The compute is no longer the bottleneck. Your attention is.</p>

<p>I have been building <a href="https://github.com/yohasebe/tmux-ccm">ccm for tmux</a> over the past few months as an answer to this. It does several things now, but the pieces that defined the project from the start trace back to a single question: <em>which window should I be looking at right now?</em></p>

<h2 id="the-mental-model">The mental model</h2>

<p><img src="mental-model.svg" alt="ccm mental model: a tmux session contains windows, each window equals one project (one folder), every window holds Claude Code plus other panes (shell, dev server, any long-running process), the state of the Claude Code pane (PERMIT, BUSY, IDLE) aggregates back to the window, and ccm switches between windows with prefix + Tab" /></p>

<p>ccm assigns each project to its own tmux window. The window can hold Claude Code alongside a shell, a dev server, or any other long-running process. The state of the Claude Code pane (PERMIT, BUSY, IDLE) propagates up to the window itself, and ccm lets you move between windows with <code>prefix + Tab</code>.</p>

<h2 id="states-that-map-to-urgency">States that map to urgency</h2>

<p>Each Claude Code session moves through phases. It is reasoning or producing output (BUSY). It pauses to ask permission before taking some action (PERMIT). It finishes and waits for the next prompt (IDLE). Or nothing is currently running in the window (SHELL). ccm reads each tmux pane's contents in the background and tags the window with one of these states.</p>

<p>The four states sort cleanly by urgency:</p>

<blockquote>
  <p>PERMIT &gt; BUSY &gt; IDLE &gt; SHELL</p>
</blockquote>

<p>PERMIT is the most demanding: a session has presented options and is waiting for you. Miss it and the session sits frozen. BUSY is next – a session is producing output that you may want to follow or course-correct. IDLE is comfortable: the session has done what was asked, and there is no pressure. SHELL is the lowest priority; nothing is running, so nothing needs you unless you decide to start something.</p>

<p>ccm also moves windows to SHELL on its own. A session that has been idle for a while is auto-exited to free memory and CPU, then restarted with <code>--continue</code> (resuming the previous context) when you switch back. The state ordering is what the dashboard surfaces:</p>

<p><img src="dashboard.png" alt="ccm dashboard showing multiple projects with states, branches, and ports" /></p>

<p>PERMIT rows land at the top regardless of original window order; the eye moves down from there. Switching to a project becomes a single keystroke once the decision is made.</p>
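
<p>The ordering can be sketched in a few lines of Python. This is only an illustration of the idea, not ccm's implementation; the names <code>URGENCY</code> and <code>dashboard_order</code> and the window dictionaries are hypothetical.</p>

```python
# Lower rank means more urgent: PERMIT, then BUSY, then IDLE, then SHELL.
URGENCY = {"PERMIT": 0, "BUSY": 1, "IDLE": 2, "SHELL": 3}


def dashboard_order(windows):
    """Return windows sorted by urgency.

    sorted() is stable, so windows that share a state keep their
    original relative order.
    """
    return sorted(windows, key=lambda window: URGENCY[window["state"]])


if __name__ == "__main__":
    windows = [
        {"name": "blog", "state": "IDLE"},
        {"name": "api", "state": "PERMIT"},
        {"name": "docs", "state": "BUSY"},
        {"name": "infra", "state": "SHELL"},
    ]
    for window in dashboard_order(windows):
        print(window["state"], window["name"])
```

<p>Using a stable sort means the original window order still decides ties, so rows only move when their state actually calls for it.</p>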

<h2 id="coordination-across-projects">Coordination across projects</h2>

<p>A second feature applies the same state model across projects. <code>ccm send &lt;project&gt; &lt;message&gt;</code> queues a prompt into another project's session, with safety gating based on the target's current state. PERMIT-state windows are refused unconditionally – typing into a permission dialog could accidentally approve or reject a tool call. BUSY-state targets need explicit <code>--force</code> (the input is queued rather than mixed mid-turn). SHELL-state windows can auto-start Claude Code first.</p>
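
<p>The gating rules read naturally as a small decision function. The sketch below is a hypothetical Python rendering, not ccm's source: <code>gate_send</code> and its return labels are made up for illustration, and treating IDLE as "deliver directly" is my assumption about the one case not spelled out above.</p>

```python
def gate_send(target_state, force=False):
    """Decide what to do with a message aimed at a window in target_state."""
    if target_state == "PERMIT":
        # Never type into a permission dialog, even with force.
        return "refuse"
    if target_state == "BUSY":
        # Only accepted with an explicit force flag, and then queued
        # rather than mixed into the turn in progress.
        return "queue" if force else "refuse"
    if target_state == "SHELL":
        # Nothing is running yet, so start the session before sending.
        return "start-then-send"
    if target_state == "IDLE":
        # Assumption: an idle session can take the prompt right away.
        return "send"
    raise ValueError(f"unknown state: {target_state}")


if __name__ == "__main__":
    for state in ("PERMIT", "BUSY", "IDLE", "SHELL"):
        print(state, gate_send(state), gate_send(state, force=True))
```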

<p>This turns inter-project coordination into something close to message passing. The session in project A can – through me, as the human in the loop – ask the session in project B about its current state, share findings, request follow-up work. The same states that organize my attention also gate what flows between projects.</p>

<h2 id="with-agent-teams">With Agent Teams</h2>

<p>Once <code>ccm send</code> makes inter-project coordination routine, a natural next question follows: could the coordination itself be delegated to AI? Within a single project, Claude Code already does this through <a href="https://code.claude.com/docs/en/agent-teams">Agent Teams</a>, where one session coordinates teammates running in parallel panes within a single window (tmux or iTerm2).</p>

<p>Because the two structures (panes within a window, windows across a session) do not overlap, they compose naturally. A ccm-managed window can host an Agent Team, and ccm aggregates state across the team's panes using the same priority order. If any teammate is at PERMIT, the whole window appears as PERMIT in the dashboard – so a single agent waiting for permission surfaces immediately even when focus is on a different teammate, and the human-side question stays the same: which window should I be looking at right now?</p>
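
<p>In code terms, the aggregation amounts to taking the most urgent state among a window's panes. A minimal Python sketch follows, with hypothetical names (<code>URGENCY_ORDER</code>, <code>window_state</code>); mapping a window with no tracked panes to SHELL is an assumption of mine, not something ccm documents.</p>

```python
# Most urgent first, matching the priority order used for the dashboard.
URGENCY_ORDER = ["PERMIT", "BUSY", "IDLE", "SHELL"]


def window_state(pane_states):
    """Collapse the states of all panes in a window into one window state."""
    if not pane_states:
        # Assumption: no tracked panes means nothing is running.
        return "SHELL"
    # The pane whose state appears earliest in URGENCY_ORDER wins.
    return min(pane_states, key=URGENCY_ORDER.index)


if __name__ == "__main__":
    # One teammate waiting for permission marks the whole window as PERMIT.
    print(window_state(["BUSY", "IDLE", "PERMIT"]))
    print(window_state(["IDLE", "BUSY"]))
```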

<h2 id="attention-as-the-human-ai-interface">Attention as the human-AI interface</h2>

<p>The opening observation – attention, not compute, is the bottleneck – seems to sharpen as more work moves to agents. Attention may be one of the places where humans and AI still meet, with the human role on that surface narrowing toward what to bring forward, what to ignore, when to interrupt. ccm is a small piece of that surface, and the rest is something I want to keep thinking about.</p>
]]></content:encoded>
</item>
<item>
  <title>Watching Words Appear: Real-time STT and L2 Listening</title>
  <link>https://yohasebe.com/posts/2026-04-25-speechdock-listening-mirror/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-25-speechdock-listening-mirror/</guid>
  <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
  <description>Why captions can hurt listening practice, and how the lag in real-time STT might be useful</description>
  <content:encoded><![CDATA[<p>When we learn a foreign language, listening is, for many learners, the skill that resists practice the most. Reading and writing happen on our schedule. Speaking we control. But listening means catching speech as it goes by, in real time, with no rewind.</p>

<p>The most common aid is captions. Watch a video with subtitles and you can follow along. But captions can undermine the very thing we are trying to practice. Visual information tends to take priority over audio in our cognitive processing, and when the two arrive together, the work of extracting meaning from sound alone is reduced. Since captions are timed to the audio – or slightly ahead – the listener ends up reading rather than listening.</p>

<p>To address part of this, <a href="https://yohasebe.com/tcse/">TCSE</a> has long had a fullscreen mode (described in <a href="../2026-04-18-tcse-fullscreen/index.html">an earlier post</a>) where no transcript shows during playback; only the previous and current lines appear the moment you pause. The listener has to attempt interpretation first, and only then check. But the answer-checking happens at sentence boundaries. You hear a whole sentence, decide what you think it said, then pause to verify. The granularity is coarse.</p>

<p><a href="https://github.com/yohasebe/speechdock">SpeechDock</a> is a Mac app I have been building for speech-to-text and text-to-speech tasks. Unlike most STT apps, it can take its input from the microphone, the system audio mix, or the audio from a specific application. One of its modes runs that input through real-time STT (either macOS-native or a cloud provider) and displays the result as a HUD overlay on top of whatever is on screen.</p>

<video controls="" preload="metadata" src="ask-nasa-transcription.mp4" style="max-width:100%;"></video>

<p><em>SpeechDock real-time transcription overlaid on NASA's <a href="https://images.nasa.gov/details/NHQ_2020_0127_AskNASA┃%20How%20Will%20Astronauts%20Dig%20on%20the%20Moon">Ask NASA: How Will Astronauts Dig on the Moon?</a> (00:16–00:46, public domain). The demo uses macOS's native STT, with VLC's audio output specified as the SpeechDock input source.</em></p>

<p>Real-time STT cannot show text at the same moment as the audio. But human listening lags too: speech has to be heard, segmented, recognized, and interpreted, and later input can always force a revision of the parse.</p>

<p>Real-time STT makes this incremental, revisable parsing visible. The HUD is not a transcript that hands you the answer; it shows a process that mirrors the cognition you are trying to develop, unfolding before your eyes.</p>

<p>Whether this kind of tool actually helps L2 learners improve their listening is an open empirical question. If anyone wants to design a proper study together, please get in touch.</p>
]]></content:encoded>
</item>
<item>
  <title>Filing in Trees, Finding in Fragments</title>
  <link>https://yohasebe.com/posts/2026-04-25-fzf-alfred-workflow/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-25-fzf-alfred-workflow/</guid>
  <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
  <description>Why filing and finding pull in different directions, and what fzf in Alfred does about</description>
  <content:encoded><![CDATA[<p>When I look for something on my computer, the things I remember come in pieces. The folder it might be in. A word from the filename. A topic. They arrive in no particular order. But the standard way to find a file – open a folder window, descend one level, then another – demands that I produce them in exactly the right sequence.</p>

<p>I only notice this when I am finding things. When I am putting them away, I am happy to take my time, placing each file somewhere sensible. Filing and finding feel like different kinds of work.</p>

<h2 id="filing-can-wait-finding-cannot">Filing can wait, finding cannot</h2>

<p>Filing can wait. I have a moment to decide which project, which year, which subfolder. Getting it slightly wrong is fine – I can move the file later.</p>

<p>Finding cannot wait. By the time I notice I want a file, I want it now. Whatever fragments come to mind should be enough.</p>

<p>There are two common ways to organize files in the face of this. Neither works for me.</p>

<p>The first abandons hierarchy: keep everything in one flat space, tag each file, and search by tag. Retrieval is great. But filing becomes a chore. There is no obvious "right place" for a new file, and tagging well takes more discipline than I have. So I file less.</p>

<p>The second commits to hierarchy in both directions: organize as a tree, find by descending the tree. macOS Finder columns invite this. But finding becomes expensive. I have to know the tree to use it, and the fragments that come to mind first are not always at the top. I open the wrong folder, back out, try again – searching by walking, the slow way.</p>

<p>What both share is the assumption that storage and retrieval should mirror each other. But filing and finding are not the same activity, and forcing them into the same shape makes both worse.</p>

<h2 id="a-way-that-respects-both">A way that respects both</h2>

<p>What I want is to keep the hierarchy I already have, and search across it with whatever fragments come to mind, in any order.</p>

<p>On macOS, the tool that does this for me is <a href="https://github.com/yohasebe/fzf-alfred-workflow">fzf-alfred-workflow</a>, a small Alfred workflow I wrote some years ago. It plugs <a href="https://github.com/junegunn/fzf">fzf</a> and <a href="https://github.com/sharkdp/fd">fd</a> together so any space-separated fragments I type are matched against every path under my home directory, in any order.</p>
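
<p>As a rough illustration of what "in any order" means, here is a minimal Python sketch of order-independent fragment matching. It is not how the workflow is implemented: fzf does fuzzy matching and ranking, while this sketch only checks plain substrings, and the paths are made up for the example.</p>

<pre><code class="language-python">def matches(path, query):
    """True if every space-separated fragment occurs somewhere in the path."""
    haystack = path.lower()
    return all(fragment in haystack for fragment in query.lower().split())


# Hypothetical paths; in the real workflow fd supplies the listing.
paths = [
    "/Users/me/Documents/taxes/2024/receipt-hotel.pdf",
    "/Users/me/Documents/taxes/2023/receipt-train.pdf",
    "/Users/me/Projects/corpus/notes/2024-plan.md",
]

# Fragments in the order they came to mind, not in path order.
hits = [p for p in paths if matches(p, "pdf 2024 tax")]
print(hits)
</code></pre>

<p>The query names the extension first, then the year, then part of a folder name, yet only the 2024 receipt survives, because each fragment is tested against the whole path independently.</p>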

<p><img src="fzf-demo.gif" alt="fzf in Alfred, narrowing results as keywords are added" /></p>

<p>The pleasure of using it is less about speed than about a small relief: I no longer have to remember where things are. Filing stays hierarchical, because there is no rush when I file. Finding takes whatever fragments are in my head.</p>

<p>The same asymmetry shows up in language. Lexical concepts are organized hierarchically – an efficient data structure for organizing the world. But real-time lexical use does not behave like a tree search; it unfolds on a reference-point network, jumping from one anchor to another. The tool feels right because it stops trying to make filing and finding match. They were never going to.</p>
]]></content:encoded>
</item>
<item>
  <title>Monadic Chat: Expressive Speech</title>
  <link>https://yohasebe.com/posts/2026-04-21-expressive-speech/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-21-expressive-speech/</guid>
  <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
  <description>Letting the chat model direct its own delivery -- tags and instructions across four TTS providers in Monadic Chat's Voice Chat</description>
  <content:encoded><![CDATA[<p><a href="https://yohasebe.github.io/monadic-chat/">Monadic Chat</a>'s voice interaction runs on a pipeline: user audio → STT → chat model → TTS → assistant audio. The new <strong>Expressive Speech</strong> feature uses that pipeline by having the chat model emit both the reply text and a set of delivery cues (laugh here, whisper there, sound warm throughout) in the same generation, and passing them to the TTS engine together.</p>

<p>In Monadic Chat's Voice Chat apps, this effectively lets the assistant act as an AI agent that also directs its own delivery – deciding where to laugh, where to whisper, and what overall color the voice should have, all at reply-generation time.</p>

<pre><code class="language-mermaid">flowchart LR
    user["`🎤 **User audio**`"]:::io --&gt; stt["`✍️ **Speech-to-Text**
(API call)`"]:::api
    stt --&gt; llm["`🤖 **Chat model**
(single API call)`"]:::model
    llm --&gt; text["`💬 **Reply text**`"]:::data
    llm --&gt; cues["`🎭 **Delivery cues**
(tags or instructions)`"]:::data
    text --&gt; tts["`🗣️ **Text-to-Speech**
(API call)`"]:::api
    cues --&gt; tts
    tts --&gt; out["`🔊 **Assistant audio**`"]:::io

    classDef io fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    classDef api fill:#ccfbf1,stroke:#0d9488,color:#134e4a
    classDef model fill:#dcfce7,stroke:#16a34a,color:#14532d
    classDef data fill:#ede9fe,stroke:#7c3aed,color:#4c1d95
</code></pre>

<p>Monadic Chat currently supports four TTS providers for this, and their APIs fall into two broad approaches.</p>

<h2 id="inline-markers">Inline markers</h2>

<p>xAI Grok, ElevenLabs v3, and Google Gemini all interpret tags embedded directly in the text. The engine consumes the bracketed tokens as stage directions instead of reading them aloud.</p>

<p><strong>xAI Grok</strong> (model <code>grok-tts</code>, voice <code>ara</code>) is the only provider among the four that pairs point markers (<code>[inhale]</code>, <code>[sigh]</code>, <code>[laugh]</code>) with <em>range markers</em> – opening and closing tags that apply an effect to the enclosed span and then return to normal voice. The wrapping set includes <code>&lt;whisper&gt;</code>, <code>&lt;slow&gt;</code>, <code>&lt;loud&gt;</code>, <code>&lt;high&gt;</code>, and <code>&lt;sing&gt;</code> (<a href="https://docs.x.ai/developers/model-capabilities/audio/text-to-speech#speech-tags">xAI docs</a>).</p>

<pre><code>[inhale] &lt;slow&gt;Let me think about that for a second.&lt;/slow&gt;
&lt;whisper&gt;Actually, between you and me, I already know the answer.&lt;/whisper&gt;
</code></pre>

<audio controls="" src="xai.mp3"></audio>

<p><strong>ElevenLabs v3</strong> (model <code>eleven_v3</code>, voice <code>Rachel</code>) uses single-token square brackets for everything. The vocabulary is emotional (<code>[excited]</code>, <code>[curious]</code>, <code>[sad]</code>, <code>[sarcastic]</code>) and performative (<code>[sings]</code>, <code>[giggles]</code>, <code>[sobs]</code>) (<a href="https://elevenlabs.io/docs/overview/capabilities/text-to-dialogue#emotional-deliveries-with-audio-tags">ElevenLabs docs</a>).</p>

<pre><code>[excited] You won't believe this! [giggles] Oh my goodness.
[sings] La la la, la la la la!
</code></pre>

<audio controls="" src="elevenlabs.mp3"></audio>

<p><strong>Google Gemini</strong> (model <code>gemini-2.5-flash-preview-tts</code>, voice <code>Zephyr</code>) uses syntax similar to ElevenLabs v3, but with a distinctive vocabulary of situational moods: <code>[mischievously]</code>, <code>[panicked]</code>, <code>[amazed]</code>, <code>[trembling]</code>, <code>[gasp]</code>, <code>[shouting]</code>, <code>[tired]</code> (<a href="https://ai.google.dev/gemini-api/docs/speech-generation?hl=en#audio-tags">Gemini speech generation</a>).</p>

<pre><code>[amazed] Look at that cake!
[mischievously] I wonder if anyone would notice a missing slice.
[panicked] Quick, someone's coming!
</code></pre>

<audio controls="" src="gemini.mp3"></audio>
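
<p>Because the cues travel inside the reply text, an application that also wants to display the reply has to separate the two. The following Python sketch does this for the single-token square-bracket style only. It is an assumption about how one might do it, not Monadic Chat's code, and the function name <code>split_cues</code> is hypothetical.</p>

<pre><code class="language-python">import re

# Matches single-token markers such as [excited] or [giggles].
TAG = re.compile(r"\[([a-z ]+)\]")


def split_cues(tagged):
    """Return (clean_text, cues) for text carrying [tag] markers."""
    cues = TAG.findall(tagged)
    clean = TAG.sub("", tagged)
    clean = " ".join(clean.split())  # collapse the gaps the tags leave behind
    return clean, cues


text, cues = split_cues("[excited] You won't believe this! [giggles] Oh my goodness.")
print(text)
print(cues)
</code></pre>

<p>The tagged string would go to the TTS engine unchanged, while the cleaned text is what a chat window would show. Range markers of the kind xAI Grok uses would need a separate rule.</p>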

<h2 id="instructions">Instructions</h2>

<p>OpenAI's <code>gpt-4o-mini-tts</code> splits text and direction. The reply text stays clean, and a separate <code>instructions</code> parameter in the TTS API call specifies how to deliver the whole utterance. Because an instruction shapes the entire delivery rather than a moment, even a short text with a bit of emotional range is enough to hear the effect (<a href="https://www.openai.fm/">openai.fm</a> is a quick playground).</p>

<p>Because the instruction is free-form English, it is not tied to a fixed tag vocabulary. Voice quality, emotional arc, and pacing can all be described the way a director might sketch a scene.</p>

<p><strong>OpenAI</strong> (model <code>gpt-4o-mini-tts</code>, voice <code>ballad</code>):</p>

<pre><code>Input:
  "Oh my goodness, you are not going to believe this!
   You know that guitar piece I've been working on for months?
   I just played it all the way through -- no mistakes, clean as
   anything! I am so happy right now, you have no idea!"

Instructions:
  Voice: bright, giddy, absolutely bursting with joy.
  Tone: pure, uncontainable excitement from start to finish; each
        sentence should sound like it's barely holding the happiness in.
  Pacing: quick and breathless; strong emphasis on "no mistakes" and
          "so happy"; a triumphant beat on "clean as anything".
  Emotion: unfiltered delight and pride; the voice should practically
           glow.
  Delivery: a small laugh or giggle of disbelief should slip through
            after "all the way through"; the voice climbs in intensity
            toward the final line.
</code></pre>

<audio controls="" src="openai.mp3"></audio>

<p>One thing worth noting: with OpenAI, the choice of voice makes a surprising difference to how instructions land. Neutral voices like <code>alloy</code> respond less dramatically, while <code>coral</code> and <code>ballad</code> pick up expressive directives much more readily. The sample above uses <code>ballad</code>.</p>
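
<p>To make the split between text and direction concrete, here is a minimal Python sketch that builds the JSON body for such a call. It assumes the field names of OpenAI's <code>/v1/audio/speech</code> endpoint (<code>model</code>, <code>voice</code>, <code>input</code>, <code>instructions</code>). The helper name <code>build_speech_request</code> is hypothetical, and actually sending the request is left out.</p>

<pre><code class="language-python">import json


def build_speech_request(text, instructions, voice="ballad", model="gpt-4o-mini-tts"):
    """Return a request body that keeps reply text and delivery apart."""
    return {
        "model": model,
        "voice": voice,
        "input": text,                 # clean reply text, no tags
        "instructions": instructions,  # free-form direction for the whole utterance
    }


body = build_speech_request(
    "I just played it all the way through -- no mistakes, clean as anything!",
    "Voice: bright, giddy. Pacing: quick and breathless.",
)
print(json.dumps(body, indent=2))
</code></pre>

<p>Nothing in <code>input</code> needs to change when the direction changes, which is the practical difference from the inline-marker providers.</p>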

<p>Gemini also accepts natural-language direction placed at the front of the prompt, in addition to the tag-based approach seen earlier. Tags and a directive can even be combined in a single prompt – the directive shapes the overall atmosphere while the tags pin down specific moments.</p>

<p><strong>Google Gemini</strong> (model <code>gemini-2.5-flash-preview-tts</code>, voice <code>Zephyr</code>) in hybrid form (same utterance as the OpenAI sample above, delivered with a directive plus inline tags):</p>

<pre><code>Input:
  "Say with this voice and style:
   Voice: bright, giddy, absolutely bursting with joy.
   Tone: pure, uncontainable excitement from start to finish.
   Pacing: quick and breathless; strong emphasis on 'no mistakes' and 'so happy'.
   Emotion: unfiltered delight and pride.

   Oh my goodness, you are not going to believe this! [giggles]
   You know that guitar piece I've been working on for months?
   I just played it all the way through -- no mistakes, clean as anything!
   [laughs] I am so happy right now, you have no idea!"
</code></pre>

<audio controls="" src="gemini-instruction.mp3"></audio>

<p>Fully model-native realtime voice APIs – OpenAI Realtime, Gemini Live, and similar – respond with lower latency and a more natural turn-taking rhythm, but they fold content, voice, and timing into a single model that the application cannot easily redirect. Monadic Chat's pipeline keeps the chat model and the TTS engine as separate stages, so any chat provider can be paired with any TTS provider, and the model – acting as an agent – can shape the content and its delivery in the same reply. The trade-off is immediacy for directorial control.</p>
]]></content:encoded>
</item>
<item>
  <title>TCSE: A Fullscreen Mode for Listening Practice</title>
  <link>https://yohasebe.com/posts/2026-04-18-tcse-fullscreen/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-18-tcse-fullscreen/</guid>
  <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
  <description>A less visible TCSE feature: a fullscreen player that hides the transcript during playback, showing it only when</description>
  <content:encoded><![CDATA[<p>Over the past month I have written about a couple of recent changes to <a href="https://yohasebe.com/tcse">TCSE</a>: <a href="../2026-03-17-tcse-update/index.html">entity search and the 6,400-talk milestone</a>, and the <a href="../2026-04-11-tcse-export/index.html">new export feature</a>. This post is about the education side of the project rather than the research side – an older, less visible feature in the video player for second-language learners who use TED talks for English listening practice.</p>

<h2 id="normal-and-fullscreen-viewing-modes">Normal and fullscreen viewing modes</h2>

<p>When you play a talk in TCSE, the default view shows the video above a scrolling transcript, with translations side by side if a translation language is selected.</p>

<p><img src="normal-mode.jpg" alt="Normal viewing mode with transcript and translation below the video" style="max-width:80%" /></p>

<p class="caption">Normal viewing mode</p>

<p>Clicking the fullscreen icon at the bottom right of the player (or pressing <code>ESC</code>) switches into fullscreen mode. The transcript and translation disappear from the screen during playback. Press the spacebar to pause, and the text for the current segment appears in yellow and the previous segment in cyan, with the translation (if selected) below. Press space again to resume, and the text goes away.</p>

<p><img src="fullscreen-mode.png" alt="Fullscreen viewing mode with the current segment in yellow and the previous segment in cyan when paused" style="max-width:80%" /></p>

<p class="caption">Fullscreen viewing mode (paused)</p>

<p>A few shortcuts are useful during listening practice: <code>R</code> to repeat the current segment, <code>A</code> to toggle "Study Mode" (auto-pause at every segment boundary), and <code>T</code> to show or hide the translation. The full list is on the <a href="https://yohasebe.github.io/tcse-doc/playing-video/player-functions-and-shortcut-keys/">shortcut keys page</a>; the feature itself is documented under <a href="https://yohasebe.github.io/tcse-doc/using-tcse-for-language-learning-and-education/using-pause-and-check/">"pause-and-check"</a>.</p>

<h2 id="why-this-mode-exists">Why this mode exists</h2>

<p>The design rests on a view of L2 listening practice I have held for a long time. If text is on screen the whole time, processing gets pulled toward the visual channel and the learner ends up <em>hearing</em> without really <em>listening</em>; but removing all text does not help either, since comprehension depends on context, and a learner who loses the thread early tends to spend the rest of the talk processing fragments. What is needed, and what pause-and-check tries to support, is <em>spot</em> access to the transcript – available when the thread breaks, not otherwise. The video plays without text in the way, and a tap of the spacebar brings just enough of the transcript back (the current and previous segments) to repair comprehension before resuming.</p>

<p>TCSE is, at its core, a search engine for language research and teaching, and this listening mode is not central to what it does. It is not the feature I would name first if asked what TCSE is for. But it is one of my personal favorites, and I hope language learners who land on a talk through search might find it useful once they have already found what they were looking for.</p>
]]></content:encoded>
</item>
<item>
  <title>What &quot;Monadic&quot; in Monadic Chat Means</title>
  <link>https://yohasebe.com/posts/2026-04-13-monadic-chat-name/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-13-monadic-chat-name/</guid>
  <pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate>
  <description>The origin of the name 'Monadic Chat', connecting the functional programming monad to Langacker's current discourse</description>
  <content:encoded><![CDATA[<p><a href="https://github.com/yohasebe/monadic-chat">Monadic Chat</a> is a project I started in 2022, when the GPT-3 text completion API was the most capable thing one could call from a script. The OpenAI Chat API did not yet exist, and there was no built-in notion of conversational context. To make a chatbot that remembered what had been said, you had to manage the context yourself, on your own terms. I built Monadic Chat as one way of doing that. The name has stuck through several major rewrites, and I would like to explain where it came from – and why it still seems to fit.</p>

<p><img src="architecture.png" alt="Monadic Chat architecture overview" /></p>

<p class="caption">Current Monadic Chat architecture</p>

<h2 id="a-name-from-functional-programming">A name from functional programming</h2>

<p>In functional programming, a <em>monad</em> is a way of wrapping a value together with some surrounding context, such that you can keep operating on the value without having to manage the context by hand. The classic metaphor is "a value in a box": you put <code>a</code> into a box, you transform the contents of the box into <code>b</code>, and the box – whatever it contains – comes along for the ride.</p>

<p>When I was thinking about how to maintain conversational context across stateless API calls, this metaphor seemed to fit. Each user turn could be seen as a value, the conversation history as the surrounding context, and the act of producing a response as a transformation that happens <em>inside</em> that context rather than outside of it. The first version of Monadic Chat was a tiny Ruby program that did exactly this with JSON templates: each request carried an <code>input</code>, an <code>output</code>, and an <code>accumulator</code>, and the response itself became the next input. I wrote a <a href="https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/Q12-9.pdf">short paper</a> about it for the 2023 annual meeting of the Association for Natural Language Processing.</p>
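
<p>The idea of threading context through stateless calls can be sketched in a few lines of Python. This is only an illustration in the spirit of the description above, not the original Ruby program: the field names are borrowed from the post, while <code>unit</code>, <code>step</code>, and the stub <code>echo_model</code> are hypothetical, and the stub stands in for a real completion API call.</p>

<pre><code class="language-python">def unit(history=()):
    """Wrap an initial history in a fresh state (the empty 'box')."""
    return {"input": None, "output": None, "accumulator": list(history)}


def step(state, user_input, respond):
    """Produce a reply inside the context and return a new state."""
    output = respond(user_input, state["accumulator"])
    return {
        "input": user_input,
        "output": output,
        "accumulator": state["accumulator"] + [(user_input, output)],
    }


def echo_model(user_input, history):
    """Stub standing in for a stateless completion API call."""
    return "reply %d to: %s" % (len(history) + 1, user_input)


state = unit()
for turn in ["hello", "what did I just say?"]:
    state = step(state, turn, echo_model)
    print(state["output"])
</code></pre>

<p>The caller only ever supplies the next user turn. The accumulated history rides along inside the state and is handed to the model on every step, so the model itself can stay stateless.</p>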

<h2 id="discourse-as-monad">Discourse as monad</h2>

<p>The reason the monad metaphor felt right was not only that it gave me a useful way to think about the software architecture. It also matched the way cognitive linguists describe what speakers and hearers actually do when they hold a conversation.</p>

<p>Ronald Langacker has written extensively about what he calls the <em>current discourse space</em> – a structured representation of the immediately relevant context that participants share at any moment in a conversation. Each utterance updates this space: new entities are introduced, previous ones recede, the focus shifts, and the updated space becomes the ground for the next utterance. The structure has some parallels with how a monadic computation passes its environment forward through each step.</p>

<p>The monad in functional programming is a precisely defined mathematical object, while the current discourse space is a theoretical construct in cognitive linguistics. I do not want to conflate the two, but the <em>shape</em> is similar enough that bringing them into conversation seemed worthwhile. I have explored the linguistic side of this elsewhere – in a <a href="assets/docs/iclc16-hasebe-2023.pdf">talk at ICLC 16 in 2023</a> and in <em>Ninchi Gengogaku Ronkou</em> (Studies in Cognitive Linguistics) the year after, which I <a href="../2026-01-05-stack-model/index.html">introduced on this blog earlier</a>.</p>

<p>What matters, I think, is that how human speakers actually hold and manage context remains very much an open question in cognitive linguistics, and conversational AI may give us a new kind of object to think with – not only as a practical tool but as something that could feed back into theoretical work on discourse and cognition.</p>

<h2 id="where-the-project-is-going">Where the project is going</h2>

<p>Monadic Chat in 2026 is more eclectic than the original prototype. It speaks to multiple LLM providers, runs containerized tools, and handles images, audio, and video as well as text. It has grown into more of a multi-tool than a focused one. The <a href="https://yohasebe.github.io/monadic-chat/">project page</a> has the current details.</p>

<p>Real discourse is typically multimodal: gesture, intonation, and the visible environment all tend to play a part in how people make meaning, even though formal linguistic theory has often lacked the tools to handle them and has tended to set them aside in analysis. Monadic Chat's handling of images, audio, and video is still only partial, but my hope is that it will become not only a chat tool that gradually grows into such material, but also a medium through which I can come to understand more concretely what context actually is, and what the current discourse space really looks like.</p>
]]></content:encoded>
</item>
<item>
  <title>Whisper Stream: A Unix Building Block for Speech</title>
  <link>https://yohasebe.com/posts/2026-04-12-whisper-stream/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-12-whisper-stream/</guid>
  <pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate>
  <description>Whisper Stream: a single-file bash script that turns a microphone into a text stream via OpenAI or whisper.cpp, for Unix</description>
  <content:encoded><![CDATA[<p><a href="https://github.com/yohasebe/whisper-stream">Whisper Stream</a> is a bash script that records speech, detects silence between utterances using sox, and sends each chunk to either OpenAI's transcription API or a local whisper.cpp model for transcription. I originally wrote it in the summer of 2023 and have not touched it much since. Recently I dusted it off and pushed a v2.0.0 update, mainly to make it more useful as a component in Unix pipelines.</p>

<h2 id="what-it-does">What it does</h2>

<p>The script listens to the microphone, uses sox's silence detection to segment the audio into natural utterances, and transcribes each one. The key design constraint is that it is a single bash file with no dependencies beyond sox and curl (or whisper.cpp for local use). No Python, no Node, no virtual environment.</p>

<p><img src="whisper-stream-demo.gif" alt="Whisper Stream in action" /></p>

<p>The v2.0.0 update adds:</p>

<ul>
  <li><strong><code>--stdout</code> and <code>--jsonl</code> modes</strong> that skip all side effects (clipboard, file save, banners) and emit transcriptions directly to stdout. This makes the script pipe-native.</li>
  <li><strong>Local backend</strong> via whisper.cpp — runs offline with no API cost, making always-on dictation practical.</li>
  <li><strong>Speaker diarization</strong> through OpenAI's newer transcription models.</li>
</ul>

<h2 id="why-it-matters-as-a-building-block">Why it matters as a building block</h2>

<p>The <code>--stdout</code> and <code>--jsonl</code> modes are the important part. A speech-to-text tool that writes to stdout and nothing else can be composed with other programs. Because it is a regular Unix filter, you can build things on top of it without touching the source:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># send a desktop notification when someone says "urgent"</span>
whisper-stream <span class="nt">--stdout</span> | <span class="nb">grep</span> <span class="nt">--line-buffered</span> <span class="nt">-i</span> <span class="s2">"urgent"</span> | <span class="k">while </span><span class="nb">read</span> <span class="nt">-r</span> line<span class="p">;</span> <span class="k">do </span>notify-send <span class="s2">"</span><span class="nv">$line</span><span class="s2">"</span><span class="p">;</span> <span class="k">done</span>

<span class="c"># feed every utterance into an LLM</span>
whisper-stream <span class="nt">--jsonl</span> | jq <span class="nt">-r</span> <span class="s1">'.text'</span> | your-llm-cli
</code></pre></div></div>
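
<p>The same composition works from a scripting language. Below is a minimal Python sketch of a consumer for the <code>--jsonl</code> output. The only thing it assumes about the format is a <code>text</code> field per line, as in the <code>jq</code> example above; the sample records are invented so the block can run without a microphone.</p>

<pre><code class="language-python">import json


def utterances(lines):
    """Yield the text of each well-formed JSONL record, skipping the rest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore partial or non-JSON lines
        if isinstance(record, dict) and record.get("text"):
            yield record["text"]


# Hypothetical sample; replace with sys.stdin to read from the pipe.
sample = [
    '{"text": "this is urgent"}',
    "",
    "not json",
    '{"text": "see you tomorrow"}',
]
urgent = [t for t in utterances(sample) if "urgent" in t.lower()]
print(urgent)
</code></pre>

<p>Because <code>utterances</code> is a generator, passing <code>sys.stdin</code> instead of the sample list would handle each utterance as it arrives rather than waiting for the stream to end.</p>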

<p>A slightly more involved example: a wake-word assistant. The script listens continuously. When it hears a keyword, it plays a short beep to signal readiness, then sends the next utterance to an LLM.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>whisper-stream <span class="nt">--stdout</span> | <span class="k">while </span><span class="nb">read</span> <span class="nt">-r</span> line<span class="p">;</span> <span class="k">do
  if </span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$line</span><span class="s2">"</span> | <span class="nb">grep</span> <span class="nt">-qi</span> <span class="s2">"hey computer"</span><span class="p">;</span> <span class="k">then</span>
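    <span class="c"># ring the terminal bell as the readiness beep</span>
    <span class="nb">printf</span> <span class="s1">'\a'</span>
    <span class="c"># next line from the pipe is the follow-up utterance</span>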
    <span class="nb">read</span> <span class="nt">-r</span> <span class="nb">command
    echo</span> <span class="s2">"</span><span class="nv">$command</span><span class="s2">"</span> | your-llm-cli
  <span class="k">fi
done</span>
</code></pre></div></div>

<p>There is a brief pause between saying the wake word and seeing a response – a few seconds while the utterance is transcribed. But it runs on macOS and Linux alike, and I can see it working on a Raspberry Pi with a microphone.</p>
]]></content:encoded>
</item>
<item>
  <title>TCSE: Exporting Search Results</title>
  <link>https://yohasebe.com/posts/2026-04-11-tcse-export/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-11-tcse-export/</guid>
  <pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate>
  <description>TCSE now exports search results as TSV or JSON, page by page, with match context and metadata for close reading and</description>
  <content:encoded><![CDATA[<p>Over the past year or two, I have received a handful of emails asking for the same thing: could TCSE let users download search results, instead of forcing them to copy and paste hits one by one into a spreadsheet? I agreed every time that it was a reasonable request, put it on my to-do list, and never quite got around to it. That has now finally changed.</p>

<p><a href="https://yohasebe.com/tcse">TCSE</a> now has an export feature, currently in beta. You can find the documentation <a href="https://yohasebe.github.io/tcse-doc/searching-for-words/export-search-results/">here</a>.</p>

<h2 id="page-by-page-export">Page-by-page export</h2>

<p>Each export returns the current page of search results – around 200 items, matching what you see on the screen. The interface already paginates results 200 per page, so "export what I am looking at" is the natural model.</p>

<p>I deliberately avoided a single-request bulk download of thousands of rows. Both server load and the appropriate use of a corpus built on publicly accessible transcripts argued against it, and scoping each export to one page keeps every request as light as an ordinary page view.</p>

<p>If you need the full set of hits for a high-frequency query, you can simply page through and export each page in turn. A short five-second cooldown sits between exports as a simple throttle against accidental double-clicks and overly eager scripts, but it should not get in anyone's way in normal use.</p>

<h2 id="what-each-row-contains">What each row contains</h2>

<p>On any search results page, a small "Export" button offers two formats:</p>

<ul>
  <li><strong>ZIP</strong> — a TSV file containing the rows, plus a <code>metadata.json</code> describing the query and the export</li>
  <li><strong>JSON</strong> — everything in a single JSON file</li>
</ul>

<p>Each row contains not just the matched segment but also <strong>two segments of context before and after</strong>, along with the talk metadata (title, speaker, year, URL, duration). When you search with the Advanced Search syntax, the TSV also includes the part-of-speech, lemma, and dependency label of the matched token. When you have a translation language selected, the translation text is included too.</p>

<p>The TSV is UTF-8 with a byte-order mark, so Excel opens it without garbling non-ASCII characters.</p>

<h2 id="use-cases">Use cases</h2>

<p>A few things this unlocks for TCSE users:</p>

<ul>
  <li><strong>Close reading of hits</strong>: download the hits for a word, construction, or discourse marker and work through them in a spreadsheet or a notebook, annotating by hand or running scripts</li>
  <li><strong>Statistical work</strong>: load the TSV into R or Python, group by year or speaker, count co-occurrences, build visualizations</li>
  <li><strong>Teaching material</strong>: build a small custom dataset for a class assignment without having to set up your own corpus pipeline</li>
</ul>
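
<p>For the statistical use case, a minimal Python sketch of counting hits per year might look like this. The column names (<code>year</code>, <code>speaker</code>, <code>segment</code>) and the sample rows are assumptions made for the example, so check the header of a real export before relying on them. The <code>utf-8-sig</code> codec is what strips the byte-order mark mentioned above.</p>

<pre><code class="language-python">import csv
import io
from collections import Counter


def count_by(tsv_bytes, column):
    """Count rows per value of one column in a BOM-prefixed UTF-8 TSV."""
    text = tsv_bytes.decode("utf-8-sig")  # drops the leading byte-order mark
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Hypothetical export contents, including the BOM that Excel relies on.
sample = (
    "\ufeffyear\tspeaker\tsegment\n"
    "2019\tA\tfirst hit\n"
    "2019\tB\tsecond hit\n"
    "2021\tC\tthird hit\n"
).encode("utf-8")

counts = count_by(sample, "year")
print(counts.most_common())
</code></pre>

<p>Decoding with plain <code>utf-8</code> instead would leave the mark glued to the first column name, and looking up <code>year</code> would then fail.</p>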

<h2 id="beta-status">Beta status</h2>

<p>The feature is live as a beta release. The core functionality is in place, but there may be rough edges to smooth out based on how people actually use it. If you try it out and run into anything odd, or if there is metadata you wish were included but is not, let me know.</p>
]]></content:encoded>
</item>
<item>
  <title>Paradocs: Sentence-by-Sentence Presentations</title>
  <link>https://yohasebe.com/posts/2026-04-07-paradocs/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-07-paradocs/</guid>
  <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
  <description>Paradocs 1.0, a presentation tool that highlights text sentence by sentence -- originally built for ESL reading</description>
  <content:encoded><![CDATA[<p>I released version 1.0 of <a href="https://yohasebe.com/paradocs/">Paradocs</a>, a presentation tool. The name stands for paragraph-oriented document presentation system, reflecting its focus on handling text at the paragraph and sentence level. I originally built it in 2018 for ESL reading classes. The core feature is sentence-by-sentence highlighting as you walk through a text.</p>

<p>In a language class, keeping everyone on the same sentence at the same time is surprisingly hard with existing slide tools. PowerPoint and Google Slides operate at the slide level. Moving focus from one sentence to the next within a single paragraph is not something they are designed to do well. Paradocs was built to solve exactly that.</p>

<p>It runs entirely in the browser with no server required, hosted as a static site. Text data stays in the browser unless you opt in to cloud-based speech synthesis.</p>

<p><img src="images/paradocs-initial.png" alt="Paradocs initial screen" /></p>

<p>Each press of a key advances the highlight to the next sentence. Here is what it looks like in presentation mode:</p>

<p><img src="images/paradocs-demo.gif" alt="Sentence-by-sentence highlighting in Paradocs" /></p>

<p>With v1.0.0, the following features are now in place, marking a milestone:</p>

<ul>
  <li>Text-to-speech. Works for free using the browser's built-in speech synthesis, with optional support for OpenAI and ElevenLabs cloud voices. Word-level highlighting during playback.</li>
  <li>Fill-in-the-blank and multiple-choice quizzes.</li>
  <li>Multilingual UI: English, Japanese, Chinese, and Korean.</li>
  <li>Export as a standalone HTML file for offline use.</li>
  <li>Live preview with filmstrip thumbnails.</li>
  <li>Dark mode for both the editor and the presentation.</li>
</ul>

<p>Paradocs was made for language education, but I imagine there are other situations where stepping through text one sentence at a time is useful. If you find such a use, I would be glad to hear about it.</p>
]]></content:encoded>
</item>
<item>
  <title>RSyntaxTree: Left-to-Right Trees</title>
  <link>https://yohasebe.com/posts/2026-04-06-rsyntaxtree-ltr/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-06-rsyntaxtree-ltr/</guid>
  <pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate>
  <description>RSyntaxTree 1.5 adds left-to-right tree layout, useful for classification hierarchies and deep dependency</description>
  <content:encoded><![CDATA[<p><a href="https://github.com/yohasebe/rsyntaxtree">RSyntaxTree</a> 1.5 is out. The main addition is left-to-right tree layout, where the root sits on the left and leaves expand to the right.</p>

<p>The standard top-to-bottom layout works well for phrase structure trees in generative grammar, where the depth is moderate and the width grows with sentence length. But for classification hierarchies, taxonomies, and certain dependency structures, horizontal trees are a more natural fit. They read like an outline and make better use of screen space when the tree is deep but narrow.</p>

<p>Here is the same sentence tree rendered in both directions. The input is bracket notation:</p>

<pre><code class="language-text">[S [NP [D the] [N cat]] [VP [V sat] [PP [P on] [NP [D the] [N mat]]]]]
</code></pre>

<p>First, the conventional top-to-bottom:</p>

<p><img src="images/ttb-example.svg" alt="Top-to-bottom tree" /></p>

<p>And the same structure laid out left-to-right with <code>-d ltr</code>:</p>

<p><img src="images/ltr-example.svg" alt="Left-to-right tree" /></p>

<p>The feature is available from the command line with the <code>-d ltr</code> flag, and via the web interface.</p>

<p>Where horizontal layout really shines is in classification trees. A taxonomy like the one below becomes unwieldy when drawn top-to-bottom, but reads cleanly from left to right. The <code>-p on</code> flag draws polyline connectors instead of the default diagonal lines, which suits classification diagrams better:</p>

<p><img src="images/classification-ltr.svg" alt="Classification tree in LTR layout" /></p>

<p>Internally, the tree is always computed top-to-bottom, then coordinates are transformed at the rendering stage. The same approach could support other directions. Right-to-left might be useful for visualizing structures in Arabic or Hebrew. Bottom-to-top, while rare in theoretical linguistics, could work well for phylogenetic trees in linguistics or biology. Neither is implemented yet, but both are worth considering.</p>
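
<p>The idea of computing one layout and transforming it at render time can be sketched in a few lines. This is a minimal Python illustration, not RSyntaxTree's actual Ruby code: the node names, the coordinate values, and the function <code>to_left_to_right</code> are all hypothetical.</p>

<pre><code class="language-python"># Hypothetical sketch, not RSyntaxTree's implementation.
# Nodes are laid out top-to-bottom first: x is horizontal position, y is depth.

def to_left_to_right(nodes):
    """Swap the axes so that depth runs along x instead of y."""
    return {name: (y, x) for name, (x, y) in nodes.items()}

ttb = {"S": (50, 0), "NP": (20, 40), "VP": (80, 40)}
ltr = to_left_to_right(ttb)

# The root now sits at x = 0 and its children share a column to its right.
print(ltr["S"], ltr["NP"], ltr["VP"])
</code></pre>

<p>A real renderer would also have to swap node widths and heights and re-anchor the connectors, which this sketch leaves out.</p>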

<p>Install or update: <code>gem install rsyntaxtree</code></p>

<ul>
  <li><a href="https://yohasebe.com/rsyntaxtree">RSyntaxTree web app</a></li>
  <li><a href="https://yohasebe.github.io/rsyntaxtree/examples">RSyntaxTree example gallery</a></li>
  <li><a href="https://yohasebe.github.io/rsyntaxtree/documentation">RSyntaxTree documentation</a></li>
</ul>
]]></content:encoded>
</item>
<item>
  <title>How This Site Works</title>
  <link>https://yohasebe.com/posts/2026-04-05-how-this-site-works/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-05-how-this-site-works/</guid>
  <pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate>
  <description>Under the hood of this site: a custom Ruby static site generator with Mermaid pre-rendering, EXIF stripping, and client-side</description>
  <content:encoded><![CDATA[<p>As I wrote in the <a href="../2026-01-01-hello-world/index.html">first post</a>, this site runs on a custom static site generator I built in Ruby. No framework – just a single script that converts Markdown files into HTML pages. Here is how the pieces fit together.</p>

<h2 id="the-workflow">The workflow</h2>

<pre><code class="language-mermaid">%%{ init: { "flowchart": { "curve": "linear" } } }%%
flowchart LR
    subgraph Local
        write["Write Markdown"] -- build.rb --&gt; preview["Preview"]
        preview -. fswatch .-&gt; write
    end

    subgraph GitHub
        repo[("Repository")]
    end

    subgraph Server["yohasebe.com"]
        pull["git pull"] --&gt; build["build.rb"] --&gt; nginx["nginx"]
    end

    write -- "git push" --&gt; repo
    repo -- "webhook" --&gt; pull
</code></pre>

<p>I write posts in Markdown with YAML front matter. Running <code>build.rb serve</code> starts a local preview server and watches for file changes via fswatch, rebuilding automatically whenever I save.</p>

<p>When I push to GitHub, a webhook notifies my server, which pulls the latest changes and runs the build script; the updated site is then served by nginx. The whole process takes a few seconds.</p>

<h2 id="tech-stack">Tech stack</h2>

<ul>
  <li><strong>Markdown processing</strong>: kramdown with GitHub Flavored Markdown, Rouge for syntax highlighting, KaTeX for math, and Mermaid for diagrams (pre-rendered to SVG at build time)</li>
  <li><strong>Image handling</strong>: Automatic EXIF metadata stripping and responsive sizing based on intrinsic dimensions</li>
  <li><strong>Search</strong>: Client-side full-text search using an inverted index generated at build time</li>
</ul>
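
<p>For readers unfamiliar with inverted indexes, here is a minimal sketch of the data structure behind this kind of search. It is written in Python for brevity and is not the site's actual code: the function names, the tokenization rule, and the sample pages are assumptions made for illustration.</p>

<pre><code class="language-python">import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result = result.intersection(index.get(word, set()))
    return result

# Made-up pages, for illustration only.
pages = {
    "paradocs": "A presentation tool for sentence highlighting",
    "rsyntaxtree": "A tool for drawing syntax trees",
}
index = build_index(pages)
print(sorted(search(index, "tool")))
print(sorted(search(index, "syntax tool")))
</code></pre>

<p>Building the index once at build time means the browser only has to look up words and intersect sets, with no server involved.</p>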

<p>The source code is in a <a href="https://github.com/yohasebe/yohasebe.github.io">public GitHub repository</a>. As I wrote in <a href="../2026-04-01-why-english/index.html">an earlier post</a>, plain text files under Git are the most durable publishing format I know of.</p>
]]></content:encoded>
</item>
<item>
  <title>Named Entities as Typed Variables in Constructions</title>
  <link>https://yohasebe.com/posts/2026-04-02-ner-construction-search/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-02-ner-construction-search/</guid>
  <pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate>
  <description>Treating named entities as typed variables in construction grammar search -- %PERSON said, X since %EVENT -- now in</description>
  <content:encoded><![CDATA[<p>One of the core ideas in construction grammar is that linguistic knowledge consists of form-meaning pairings at every level – from individual morphemes to abstract clause patterns. Many constructions have open slots that speakers fill with words of a certain type. "Give X a Y" wants a recipient and a theme. "The more X, the more Y" wants two scalar expressions.</p>

<p>What if one of those slots expects a named entity – a person, a place, a historical event? Traditional corpus tools can find word patterns, but they have no way to ask for "a country name" or "a famous person".</p>

<p><a href="https://yohasebe.com/tcse">TCSE</a> (TED Corpus Search Engine) recently added support for <code>%ENTITY</code> notation in its advanced search, which treats named entity types as typed variables. <code>%PERSON</code> matches any person name, <code>%GPE</code> matches any country or city, <code>%ORG</code> matches any organization. See the <a href="https://yohasebe.github.io/tcse-doc/searching-for-words/named-entity-search/">documentation</a> for the full list of supported types. Under the hood, the feature uses spaCy's NER annotations, which are stored for every token in the corpus of 6,400+ TED Talks.</p>

<p>This turns out to be useful for finding constructions that would be hard to locate with plain keyword search. Let me show a few examples.</p>

<h2 id="x-since-event"><code>X since %EVENT</code></h2>

<p>Named events – wars, depressions, revolutions – serve as fixed points on a historical timeline that speakers use to anchor claims about the present. The construction pairs <em>since</em> with a named event and treats the event as a temporal benchmark.</p>

<ul>
  <li>the worst humanitarian crisis <strong>since World War II</strong> <a href="https://yohasebe.com/tcse/t/100183/24321/1/sentence/0/f/f">[source]</a></li>
  <li>the greatest transformation <strong>since the Industrial Revolution</strong> <a href="https://yohasebe.com/tcse/t/48495/80643/1/sentence/0/f/f">[source]</a></li>
  <li>the greatest debt crisis <strong>since the Great Depression</strong> <a href="https://yohasebe.com/tcse/t/2859/129968/1/sentence/0/f/f">[source]</a></li>
  <li>the first human to cross the channel by land <strong>since the Ice Age</strong> <a href="https://yohasebe.com/tcse/t/61644/65370/1/sentence/0/f/f">[source]</a></li>
</ul>

<p>The interesting thing here is what <code>%EVENT</code> lets us filter out. A search for <em>since</em> alone returns an enormous set of hits – "since 1973", "since last year", "since we started" – most of which anchor to a date, a time, or a clause rather than to a named occurrence. Swapping in <code>%EVENT</code> isolates a genuine subcategory: the construction where the benchmark is a named happening of any scale that the audience is expected to recognize. That is a distinction you cannot draw with keywords or part-of-speech tags.</p>

<h2 id="the-persons"><code>the %PERSONs</code></h2>

<p>English uses the plural form of a proper noun in several ways. "The Smiths" can mean a specific family. The construction I want to highlight here is different: the name is pluralized not to pick out a family but to stand for a type – "people like X".</p>

<ul>
  <li>they might become <strong>the Darwins of the 21st century</strong>, studying urban evolution <a href="https://yohasebe.com/tcse/t/64774/62028/1/sentence/0/f/f">[source]</a></li>
  <li>It is <strong>the Shackletons of our offices</strong> rather than <strong>the Amundsens</strong>, who serve as role models, who get promoted and who get rewarded. <a href="https://yohasebe.com/tcse/t/130848/607796/1/sentence/0/f/f">[source]</a></li>
  <li>Her name was Dottie and he made a vow that day to always know <strong>the Dotties in his life</strong>. <a href="https://yohasebe.com/tcse/t/70380/54571/1/sentence/0/f/f">[source]</a></li>
  <li>helping <strong>the Sams of the world</strong> <a href="https://yohasebe.com/tcse/t/66921/58207/1/sentence/0/f/f">[source]</a></li>
</ul>

<p>What typically marks this usage is a restricting phrase like <em>of the world</em>, <em>of the 21st century</em>, <em>of our offices</em>, or <em>in his life</em>, which anchors the name to a type rather than a household. In the Darwin, Shackleton, and Amundsen examples, the names are famous ones and the construction invokes the category they stand for – urban ecologists inheriting a scientific tradition, or the reckless explorer versus the methodical one as office archetypes. The Dottie and Sam examples are different: Dottie is a specific person the speaker's friend had noticed in passing, and Sam is the speaker's own name, used to refer generically to people in situations like hers. In neither case does the audience need to identify the individual behind the name. The construction itself signals "read this as a type, not an individual".</p>

<h2 id="meanwhile-in-gpe"><code>meanwhile in %GPE</code></h2>

<p>Place names participate in constructions too. <code>meanwhile in %GPE</code><sup id="fnref:gpe"><a href="#fn:gpe" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> is a discourse marker that introduces a parallel scene – "while that was happening here, over in X something else was going on". It is a narrative device speakers use to cut between settings.</p>

<ul>
  <li><strong>Meanwhile, in New York City</strong>, the NYPD has driven police cars equipped with license plate readers past mosques. <a href="https://yohasebe.com/tcse/t/2149/178141/1/sentence/0/f/f">[source]</a></li>
  <li><strong>Meanwhile in Constantinople</strong>, Anna fought her own battle. <a href="https://yohasebe.com/tcse/t/25874/368709/1/sentence/0/f/f">[source]</a></li>
  <li><strong>Meanwhile, in Australia</strong>, you can find a second type of mammal – marsupials. <a href="https://yohasebe.com/tcse/t/24468/480990/1/sentence/0/f/f">[source]</a></li>
  <li><strong>Meanwhile, in Copenhagen</strong> we're actually expanding the bicycle lanes. <a href="https://yohasebe.com/tcse/t/634/519831/1/sentence/0/f/f">[source]</a></li>
</ul>

<p>One more advantage of NER search is worth pointing out here: multi-token entities are treated as single units. Among the examples above, <em>Copenhagen</em> and <em>Constantinople</em> are single tokens, while <em>New York City</em> is three. spaCy's NER marks each of them as a single GPE entity, so <code>%GPE</code> matches the whole span in each case. A search for <code>meanwhile in %GPE</code> therefore works regardless of how many words the place name has. That kind of chunking is essential when the slot has to accept anything a speaker might name.</p>
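
<p>To make the chunking concrete, here is a small Python sketch of how a typed slot can match an entity span of any length. It is not TCSE's implementation. The token annotations are hand-written stand-ins for what an NER tagger would produce (text, entity label, and a B/I/O tag like the one spaCy exposes), punctuation is assumed to be dropped beforehand, and the functions <code>entity_spans</code> and <code>match</code> are hypothetical.</p>

<pre><code class="language-python"># Hypothetical sketch, not TCSE's implementation.

def entity_spans(tokens):
    """Merge consecutive tokens of one entity into single (text, label) units."""
    units = []
    for text, label, iob in tokens:
        if iob == "I" and units and units[-1][1] == label:
            units[-1] = (units[-1][0] + " " + text, label)
        else:
            units.append((text, label if iob != "O" else None))
    return units

def match(pattern, units):
    """Return matched phrases; a %TYPE slot matches exactly one entity unit."""
    slots = pattern.split()
    hits = []
    for start in range(len(units) - len(slots) + 1):
        window = units[start:start + len(slots)]
        ok = True
        for slot, (text, label) in zip(slots, window):
            if slot.startswith("%"):
                ok = ok and label == slot[1:]
            else:
                ok = ok and text.lower() == slot.lower()
        if ok:
            hits.append(" ".join(text for text, _ in window))
    return hits

# Hand-annotated tokens, for illustration only.
tokens = [
    ("Meanwhile", "", "O"), ("in", "", "O"),
    ("New", "GPE", "B"), ("York", "GPE", "I"), ("City", "GPE", "I"),
    ("the", "", "O"), ("NYPD", "ORG", "B"),
]
units = entity_spans(tokens)
print(match("meanwhile in %GPE", units))
</code></pre>

<p>Because the three tokens of <em>New York City</em> are merged into one unit before matching, the pattern needs only three slots however long the place name is.</p>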

<h2 id="why-this-matters">Why this matters</h2>

<p>None of the constructions above is exotic. Linguists have studied many patterns like them. What is new is the ability to find them systematically across a curated corpus without knowing the filler words in advance.</p>

<p>For a construction grammarian, the ability to specify a slot as "any person name", "any place name", or "any named event" matches the way we think about constructions in the first place – as schemas with typed variables, not as lists of word combinations. NER search lets the search interface catch up to the theory.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:gpe">
      <p><code>%GPE</code> stands for <em>Geopolitical Entity</em> in spaCy's NER scheme. It covers countries, cities, and other place names with political boundaries, as distinct from <code>%LOC</code>, which is used for non-political geographic locations like mountain ranges and bodies of water. <a href="#fnref:gpe" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>
]]></content:encoded>
</item>
<item>
  <title>Why I Write in English</title>
  <link>https://yohasebe.com/posts/2026-04-01-why-english/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-04-01-why-english/</guid>
  <pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
  <description>Why I write this blog in English even though Japanese is my native language, and what LLMs have changed about that</description>
  <content:encoded><![CDATA[<p>Japanese is my native language. So why write this blog in English? There are several reasons.</p>

<h2 id="the-cost-of-writing-in-a-foreign-language">The cost of writing in a foreign language</h2>

<p>Until recently, writing in English meant spending far more time than writing in Japanese – drafting, checking grammar, looking up expressions. LLMs have compressed that gap dramatically. I can now draft, revise, and polish English text with an efficiency that was simply not possible a few years ago. The effort gap between writing in my mother tongue and writing in English has become small enough that it no longer feels like a decisive factor.</p>

<h2 id="writing-in-a-language-that-is-not-yours">Writing in a language that is not yours</h2>

<p>Japanese requires a lot of decisions before you even get to the content. Formal or informal register (<em>desu/masu</em> or plain form)? Which first-person pronoun (<em>watashi</em>, <em>boku</em>, <em>ore</em>, or something else)? Every language has such choices, but the difference is that in your native language, you feel the full weight of them.</p>

<p>English has its own subtleties, of course. But as a non-native speaker, I am largely exempt from them – or at least unaware of them. That turns out to be liberating. I am not trying to write poetry here. I want to describe facts, share observations, and record what I think. Having fewer choices – or more precisely, being unable to see most of them – makes that easier.</p>

<h2 id="reaching-whoever-finds-it-interesting">Reaching whoever finds it interesting</h2>

<p>I am not writing for a specific audience. If anything, I would like these posts to be found by anyone, from any country or cultural background, who happens to care about the same topics. For better or worse, English has established itself as a global lingua franca, and it is the most practical choice for this kind of open-ended reach.</p>

<h2 id="not-wanting-to-get-tired-of-social-media">Not wanting to get tired of social media</h2>

<p>I also want this blog to serve as a kind of personal log – things I have been working on, things I have been thinking about, things I have enjoyed. Social media platforms are designed to let you do exactly that, but they come with a cost: the moment you consider posting something, you start thinking about the reactions. Likes, replies, retweets. That feedback loop subtly reshapes what you write, and sometimes stops you from writing at all. A personal blog with no comment section and no analytics does not have that problem.</p>

<h2 id="owning-the-medium">Owning the medium</h2>

<p>Everything published on the open web will almost certainly be ingested by AI training pipelines. That is just the reality now. If I am going to put things out there, I want to do it on a medium I control as much as possible. This site is built from plain Markdown files stored in a GitHub repository, rendered into static HTML by a small Ruby script, and served from my own server. The content is mine, the edit history is mine, and the format is as durable as anything I know of. Text files in a Git repository will outlast any blogging platform.</p>
]]></content:encoded>
</item>
<item>
  <title>Collocation Networks in TCSE</title>
  <link>https://yohasebe.com/posts/2026-03-30-collocation-network/</link>
  <guid isPermaLink="true">https://yohasebe.com/posts/2026-03-30-collocation-network/</guid>
  <pubDate>Mon, 30 Mar 2026 00:00:00 +0000</pubDate>
  <description>The collocation network feature in TCSE, including the statistical measures behind it and a visualization</description>
  <content:encoded><![CDATA[<p>This post introduces the collocation network feature I built into <a href="https://yohasebe.com/tcse">TCSE</a>, covering the statistical measures behind it and a real visualization example.</p>

<h2 id="what-is-collocation">What is collocation?</h2>

<p>A collocation is a pair of words that co-occur more often than chance would predict. "Make a decision" sounds natural while "do a decision" does not – collocation is the concept that lets us quantify this kind of bond between words.</p>

<h2 id="statistical-measures">Statistical measures</h2>

<p>TCSE uses three statistical measures to assess collocation strength. The formulas below use the notation of a 2x2 contingency table:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Word 2 present</th>
      <th>Word 2 absent</th>
      <th>Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Word 1 present</strong></td>
      <td><span class="math-inline">$O_{11}$</span></td>
      <td><span class="math-inline">$O_{12}$</span></td>
      <td><span class="math-inline">$R_1$</span></td>
    </tr>
    <tr>
      <td><strong>Word 1 absent</strong></td>
      <td><span class="math-inline">$O_{21}$</span></td>
      <td><span class="math-inline">$O_{22}$</span></td>
      <td><span class="math-inline">$R_2$</span></td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><span class="math-inline">$C_1$</span></td>
      <td><span class="math-inline">$C_2$</span></td>
      <td><span class="math-inline">$N$</span></td>
    </tr>
  </tbody>
</table>

<p><span class="math-inline">$O_{11}$</span> is the observed co-occurrence frequency of the two words, <span class="math-inline">$R_1$</span> and <span class="math-inline">$C_1$</span> are the overall frequencies of word 1 and word 2 respectively, and <span class="math-inline">$N$</span> is the total number of tokens in the corpus.</p>

<h3 id="mutual-information-mi">Mutual Information (MI)</h3>

<p><div class="math-block">$$MI = \log_2 \frac{O_{11} \cdot N}{R_1 \cdot C_1}$$</div></p>

<p>MI expresses, on a logarithmic scale, how much the observed co-occurrence exceeds what we would expect under independence. Higher values indicate stronger association.</p>

<p>A known weakness of MI is that it tends to overestimate the strength of low-frequency pairs. To compensate, we use t-score alongside it.</p>

<h3 id="t-score">t-score</h3>

<p><div class="math-block">$$t = \frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$$</div></p>

<p>where <span class="math-inline">$E_{11} = \frac{R_1 \cdot C_1}{N}$</span> is the expected frequency under independence. t-score is better suited for detecting high-frequency, stable collocations.</p>

<h3 id="difference-of-proportions-dp">Difference of Proportions (DP)</h3>

<p><div class="math-block">$$DP = \frac{O_{11}}{R_1} - \frac{C_1 - O_{11}}{N - R_1}$$</div></p>

<p>This takes the difference between the rate at which the collocate appears in the context of the target word and its rate elsewhere. It is intuitive to interpret and can be read as an effect size.</p>
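
<p>The three formulas above translate directly into code. The following Python sketch computes them from the four quantities in the contingency table. The function name <code>collocation_measures</code> is hypothetical and the counts are made up for illustration; they are not TCSE data, and this is not TCSE's own code.</p>

<pre><code class="language-python">import math

def collocation_measures(o11, r1, c1, n):
    """Return (MI, t-score, DP) for co-occurrence count o11,
    word frequencies r1 and c1, and corpus size n."""
    e11 = r1 * c1 / n  # expected co-occurrence under independence
    mi = math.log2(o11 * n / (r1 * c1))
    t = (o11 - e11) / math.sqrt(o11)
    dp = o11 / r1 - (c1 - o11) / (n - r1)
    return mi, t, dp

# A reasonably frequent pair (made-up counts).
frequent = collocation_measures(o11=30, r1=1000, c1=2000, n=1_000_000)

# A rare pair: two words that each occur twice, always together.
rare = collocation_measures(o11=2, r1=2, c1=2, n=1_000_000)

print("frequent:", [round(v, 3) for v in frequent])
print("rare:    ", [round(v, 3) for v in rare])
</code></pre>

<p>With these counts, the rare pair gets a much higher MI than the frequent pair but a lower t-score, which illustrates why the two measures are used together.</p>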

<h2 id="visualization-the-network-for-beautiful">Visualization: the network for "beautiful"</h2>

<p>Below is the collocation network generated by searching for "beautiful" in TCSE (filtered at MI &gt;= 3, 50 nodes).</p>

<p><img src="images/collocation-beautiful.png" alt="Collocation network for &quot;beautiful&quot;" /></p>

<p>Node color represents part of speech: nouns (blue), verbs (red), adjectives (green), adverbs (orange). Node size is proportional to co-occurrence frequency.</p>

<p>Several interesting patterns emerge.</p>

<p>The adverb cluster (stunningly, incredibly, remarkably, exquisitely) consists of intensifiers that amplify "beautiful", reflecting the subjective evaluative function of the adjective.</p>

<p>The noun cluster splits in two directions. Concrete objects (garden, landscape, beach, picture, architecture) and abstract concepts (symphony, pattern, diversity) are positioned separately, showing that "beautiful" spans both sensory and abstract notions of beauty.</p>

<p>The large node for "thing" is also notable. "Beautiful thing" is a frequent expression in TED Talks – a discourse pattern where speakers refer to something as "a beautiful thing" without naming what exactly is beautiful.</p>
]]></content:encoded>
</item>

</channel>
</rss>
