Introducing Patot: Semantic Chunking for Jewish Texts

When you're building AI systems on top of long-form Jewish texts, one of the first problems you run into is chunking: how do you break a text into pieces that are small enough for a model to handle but still meaningful enough to be useful?

We built Patot to answer that question.

Patot is an open-source Python toolkit for Hebrew/English-aware semantic chunking of Sefaria texts. It's designed to prepare texts for downstream AI workflows like embedding, retrieval, and question answering, and it's built around the specific structure and richness of Sefaria's corpus.

The Problem with Naive Chunking

Most chunking strategies split text by character count or token limit. For general-purpose documents, that's often fine. For Jewish texts, it's a problem.

Sefaria's segment boundaries are meaningful. A chapter of Talmud, a passage of Maimonides, and a section of Tanakh each carry structure that matters. Splitting blindly across these segments loses context, fractures arguments, and degrades retrieval quality.

At the same time, AI models have hard token limits. Ingesting an entire tractate is not feasible.

Patot is designed to hold both constraints at once: respect the text's structure, and keep every chunk model-safe.

How It Works

Patot processes one Sefaria section at a time, using a three-pass pipeline.

Pass 1 runs semantic chunking across the ordered segments of a section, using statistical analysis of Gemini embeddings to identify where semantic continuity drops. Segments that belong together get grouped; a segment that marks a topic shift starts a new chunk. Crucially, chunk boundaries always fall on segment boundaries: Patot never splits a segment to complete a chunk.
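
To make the idea of a "drop in semantic continuity" concrete, here is a minimal sketch of one way such a pass could work. It is not Patot's actual implementation: the function names, the toy two-dimensional vectors (standing in for real Gemini embeddings), and the "mean minus k standard deviations" threshold are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_segments(embeddings, k=1.0):
    """Group consecutive segment indices, breaking where similarity drops.

    Assumed rule (hypothetical, for illustration): start a new group between
    segments i and i+1 when their similarity falls below
    mean - k * stdev of all adjacent-pair similarities.
    """
    if len(embeddings) < 2:
        return [[0]] if embeddings else []
    sims = [
        cosine(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = mean(sims) - k * pstdev(sims)
    groups, current = [], [0]
    for i, sim in enumerate(sims):
        if sim < threshold:
            # Continuity drops here, so close the current group.
            groups.append(current)
            current = []
        current.append(i + 1)
    groups.append(current)
    return groups


# Toy vectors: the first three point one way, the last two another.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.1, 0.9],
]
print(group_segments(toy_embeddings))  # [[0, 1, 2], [3, 4]]
```

Because the function only ever emits lists of whole segment indices, every boundary it produces falls on a segment boundary by construction.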

Pass 2 handles segments that weren't grouped with any neighbors in Pass 1. If a standalone segment is long enough to warrant further splitting, Patot applies the same semantic chunking method to break it into sentence and clause units. Each resulting chunk is therefore either a group of whole segments or a subdivision of exactly one segment.
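
As a rough illustration of the first half of this pass, the sketch below picks out standalone segments that are long enough to subdivide and splits one into sentence and clause units. The helper names, the punctuation-based boundary rule, and the character threshold are assumptions made for this example, not Patot's API; in the real pipeline the resulting units would then go through semantic chunking.

```python
import re

# Assumed boundary rule: split on whitespace that follows sentence or clause
# punctuation, including the Hebrew sof pasuq (U+05C3).
UNIT_BOUNDARY = re.compile(r"(?<=[.!?;:,\u05C3])\s+")


def split_into_units(text):
    """Split one segment's text into sentence and clause units."""
    return [unit for unit in UNIT_BOUNDARY.split(text.strip()) if unit]


def standalone_candidates(groups, texts, min_chars=40):
    """Return indices of segments that sit alone in a group and are at least
    min_chars long (a hypothetical length threshold)."""
    return [
        group[0]
        for group in groups
        if len(group) == 1 and len(texts[group[0]]) >= min_chars
    ]


texts = [
    "Short opening.",
    "A short reply.",
    "Rav said one thing, and Shmuel said another. Why do they disagree?",
    "Brief.",
]
# Groups of segment indices, shaped like the output of a Pass 1 style grouping.
groups = [[0, 1], [2], [3]]

for index in standalone_candidates(groups, texts):
    print(index, split_into_units(texts[index]))
```

Here only segment 2 qualifies: segments 0 and 1 were grouped together, and segment 3 is standalone but too short to be worth subdividing.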

Pass 3 enforces hard token limits. Semantic chunking optimizes for coherence, not compliance, so this final pass validates every chunk against a configured maximum and splits any outliers safely.
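
A validation pass of this kind might look like the sketch below. Two things here are assumptions rather than Patot's behavior: tokens are approximated by whitespace-separated words (a real pipeline would use the embedding model's own tokenizer), and oversized chunks are cut into fixed-size windows, since how Patot actually splits outliers isn't described here.

```python
def count_tokens(text):
    """Stand-in token counter: whitespace-separated words (an assumption)."""
    return len(text.split())


def enforce_limit(chunks, max_tokens):
    """Return a list of chunks in which none exceeds max_tokens.

    Chunks already within the limit pass through unchanged; oversized
    chunks are cut into consecutive windows of at most max_tokens words.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    safe = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_tokens:
            safe.append(chunk)
            continue
        for start in range(0, len(words), max_tokens):
            safe.append(" ".join(words[start:start + max_tokens]))
    return safe


chunks = ["a b c", "d e f g h"]
limited = enforce_limit(chunks, max_tokens=3)
print(limited)  # ['a b c', 'd e f', 'g h']
```

The useful property is the post-condition: after this pass, every chunk can be checked with `count_tokens(chunk) <= max_tokens`, regardless of what the earlier passes produced.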

What You Get

The output is a set of chunks that are semantically coherent, structurally sound, and guaranteed to fit your embedding model. Whether you're building a RAG pipeline, a source-aware Q&A assistant, a topic clustering tool, or a source sheet recommender, Patot gives you a principled foundation to build on.

Try It

Patot is open source and available on GitHub. It's not yet on PyPI, but you can install it directly:

pip install "patot[chunking,pdf] @ git+https://github.com/Sefaria/[email protected]"

Full documentation and usage examples are in the repo. We'd love to hear how you're using it.