Tibibu: automated pipeline for publishing children's books

The constraint

A friend bought 110 private label rights (PLR) children’s books in .docx format. He wanted to publish them on Amazon KDP under a pen name, with the text rewritten by AI and coloring pages generated from the illustrations. I built the pipeline.

The first problem appeared immediately. Standard text extractors returned duplicate sentences from every file.

The problem: duplicate text in the XML

I opened a .docx file and inspected the raw XML. The text lived inside <AlternateContent> tags, with a <Fallback> branch that repeated the same content for older Word versions.

Every extractor I tried pulled both copies.
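
The duplication is easy to reproduce outside Word. The sketch below uses a heavily simplified fragment: real text boxes nest several more drawing elements between these tags, and the sample sentence is invented. It collects every <w:t> node the way a naive extractor would.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Simplified stand-in for a text box in word/document.xml (illustrative only).
SAMPLE = f"""
<w:p xmlns:w="{W}" xmlns:mc="{MC}">
  <mc:AlternateContent>
    <mc:Choice Requires="wps">
      <w:r><w:t>Once upon a time.</w:t></w:r>
    </mc:Choice>
    <mc:Fallback>
      <w:r><w:t>Once upon a time.</w:t></w:r>
    </mc:Fallback>
  </mc:AlternateContent>
</w:p>
""".strip()

root = ET.fromstring(SAMPLE)

# A naive extractor gathers every text node, including the Fallback copy.
naive = [t.text for t in root.iter(f"{{{W}}}t")]
print(naive)  # ['Once upon a time.', 'Once upon a time.']
```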

I wrote a custom parser that walks the XML tree and skips <Fallback> branches entirely:

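from docx import Document
from docx.oxml.ns import qn
from docx.text.paragraph import Paragraph

# Fallback branches duplicate text-box content for older Word versions.
FALLBACK_TAG = (
    "{http://schemas.openxmlformats.org/markup-compatibility/2006}Fallback"
)

def extract_clean_text(docx_path):
    doc = Document(docx_path)
    clean_paragraphs = []

    # doc.paragraphs only lists top-level body paragraphs, so walk every
    # <w:p> in the body to reach text nested inside <AlternateContent>.
    for node in doc.element.body.iter(qn("w:p")):
        in_fallback = any(
            ancestor.tag == FALLBACK_TAG
            for ancestor in node.iterancestors()
        )
        # Paragraph.text reads only this paragraph's own runs, so nested
        # text boxes are not counted twice.
        text = Paragraph(node, doc).text.strip()
        if not in_fallback and text:
            clean_paragraphs.append(text)

    return clean_paragraphs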

Each .docx also contained exactly three images. I extracted them and classified them by dimensions and format, since the page border and the background art share the same size:

Image type	Dimensions	Format
Cover	~1024x1024	PNG
Page border	2480x3508	PNG (transparent)
Background art	2480x3508	JPEG
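
A minimal sketch of that classification, assuming the images sit under word/media/ inside the .docx archive (the usual location) and that every file there is a raster image Pillow can open. The function names, the labels, and the 10% squareness tolerance for the cover are illustrative choices, not the project's actual code.

```python
import io
import zipfile

from PIL import Image

FULL_PAGE = (2480, 3508)  # size shared by the border and the background art

def classify_image(width, height, fmt):
    """Label an image using the dimensions/format table above."""
    if (width, height) == FULL_PAGE:
        # Same size, so the file format is the only distinguishing signal.
        return "border" if fmt == "PNG" else "background"
    if abs(width - height) <= 0.1 * max(width, height):
        return "cover"  # roughly square, e.g. ~1024x1024
    return "unknown"

def extract_images(docx_path):
    """Return {label: [(archive_name, raw_bytes), ...]} for one .docx."""
    images = {}
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.startswith("word/media/"):
                continue
            data = archive.read(name)
            with Image.open(io.BytesIO(data)) as img:
                label = classify_image(img.width, img.height, img.format)
            images.setdefault(label, []).append((name, data))
    return images
```

Anything that lands under "unknown" is worth inspecting by hand before the book goes further down the pipeline.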

The PDFs bundled with the PLR were broken — background layer flattening had corrupted them. I ignored them and worked from the .docx files only.

The architecture

I structured the project as a monorepo at tibibu-monorepo with three components.

Core processor (`packages/core-processor`)

Four Python scripts handle all processing:

Script	What it does
`docx_parser.py`	Extracts clean text and images from `.docx`
`coloring_filter.py`	Converts illustrations to coloring pages
`ai_rewriter.py`	Rewrites story text via Gemini API
`pdf_engine.py`	Composes and exports KDP-compliant PDFs
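
The rewrite step can be sketched without committing to a particular SDK. In the version below, `generate` is any callable that takes a prompt and returns text; in the real script it would presumably wrap a Gemini API call. The prompt wording, the function name, and the fall-back-to-original behavior are assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical prompt; the actual wording used in ai_rewriter.py is not shown.
PROMPT_TEMPLATE = (
    "Rewrite the following children's story paragraph in fresh wording. "
    "Keep the plot, the characters, and the reading level.\n\n{paragraph}"
)

def rewrite_story(
    paragraphs: List[str], generate: Callable[[str], str]
) -> List[str]:
    """Rewrite each paragraph separately so the page structure survives."""
    rewritten = []
    for paragraph in paragraphs:
        reply = generate(PROMPT_TEMPLATE.format(paragraph=paragraph)).strip()
        # Keep the original text if the model returns an empty reply.
        rewritten.append(reply or paragraph)
    return rewritten
```

Passing the model in as a callable also makes the step testable offline with a stub function.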

The coloring page pipeline in coloring_filter.py:

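from PIL import Image, ImageFilter

def generate_coloring_page(image_path, output_path):
    # Grayscale first: edge detection only needs luminance.
    img = Image.open(image_path).convert("L")
    # A light blur suppresses texture and noise that would become stray lines.
    blurred = img.filter(ImageFilter.GaussianBlur(radius=2))
    edges = blurred.filter(ImageFilter.FIND_EDGES)
    # Pixels above the threshold are edges: draw them black on a white page.
    threshold = 30
    bw = edges.point(lambda p: 0 if p > threshold else 255)
    bw.save(output_path)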

pdf_engine.py composites the background image, overlays the transparent border, and sets the story text in a semi-transparent box. Output is 8.5x11 inches at 300 DPI, a trim size KDP accepts at the resolution it recommends for print.
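
The layering can be sketched with Pillow alone. At 300 DPI an 8.5x11 inch page is 2550x3300 pixels, while the 2480x3508 source assets have A4 proportions, so some scaling or cropping is implied; this sketch assumes a scale-and-center-crop. The box coordinates, opacity, wrap width, and default font are placeholders, not the values pdf_engine.py uses.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont, ImageOps

PAGE_SIZE = (2550, 3300)  # 8.5 x 11 inches at 300 DPI

def compose_page(background, border, text, box=(300, 2300, 2250, 3000)):
    """Stack background, border, and a translucent text box into one page."""
    # Scale and center-crop both layers to the page size (assumed strategy).
    page = ImageOps.fit(background.convert("RGBA"), PAGE_SIZE)
    frame = ImageOps.fit(border.convert("RGBA"), PAGE_SIZE)
    page = Image.alpha_composite(page, frame)

    # Draw the box on its own transparent layer so its alpha blends properly.
    overlay = Image.new("RGBA", PAGE_SIZE, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.rectangle(box, fill=(255, 255, 255, 170))
    font = ImageFont.load_default()  # placeholder; a real book needs a TTF
    wrapped = textwrap.fill(text, width=60)
    draw.multiline_text(
        (box[0] + 40, box[1] + 40), wrapped, fill=(0, 0, 0, 255), font=font
    )
    page = Image.alpha_composite(page, overlay)

    # PDF output has no alpha channel, so flatten to RGB.
    return page.convert("RGB")
```

A single page can then be written with `compose_page(bg, border, story).save("page.pdf", "PDF", resolution=300.0)`, where `resolution` tells Pillow how many pixels map to one inch.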

Creator tool (`apps/tibibu-creator`)

Running scripts manually for 110 books is not workable. I built an internal editor instead.

FastAPI backend with four endpoints:

POST /api/upload          → parses .docx, returns text + images
POST /api/rewrite         → sends text to Gemini, returns rewrite
GET  /api/render-preview  → returns composited page as PNG
POST /api/compile         → generates final PDF

The React frontend connects to these endpoints. I can drag a .docx into the browser, preview the typeset page, adjust font size and text position with sliders, trigger a rewrite, and export the PDF — without touching the terminal.

Storefront (`apps/tibibu-web`)

Next.js App Router with Tailwind CSS. Three pages are live:

Page	Purpose
Homepage	Book catalog with featured titles
Stories gallery	Full catalog, SEO-optimized
Coloring sheets	Email capture, free coloring pages as lead magnet

Current state

The pipeline works end to end. I can take a raw .docx, rewrite it, generate a coloring page, and export a KDP-compliant PDF from the browser.

110 books are queued. None are published yet.

The hard part was not the AI integration. It was the XML.