How to make LLM data augmentation resumable | Nie Er

A training-data processing pipeline is a long-running system for cleaning and augmenting LLM training samples. In this project, relational storage, DB cursors, and idempotent writes made the pipeline pauseable, resumable, and auditable.

This is a project note from Nie Er (AILDNC). The client is anonymized as an A-share listed environmental services group. The real corpus was mostly environmental project data, but the platform itself was designed as a general training-data processing layer.

My main lesson was simple: for LLM data augmentation, the hardest part is often not calling the model. It is stopping the job, resuming it, and not corrupting the dataset on the way.

A cleaning job may finish in minutes. An LLM augmentation job can run for tens of hours, sometimes close to a hundred hours depending on the dataset size. Users pause it. Workers fail. Model calls fail. Test environments rerun the same job. Once this becomes a platform feature, “just rerun the script” is not a serious answer.

Why not just stream JSONL files

The obvious design is file-based: read JSONL, clean it, augment it, write another JSONL file.

That would have been simpler.

It was not enough for this platform. Users needed to inspect dataset rows, search them, analyze them, preview the processed content, and then export JSONL for training. Multi-turn samples also had to be split by conversation row, turn number, system prompt, user prompt, reasoning content, response, and weight.

So I normalized the corpus into MySQL. Each record carried fields such as dataset_file_id, row_identifier, turn_taking, and a set of data_* text fields. Soft deletion was handled with is_deleted. When the pipeline needed a full conversation, it grouped rows by row_identifier and sorted by id.
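A minimal sketch of that conversation lookup, under stated assumptions: SQLite stands in for MySQL so the example runs anywhere, the table name dataset_rows is hypothetical, and data_user_prompt and data_response are placeholder names for two of the data_* fields. Only dataset_file_id, row_identifier, turn_taking, is_deleted, and id come from the project itself.

```python
import sqlite3
from itertools import groupby

# SQLite is a stand-in for MySQL. Table and data_* column names are hypothetical.
SCHEMA = """
CREATE TABLE dataset_rows (
    id INTEGER PRIMARY KEY,
    dataset_file_id INTEGER NOT NULL,
    row_identifier TEXT NOT NULL,
    turn_taking INTEGER NOT NULL,
    data_user_prompt TEXT,
    data_response TEXT,
    is_deleted INTEGER NOT NULL DEFAULT 0
)
"""


def load_conversations(conn, dataset_file_id):
    """Return {row_identifier: [row, ...]}, turns sorted by id, soft-deleted rows skipped."""
    rows = conn.execute(
        "SELECT id, row_identifier, turn_taking, data_user_prompt, data_response "
        "FROM dataset_rows "
        "WHERE dataset_file_id = ? AND is_deleted = 0 "
        "ORDER BY row_identifier, id",
        (dataset_file_id,),
    ).fetchall()
    # groupby needs its input sorted by the grouping key, which ORDER BY guarantees.
    return {key: list(group) for key, group in groupby(rows, key=lambda r: r[1])}


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO dataset_rows VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 7, "conv-a", 1, "hi", "hello", 0),
        (2, 7, "conv-a", 2, "more", "sure", 0),
        (3, 7, "conv-b", 1, "q", "a", 0),
        (4, 7, "conv-a", 3, "junk", "junk", 1),  # soft-deleted turn
    ],
)
conversations = load_conversations(conn, 7)
```

Here conv-a comes back with two turns because the third one is soft-deleted, which is the behavior preview and export both rely on.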

This was not prettier than a file stream. It was more useful for this product.

The tradeoff was clear: pagination, preview, analysis, soft deletion, and resumability became easier; database pressure and cursor correctness became my problem.

The real core was the cursor

Pause and resume state was not kept in worker memory. It was persisted as DB cursors.

The execution state was keyed by (task_id, stage, task_type) and persisted:

current_file_id
current_offset
processed_rows
status

On resume, the job loaded the cursor, skipped completed files, and continued from the offset in the matching file.

One small detail mattered more than it looks: once the resume file was reached and its saved offset applied, resume_offset had to be reset to zero. Otherwise the same offset would be applied to every later file, quietly skipping the first N rows of each one.
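The resume rule can be sketched in a few lines. This is an illustration under assumptions, not the project code: files are modelled as in-memory lists instead of paged database reads, and Cursor and iter_batches are hypothetical names. The cursor fields match the persisted ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cursor:
    current_file_id: Optional[int] = None
    current_offset: int = 0
    processed_rows: int = 0
    status: str = "pending"


def iter_batches(files, cursor, batch_size):
    """Yield (file_id, offset, batch), starting from the saved cursor.

    files is an ordered list of (file_id, rows) pairs, a stand-in for DB reads.
    """
    resume_offset = cursor.current_offset
    reached_resume_file = cursor.current_file_id is None  # fresh job: start at once
    for file_id, rows in files:
        if not reached_resume_file:
            if file_id != cursor.current_file_id:
                continue  # this file was completed before the pause
            reached_resume_file = True
        offset = resume_offset
        resume_offset = 0  # the saved offset applies to the resume file only
        while offset < len(rows):
            yield file_id, offset, rows[offset:offset + batch_size]
            offset += batch_size


files = [(1, list("abcd")), (2, list("efgh")), (3, list("ijkl"))]
cursor = Cursor(current_file_id=2, current_offset=2, processed_rows=6, status="paused")
resumed = list(iter_batches(files, cursor, batch_size=2))
```

With the reset in place, file 3 is read from offset 0. Remove that one line and file 3 starts at offset 2, silently dropping rows i and j.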

That kind of bug does not always crash. It just leaves a hole in your data.

Pause used a Redis flag. At the beginning of each batch, the task checked whether it should pause. If yes, it saved the cursor, raised a business exception, and marked the job as paused.

Why not Celery revoke? Honestly, this was not a long architecture debate. Redis was familiar, and this use case needed a graceful pause: stop after saving progress, then resume later. Killing the worker was not the goal.
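A rough sketch of the per-batch pause check, assuming a flag store that exposes get and set the way a Redis client does. FakeFlagStore is an in-memory stand-in so the example runs without a server, and the key layout, TaskPaused, and run_batches are hypothetical names rather than the project's own.

```python
class TaskPaused(Exception):
    """Business exception: progress is saved and the job should be marked paused."""


class FakeFlagStore:
    """In-memory stand-in for a Redis client; only get and set are modelled."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def pause_key(task_id):
    return f"pipeline:pause:{task_id}"  # hypothetical key layout


def run_batches(task_id, batches, flags, save_cursor, process):
    """Check the pause flag before each batch and save the cursor before stopping."""
    for offset, batch in batches:
        if flags.get(pause_key(task_id)):  # any non-empty value means "pause"
            save_cursor(offset)  # offset of the first unprocessed batch
            raise TaskPaused(task_id)
        process(batch)


flags = FakeFlagStore()
saved, done = [], []


def process(batch):
    done.extend(batch)
    if len(done) == 4:  # simulate a user clicking pause mid-run
        flags.set(pause_key(42), "1")


batches = [(0, [1, 2]), (2, [3, 4]), (4, [5, 6])]
try:
    run_batches(42, batches, flags, saved.append, process)
    status = "finished"
except TaskPaused:
    status = "paused"
```

The flag is only read between batches, so the current batch always finishes and the saved offset always points at work that has not started. That is the whole difference from killing the worker.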

10000 and 10 in the same pipeline

The pipeline used two very different batch sizes:

cleaning: 10000
augmentation: 10

Cleaning is mostly CPU-bound rule processing, formatting, and deduplication. A large batch reduces database round trips.

Augmentation is different. Every sample may trigger an LLM call. It is slower, more expensive, and more exposed to external service failures. If the batch is too large, the pause button feels broken because the user has to wait for the current batch to finish.

So the augmentation batch size was kept at 10.

It was not a mathematically optimal number. It was a practical choice: give up some throughput so pause can respond quickly. For long-running jobs, that is often the right tradeoff.

One missing id produced millions of dirty rows

The clearest failure happened in the Response augmentation stage.

That stage read original records from the target table with offset pagination, called the LLM, and wrote the augmented response back to the same table. Writes used INSERT ... ON DUPLICATE KEY UPDATE so a resumed batch would update the same records instead of duplicating them.

At least, that was the intent.

One write path missed the primary key id. Without the id, MySQL had no primary or unique key value to match against an existing row, so the upsert became a plain insert.

The job was now reading from a table while also appending new rows to the same table. offset += batch_size kept moving forward, but the end of the table kept moving away.

The loop never reached the end.

That job ran for four or five days and produced tens of millions of dirty rows. QA caught it. The dirty rows were deleted. The immediate code fix was one line: include 'id': original_record.id in the write payload.

The deeper issue was not that one line. It was the combination of same-table reads and writes, offset pagination, and an implicit rule that every update payload must include the primary key.

Implicit rules eventually get broken.
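One way to take the moving end out of the picture is to replace offset pagination with id-based (keyset) pagination, bounded by the maximum id captured when the scan starts. To be clear, this is a swapped-in technique for illustration, not what the project shipped. The table is modelled as a list of dicts, and iter_by_id is a hypothetical helper.

```python
def iter_by_id(table, batch_size):
    """Keyset pagination bounded by the max id seen when the scan starts.

    table is a list of dict rows ordered by id, standing in for a MySQL table.
    Rows appended during the scan have ids above the snapshot and are never read.
    """
    if not table:
        return
    max_id = max(row["id"] for row in table)  # snapshot upper bound
    last_id = 0
    while True:
        batch = [r for r in table if last_id < r["id"] <= max_id][:batch_size]
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]


table = [{"id": i, "response": "orig"} for i in range(1, 6)]
seen = []
for batch in iter_by_id(table, batch_size=2):
    for row in batch:
        seen.append(row["id"])
        # Simulate the buggy write path: an upsert without id becomes an insert.
        table.append({"id": len(table) + 1, "response": "augmented"})
```

The buggy write still pollutes the table, so this does not replace the id fix. It only guarantees the loop terminates after the original five rows instead of chasing its own output.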

Idempotency cannot depend on memory

Upsert is useful, but it is not a safety net by itself.

After this incident, I became much less tolerant of hidden contracts. If a Response augmentation write cannot safely happen without id, the lower layer should reject the payload. It should fail loudly instead of silently inserting new rows.
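A minimal sketch of what "reject the payload" could look like at the write layer. The names build_update_upsert and UnsafePayloadError, the table name, and the data_response column are hypothetical. The SQL uses %s placeholders in the style of common MySQL drivers, and the table and column names are assumed to come from trusted code, never from user input.

```python
class UnsafePayloadError(ValueError):
    """Raised when an update-style write could silently turn into an insert."""


def build_update_upsert(table, payload, key="id"):
    """Build a parameterised MySQL upsert, refusing payloads without the primary key."""
    if payload.get(key) is None:
        raise UnsafePayloadError(f"update payload for {table} is missing {key!r}")
    columns = list(payload)
    update_columns = [c for c in columns if c != key]
    if not update_columns:
        raise UnsafePayloadError(f"update payload for {table} has nothing to update")
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    assignments = ", ".join(f"{c} = VALUES({c})" for c in update_columns)
    sql = (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {assignments}"
    )
    return sql, [payload[c] for c in columns]


sql, params = build_update_upsert("dataset_rows", {"id": 17, "data_response": "new text"})

try:
    build_update_upsert("dataset_rows", {"data_response": "new text"})
    rejected = False
except UnsafePayloadError:
    rejected = True
```

The check costs one line and turns a multi-day data leak into an exception on the first batch. If the stage only ever modifies existing rows, a plain UPDATE with a WHERE id clause would be stricter still, since it cannot insert at all.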

The same pattern appeared in pause state handling. In one earlier version, only part of the state was updated during pause. The resume API could not find the paused record. The fix was not glamorous: write paused consistently across the execution state, local business state, and the external business system state.

Redundant state is not the ugly part. Half-updated exceptional paths are.

One risk that had not yet exploded

The LLM batch generation path isolated failures per sample. A failed item did not stop the whole batch. It returned an empty string and recorded success/failure counts.

The project material confirms that this had not caused a production incident. No batch of empty responses was known to have entered training data.

Still, the mechanism is too soft. Empty-string placeholders protect task continuity, but they do not protect data quality. If I were tightening this pipeline now, empty or malformed augmentation output would go to a retry or failed-sample queue instead of quietly moving downstream.
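Here is a sketch of the tighter version I have in mind, not code that existed in the project. Failures stay isolated per sample, but a failed or empty result is routed to a failed set with a reason instead of being written as an empty string. BatchResult, augment_batch, and fake_llm are hypothetical names, and call_llm stands for whatever client function makes the model call.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    succeeded: dict = field(default_factory=dict)  # record id -> augmented text
    failed: dict = field(default_factory=dict)     # record id -> failure reason


def augment_batch(records, call_llm):
    """Isolate failures per sample, but never emit an empty placeholder."""
    result = BatchResult()
    for record_id, prompt in records:
        try:
            text = call_llm(prompt)
        except Exception as exc:  # one bad call must not stop the batch
            result.failed[record_id] = f"error: {exc}"
            continue
        if not isinstance(text, str) or not text.strip():
            result.failed[record_id] = "empty output"
            continue
        result.succeeded[record_id] = text
    return result


def fake_llm(prompt):
    """Stand-in for a real model call, with one timeout and one blank answer."""
    if prompt == "boom":
        raise TimeoutError("upstream timeout")
    return "" if prompt == "blank" else f"answer to {prompt}"


result = augment_batch([(1, "q1"), (2, "boom"), (3, "blank")], fake_llm)
```

Only result.succeeded would be written back to the dataset. The failed map is what a retry pass or a failed-sample queue would consume, and it gives export a concrete list of rows to block.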

A long-running data pipeline should not only care whether it finishes. It should care what it leaves behind.

What I would keep from this project

When people discuss training-data pipelines, they usually start with cleaning rules, augmentation prompts, or deduplication algorithms. Those matter. But once the pipeline becomes a product feature, the important work sinks one layer lower:

Is progress persisted outside the worker?
Is pause a business state, not just a killed process?
Is retry idempotent?
Does the write layer reject unsafe payloads?
Are failed samples blocked before export?

These details are not flashy in a demo.

But on day three of an LLM augmentation job, when someone clicks pause, these details are the system.

Related case and project pages are available on aildnc.com. If you are evaluating training-data processing, enterprise RAG, document parsing, or LLM augmentation work, contact me at contact@aildnc.com.