Preparing your Data

TL;DR: Before you can do RAG, you need your knowledge in a format machines can read. My notes lived in OneNote and PDFs—locked in, unstructured. I got everything out (export → HTML → Markdown), cleaned and normalized it, and ended up with plain-text Markdown and spreadsheets that the rest of the pipeline could actually use.

For links to all the posts in this series, see Aanae.

Before you can build anything intelligent with AI, you need to answer a simple question: Where is your knowledge, and what format is it in?

For me, the answer was: OneNote.

Why OneNote Became a Problem

OneNote is great for taking notes, but it fails when you try to do anything else with them.

Here’s what went wrong:

Notes pile up in random places. Notebooks, sections, pages—it sounds organized, but everything becomes a mess.
Notes don’t connect. They just sit there, isolated from each other.
Search only finds words, not the ideas behind them.
Your data is stuck. OneNote stores everything in its own format that other tools can’t read.

Most importantly: OneNote won’t let you extract your knowledge as plain text that a computer program can work with.

If you want to use AI with your notes, this is a dealbreaker.

PDFs have the same problem. PDFs are designed to look good on a page, not to be read by software. Paragraphs aren’t clearly marked. Tables and sidebars run together. AI models can’t make sense of pages and fonts—they need plain text with clear structure.

My notes were easy for me to read, but impossible for a computer to process. That difference shaped everything I did next.

Step 1: Getting the Data Out

First task: get all my notes out of OneNote and into a format I could actually work with.

I tried several methods:

OneNote’s built-in export
Converting to HTML first
Third-party conversion tools

None of them worked cleanly. The exports were always messy—weird formatting, broken lists, inconsistent headings.

What eventually worked:

OneNote → export as text → convert to Markdown

Why Markdown?

Markdown is the ‘native language’ of most LLMs. When you give an AI a PDF, it has to guess where a header ends. When you give it Markdown, the # and ## symbols act like signposts. It tells the AI: ‘This is important, pay attention here.’ It saves the AI from having to guess your structure so it can focus on your content

This wasn’t automatic. I had to fix things manually, try different approaches, and accept that the first attempt would be ugly.

Step 2: Cleaning the Mess

The exported files were still a mess:

Lines broke in weird places
The same header appeared over and over
Leftover formatting junk from OneNote
Tables that weren’t actually tables
Lists that lost their meaning

Before trying anything with AI, I had to clean this up. I wrote small scripts and went through files manually to:

Fix the headings
Remove the junk
Make sure each file was about one clear topic
Make the text readable as text, not just pretty on screen

This is boring work. But if you skip it, everything else fails. Bad data doesn’t magically become good data later.

I kept making it a little bit better each time with each pass.

Step 3: Naming Files Properly

Unexpected problem: what should I name each file?

This turned out to matter a lot. I used a system like this:

ECONOMY_NCERT_XI_Macroenomics.md

Each filename told me:

Subject (ECONOMY)
Source (NCERT textbook)
Level (Class XI or XII)
Topic

Why this helped:

I could understand what was in a file without opening it
Computer programs could figure out context from just the filename
When the AI searched for information, it was less likely to get confused
I could quickly see what topics I had and what was missing

File names became a cheap way to add information without having to read the whole file.

Step 4: Dealing with Spreadsheets

Some of my material (statistics, data tables, summaries) was in Excel, not OneNote.

Excel creates different problems:

Rows aren’t the same as paragraphs
Column headers explain what the data means, but that context gets lost
Related information is often spread across multiple rows

I had to decide:

When does one row become its own piece of information?
When do multiple rows need to stay together?
How do I keep the column headers connected to the data?
How do I avoid creating facts that are missing important context?

This taught me something important: preparing data means making your assumptions clear.

Every rule I made was really just an assumption about how things should work. I wrote the rules down, tested them, and changed them when they didn’t work.

Step 5: Making Everything Work Together

At this point, my notes were:

Out of OneNote
Saved as plain text
Organized consistently
Readable without any special software

Only then were they ready to use.

What Changed

By the end of this phase, my notes weren’t just things I had written. They became: inspectable (I could see exactly what was in them), checkable (I could verify the quality), scriptable (programs could work with them), and independent (not locked to any one tool).

Only after this did it make sense to think about the next steps: breaking notes into chunks, adding context, building search systems. Those things only work if the underlying data is solid.

Getting the data ready was when this stopped being a collection of notes and started becoming a system.

Next: Chunking (and why the structure of your chunks matters more than their size)