I trained an AI to write my YouTube titles

TL;DR

I have a science & tech YouTube channel with 23k subscribers. That's pretty cool! But here's the thing—my click-through rate has been stuck at around 5% for months.

I'd spend hours brainstorming titles. I'd ask GPT-4 for help. I'd read every "title formula" blog post on the internet. And I'd get suggestions like:

"10 Mind-Blowing Facts About Quantum Computers"

They sound fine. They're fine. But they're not great. And I was tired of fine.

So I did what any reasonable person would do: I spent way too much time fine-tuning an LLM to write better titles for me. And it actually worked! My latest video got 15% CTR in the first 24 hours—triple my baseline.

Why Titles Are So Hard

Anybody who's tried to create content will tell you: creating the thing is difficult, but titling it is even harder. Because titles have consequences.

You can spend weeks filming, editing, perfecting every frame. But if your title doesn't work? Nobody clicks. And as MrBeast put it: "if nobody clicks, nobody watches your video. It's that simple."

The pressure is real. A bad title can tank weeks of work. A good one can make an average video perform like a great one.

That's why I built this thing.

The Problem: YouTube Datasets Are Full of Crap

I thought I was being clever. I'd train a model on trending YouTube videos! What could go wrong?

I grabbed a Kaggle dataset with 48,000 trending videos. I was pumped. This was going to be amazing.

Then I looked at the top rows: video after video from Apple, SpaceX, Microsoft.

Oh. These didn't trend because their titles were clever. They trended because of the names behind them. Of course they trended!

If I trained a model on that, it would learn to write boring corporate press releases. That's not what I wanted. I wanted titles that actually worked for independent creators—the ones that spark curiosity, highlight absurd engineering feats, or create conflict. The stuff that gets clicks even when you don't have brand power behind you.

The Solution: Make Another LLM Do the Filtering

I needed a way to filter out all the corporate fluff. But how do you teach a computer to recognize "curiosity" vs "brand power"?

I decided to use Gemini 1.5 Pro as a ruthless curator. I'd feed it each video and ask it to decide: did this trend because of a clever title, or because of brand power?

Here's the prompt I gave it:

The Gemini filtering prompt
You are a hard-ass content strategist.  

Output only: KEEP or REMOVE + 5-word reason.

REMOVE:
- Corporate keynote, live-stream, product drop, deal list, music video, trailer.

KEEP:
- Small channel (<1M) BUT high view-to-sub ratio.
- Title contains curiosity, stakes, controversy, superlative engineering.

It took around 7 minutes and cost me $2.80 total. Not bad! It filtered out the corporate fluff, keeping only videos that actually had clever titles worth learning from.

Now I had my dataset. But wait—I still needed transcripts! You can't train a model to write titles from transcripts without... you know... transcripts.

The Process: How I Actually Built This Thing

Here's what I actually did, step by step:

Step 1: Filter for Science & Tech

First, I filtered the Kaggle dataset to only include Science & Technology videos (category 28). I used pandas in chunks because I didn't want to crash my laptop trying to load 48k rows at once:

Python: Filtering category 28 (Science & Technology)
import pandas as pd

# Stream the 48k-row Kaggle dump in chunks so it never has to fit in memory at once
chunks = pd.read_csv('US_youtube_trending_data.csv', chunksize=100_000)

# Keep only Science & Technology (categoryId 28); videos can trend on multiple
# days, so drop repeat appearances by video_id
cat28 = pd.concat([c[c.categoryId == 28] for c in chunks]).drop_duplicates('video_id')
cat28.to_csv('category_28_videos.csv', index=False)   # 1,425 unique videos

Step 2: Let Gemini Filter Out the Corporate Fluff

I wrote a script that fed each video to Gemini and asked it to decide: KEEP or REMOVE? It looked for titles with curiosity, conflict, or absurd engineering feats—the stuff that works for independent creators.

Here's the actual function I used:

Python: Gemini filtering logic
from typing import Tuple

# FILTERING_CRITERIA holds the prompt shown above;
# get_text_response wraps the Gemini API call (sketched below)
def should_keep_video(title: str, channel_title: str, tags: str) -> Tuple[bool, str]:
    """Uses Gemini to determine if a video should be kept."""
    tags_str = tags if tags and tags != "[None]" else "No tags"

    prompt = f"""
{FILTERING_CRITERIA}

Video Title: "{title}"
Channel: {channel_title}
Tags: {tags_str}

Should this video be KEPT or REMOVED?
"""

    try:
        response = get_text_response(prompt)
        response_upper = response.upper()
        if "REMOVE" in response_upper:
            return False, response
        elif "KEEP" in response_upper:
            return True, response
        else:
            # Default to keeping the video when Gemini's answer is ambiguous
            return True, f"UNCLEAR: {response}"
    except Exception as e:
        # Keep the video on API errors rather than silently dropping data
        return True, f"ERROR: {e}"

Step 3: Get the Transcripts (This Was Annoying)

Now I needed transcripts. I wrote a script that spawned 10 threads, rotated through proxies (because YouTube doesn't like too many requests from one IP), and fetched transcripts for all the videos that passed Gemini's filter.

Some videos didn't have captions. Some had captions in other languages. After filtering, I had 734 videos with clean English transcripts. Good enough!

Python: Concurrent transcript fetching
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

MAX_WORKERS = 10  # one thread per proxy slot

def fetch_transcripts(youtube_urls: list[str]) -> list[dict]:
    """Fetches video data concurrently using a thread pool."""
    all_videos_data = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # process_single_url downloads one video's transcript (sketched below)
        future_to_url = {executor.submit(process_single_url, url): url
                         for url in youtube_urls}

        for future in tqdm(as_completed(future_to_url),
                           total=len(youtube_urls), desc="Processing Videos"):
            result = future.result()
            if result:  # skip videos with no usable transcript
                all_videos_data.append(result)

    return all_videos_data
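
process_single_url does the per-video work and isn't shown above. Here's a rough sketch under two assumptions: the 0.x youtube-transcript-api API, and a hypothetical PROXIES pool of requests-style proxy dicts:

Python: One possible process_single_url (sketch)
import random
from youtube_transcript_api import YouTubeTranscriptApi

# Hypothetical proxy pool; each entry is a requests-style proxies dict
PROXIES = [{"https": "http://proxy1.example:8080"},
           {"https": "http://proxy2.example:8080"}]

def process_single_url(url: str) -> dict | None:
    """Fetches the English transcript for one video, rotating proxies."""
    video_id = url.split("v=")[-1]  # crude; assumes watch?v= URLs
    try:
        segments = YouTubeTranscriptApi.get_transcript(
            video_id, languages=["en"], proxies=random.choice(PROXIES))
        return {"video_id": video_id,
                "transcript": " ".join(seg["text"] for seg in segments)}
    except Exception:
        return None  # no captions, wrong language, or a blocked request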

Step 4: Actually Train the Model

Now for the fun part! I formatted everything into OpenAI's fine-tuning format. I used gpt-4.1-mini because it's way cheaper than GPT-4, and honestly, I wasn't sure if this would even work.

Here's how I structured the training data:

Python: Training data preparation
import json

BASE_MODEL = 'gpt-4.1-mini-2025-04-14'

system_message = {
    "role": "system",
    "content": """You are an expert YouTube title generator. Your task is to 
create compelling, accurate titles that capture the essence of video content 
based on transcripts.

Guidelines:
1. Be accurate and truthful
2. Make it compelling and click-worthy while staying honest
3. Keep titles concise (ideally 50-80 characters, max 100)
4. Use natural language that viewers would search for
5. Highlight the most interesting or valuable aspect
6. Capture curiosity gaps

Generate only the title itself, nothing else."""
}

# Inside a loop over the 734 (transcript, actual_title) pairs:
user_message = {"role": "user", "content": transcript}
assistant_message = {"role": "assistant", "content": actual_title}

# Format as one JSONL line per example for OpenAI fine-tuning
messages_list = [system_message, user_message, assistant_message]
json_line = json.dumps({"messages": messages_list})

I ran it for 3 epochs with a learning rate multiplier of 2.0 (I tested a bunch of values and this worked best). The whole training job finished in 11 minutes and cost me $11.56 total. That's... actually pretty reasonable?

I named the model "title-gen" because I'm creative like that.
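
For reference, launching the job through the OpenAI SDK looks roughly like this; the JSONL filename is an assumption, and the suffix is what sets that model name:

Python: Launching the fine-tuning job (sketch)
from openai import OpenAI

client = OpenAI()

# Upload the training file ("training_data.jsonl" is an assumed filename)
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"), purpose="fine-tune")

# 3 epochs with a 2.0 learning rate multiplier, per the sweep above
job = client.fine_tuning.jobs.create(
    model=BASE_MODEL,
    training_file=training_file.id,
    suffix="title-gen",
    hyperparameters={"n_epochs": 3, "learning_rate_multiplier": 2.0},
)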

Baseline vs Fine-Tuned: See the Difference

Here's what the difference actually looks like. I tested both models on the same video transcript (a review of a humanoid robot called Neo).

The baseline GPT-4.1-mini (no fine-tuning) produces descriptive, factual titles. The fine-tuned model generates more personal, clickable ones. It learned to write titles that feel more human: less like a Wikipedia article, more like something you'd actually click on. Here are some real side-by-side comparisons from my testing interface:

Baseline (GPT-4.1-mini): "Why the $20,000 Humanoid Robot Neo Is More Hype Than Help"
Fine-tuned: "Neo Is NOT the Robot I Wanted (Kinda)"

Baseline (GPT-4.1-mini): "Why the $20,000 Neo Humanoid Robot Might Be More Hype Than Help"
Fine-tuned: "The truth behind this robot NEO..."

Baseline (GPT-4.1-mini): "The Truth About the $20K Humanoid Robot Neo: Promises, Privacy Risks & Limitations"
Fine-tuned: "I'm not Impressed by this Robot..."

The fine-tuned model consistently produces titles that are more conversational, personal, and curiosity-driven. It's matching the tone that actually works for independent creators on YouTube. You can see more examples on my channel.

Did It Actually Work?

My baseline CTR has been hovering around 5%. My worst recent video got 3.4%. I was stuck.

I made a video about a humanoid robot called Neo. I fed the transcript to my fine-tuned model and got 10 title suggestions. One of them caught my eye:

"This $20,000 robot can't do anything"

I went with it.

First 24 hours: 15% CTR. That's triple my baseline! Watch time from subscribers was up 38% compared to my average. TubeBuddy estimates I'll get about 9,000 extra views in the first week.

So yeah, it actually worked! I'm pretty pumped about it.

Since then, I've updated several more videos with titles generated by the model. The results have been remarkably consistent: every video I've changed has seen a CTR increase. Some got modest bumps of 1-2 percentage points, others saw bigger jumps like the robot video. But across the board? It's working. This isn't just luck: the model actually learned something useful.

I wanted to see what titles it would suggest for my entire channel. So I wrote a script that pulled the transcript of every video on my channel, fed each one to the fine-tuned model, and collected the recommended titles (there's a sketch of the core call below).

The whole thing took about 39 seconds and cost me maybe $0.50. Here are some of the recommendations it generated:

Original: "I Visualized My Brainwaves While Listening to Music"
Recommended: "Visualizing My Brain Waves In Real Time"
Original: "AI Orders a Pizza in 17:32 [any% glitchless]"
Recommended: "AI vs Human Pizza Speedrun"
Original: "I Can't Believe How Bad This AI Actually Is"
Recommended: "I let ChatGPT control my browser."
Original: "Half the Internet Broke Because of This"
Recommended: "One person pushed one bug and the whole internet went down"
Original: "I Made an AI Girlfriend"
Recommended: "I programmed an AI girlfriend."
Original: "I Built an AI Mirror. I Regret It."
Recommended: "I built a smart mirror that insults you if you're feeling too confident"

Some of these are genuinely better. The fine-tuned model learned to add specificity ("One person pushed one bug"), and write in a more conversational tone ("I let ChatGPT control my browser" vs "I Can't Believe How Bad This AI Actually Is").

The script saved all the recommendations to a JSON file, so I can review them and update titles whenever I want. Pretty cool for something that took 39 seconds to run.
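
The core of that script is just a chat completion against the fine-tuned model. A minimal sketch; the model ID below is a placeholder for the one the fine-tuning job returns:

Python: Generating title suggestions (sketch)
from openai import OpenAI

client = OpenAI()

# Placeholder ID; the real one is reported when the fine-tuning job finishes
FINE_TUNED_MODEL = "ft:gpt-4.1-mini-2025-04-14:personal:title-gen:placeholder"

def suggest_titles(transcript: str, n: int = 10) -> list[str]:
    """Asks the fine-tuned model for n candidate titles for one transcript."""
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        # system_message: the same system prompt used during training
        messages=[system_message, {"role": "user", "content": transcript}],
        n=n,  # sample n independent completions
    )
    return [choice.message.content for choice in response.choices]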

But here's the thing: I'm only solving half the problem.

The Thing I'm Not Talking About (But Should)

Titles are important. But thumbnails? Thumbnails are JUST as important, if not more.

Think about it—when you're scrolling through YouTube, you see the thumbnail before you even read the title. Your brain processes the image in milliseconds. Bright colors, faces, dramatic visuals—they all matter way more than we give them credit for.

The best-performing videos have titles and thumbnails that work together. They create a cohesive click-bait package. But right now, I'm only generating titles. I'm ignoring half the equation.

Sure, I could manually create thumbnails that match my generated titles. But that's not scalable. And honestly? I'm lazy. I want a system that generates both as a package.

What I Learned

The biggest thing I learned? Fine-tuning isn't just about the model. It's about the dataset.

If I had just trained on all those trending videos without filtering, my model would have learned to write boring corporate titles. Instead, I used another LLM to filter for videos that actually had clever titles—ones that worked because of curiosity, conflict, or absurd engineering, not because of brand power.

That filtering step was the key. It's what made the difference between "fine" and "actually good."

What I'll Do Next

I'm already thinking about improvements. But the big one? The thing I'm really excited about? Building a system that generates titles AND thumbnails together as a package.

Here's my plan: I'll use a vision model (probably GPT-4V or Claude) to analyze thumbnails from those same 734 "viral-by-design" videos. It'll describe what makes them work—bright colors, facial expressions, text placement, visual hierarchy, that kind of thing.
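
As a rough sketch of that analysis step, assuming the OpenAI vision-capable chat API (the model choice and prompt wording are entirely mine):

Python: Describing a thumbnail with a vision model (sketch of the plan)
import base64
from openai import OpenAI

client = OpenAI()

def describe_thumbnail(image_path: str) -> str:
    """Asks a vision model what makes a thumbnail work visually."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe what makes this YouTube thumbnail clickable: "
                         "colors, faces, text placement, visual hierarchy."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content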

Then I'll fine-tune a model that takes a transcript and outputs both a title and a matching thumbnail concept.

The key insight? Titles and thumbnails need to work together. They need to tell the same story. Right now I'm generating titles in isolation, but I should be generating them as part of a cohesive click package.

Will it work? I don't know! But that's half the fun. Sometimes you spend way too much time on something and it turns out to be worth it.

Oh, and by the way: if you clicked on this blog post, the same fine-tuned model came up with its title too. So yeah, it's working on YouTube titles AND blog post titles now. Meta, right?

If you liked this, follow me on X! I'll probably write more about this stuff as I experiment with it.