💋 MediGlitterDiaries

⚡ Got Sick of Copy-Pasting, So I Built an Automated ETL Study Pipeline ⚡

📝 The "Elevator Pitch"

If you use active recall to study, you know the struggle. I was spending hours manually copying text from my physics PDFs, pasting it into an AI, begging the AI to format it right, and then copying the results into my Obsidian vault. It was a tedious, mind-numbing loop.

So, I decided to stop doing it. I built a custom Python pipeline that reads my textbooks, extracts high-yield concepts using the Gemini API, formats them as perfect Markdown toggles, and injects them straight into my local Obsidian vault. Now, I just run a script, sit back, and let the machine build my flashcards while I chill.

🛠️ The Tech Stack

The Brain: Google Gemini API, using the gemini-3.1-flash-lite-preview model (configured with a temperature of 0.0 for strict, zero-fluff data extraction).
The Engine: Python (using the pypdf library to loop through the textbook pages in order).
The Database: Obsidian (Receiving raw, perfectly formatted Markdown toggles directly into the local .md files).
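
To make "Markdown toggles" concrete: with Obsidian's fold-indent option on, a bullet nested under another bullet can be collapsed, so a question with its answer indented beneath it behaves like a flashcard. Here is a minimal sketch of that layout. The helper names format_toggle and format_page are hypothetical and not part of my script, and the sample card is made up for illustration.

```python
def format_toggle(question: str, answer: str, indent: str = "    ") -> str:
    """Return one question/answer pair as a nested Markdown bullet."""
    return f"- {question.strip()}\n{indent}- {answer.strip()}"


def format_page(page_number: int, cards: list[tuple[str, str]]) -> str:
    """Build the Markdown block that gets appended to the vault for one page."""
    heading = f"### Questions from Page {page_number}"
    body = "\n".join(format_toggle(q, a) for q, a in cards)
    return f"\n\n{heading}\n{body}"


if __name__ == "__main__":
    # Made-up sample card, only to show the shape of the output
    sample = [("What does Newton's first law state?",
               "A body keeps its state of motion unless a net force acts on it.")]
    print(format_page(1, sample))
```

In the real pipeline the model writes this layout itself. A helper like this is mainly useful for checking what a well-formed card should look like.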

⚙️ How It Works

I structured this as a classic ETL (Extract, Transform, Load) pipeline:

Extract: The script opens my textbook PDF and reads it page by page with pypdf, pulling out the plain text and automatically skipping blank or near-empty pages (anything under 50 characters of text).
Transform: It feeds the raw text to Gemini with a highly specific "Negative Prompt" (telling it exactly what not to do, like ignoring page numbers and historical trivia) and forces it to output strict active-recall toggles.
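Load: Python appends the generated questions to my Obsidian vault after every page. A retry loop around the API call waits a minute whenever a request fails (usually a rate limit) and then tries again, so the script stays within the API's speed limits instead of crashing.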
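
Two of these steps are easy to sketch on their own, without touching the API: the page filter and the retry loop. The helper names usable_pages and call_with_retry are hypothetical. The retry here also swaps in exponential backoff with a cap on attempts, which is a common variation and not what my script does (it uses a fixed one-minute wait).

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def usable_pages(page_texts: Iterable[Optional[str]],
                 min_chars: int = 50) -> Iterator[Tuple[int, str]]:
    """Extract step: yield (1-based page number, text), skipping near-empty pages."""
    for number, text in enumerate(page_texts, start=1):
        if text and len(text.strip()) >= min_chars:
            yield number, text


def call_with_retry(call: Callable[[], T],
                    max_attempts: int = 5,
                    base_delay: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); after each failure wait base_delay, 2x, 4x, ... and retry.

    Re-raises the last error once max_attempts is used up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep in as a parameter is only there so the waiting can be faked in a quick test. In normal use you would leave it at the default and wrap the API call in a lambda, for example call_with_retry(lambda: ask_model(prompt)), where ask_model stands in for whatever function sends the request.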

🎯 Why It Matters

"There are dozens of 'AI Flashcard' apps out there charging $15 to $20 a month. By using Python and Google's free API tier, I completely bypassed the paywalls."

More importantly, I have total control over the output. If the AI misses a concept, I don't have to wait for an app update—I just tweak my prompt and run it again. Learning to automate my own workflow was infinitely more rewarding than just paying for another subscription.

💻 The Script

Here is the core logic of the pipeline. It’s designed to be lightweight and efficient:

from google import genai
from google.genai import types
from pypdf import PdfReader
import time

# 1. Set up the new 2026 client
# Put your API key here
client = genai.Client(api_key="blablablablablabla")

# The new high-speed workhorse model for 2026
current_model = 'gemini-3.1-flash-lite-preview'

pdf_file_path = r"blablablablabla"
obsidian_file_path = r"blablablablablablabla.md"

# Give up on a page after this many failed API calls (avoids an endless loop)
max_attempts = 5

# The prompt lives at module level so its lines are not indented with the loop
PROMPT_TEMPLATE = """
You are a strict exam tutor preparing a student for high-level exams.
Read the provided text and extract ONLY the hard scientific concepts.

CRITICAL 'DO NOT' RULES:
- NO page numbers, chapter titles, or headers.
- NO historical trivia or dates.
- NO conversational filler.

CRITICAL 'MUST DO' RULES:
- Extract definitions, laws, principles, and postulates.
- Extract the "Why" and "How" behind phenomena.
- Format as:
- Question?
    - Answer.

TEXT:
{page_text}
"""

print(f"Opening PDF... Using {current_model}")
reader = PdfReader(pdf_file_path)

with open(obsidian_file_path, "a", encoding="utf-8") as file:
    # pypdf pages are 0-indexed, so i + 1 is the human-readable page number
    for i in range(len(reader.pages)):
        print(f"Reading page {i + 1}...")
        page_text = reader.pages[i].extract_text()

        # Skip blank or near-empty pages
        if not page_text or len(page_text.strip()) < 50:
            continue

        prompt = PROMPT_TEMPLATE.format(page_text=page_text)

        response_text = ""
        for attempt in range(1, max_attempts + 1):
            try:
                print(f"Asking AI for page {i + 1}...")
                # The new 2026 way to call the AI
                response = client.models.generate_content(
                    model=current_model,
                    contents=prompt,
                    # Temperature 0.0 keeps the extraction as strict as possible
                    config=types.GenerateContentConfig(temperature=0.0),
                )
                # response.text can be None (for example, an empty reply)
                response_text = (response.text or "").strip()
                break
            except Exception as e:
                print(f"Waiting for a minute... Error: {e}")
                time.sleep(60)

        if not response_text:
            print(f"Skipping page {i + 1}: no usable answer.")
            continue

        file.write(f"\n\n### Questions from Page {i + 1}\n")
        file.write(response_text)
        file.flush()  # Force saves to your vault immediately
        print(f"Page {i + 1} saved! Pausing briefly...")
        time.sleep(5)

print("Success! Your new chapter is ready in Obsidian.")