Using Generative AI to Build Generative AI


On May 8, O’Reilly Media will be hosting Coding with AI: The End of Software Development as We Know It, a live virtual tech conference spotlighting how AI is already supercharging developers, boosting productivity, and providing real value to their organizations. If you’re in the trenches building tomorrow’s development practices today and interested in speaking at the event, we’d love to hear from you by March 12. You can find more information and our call for presentations here. Just want to attend? Register for free here.


Hi, I am a professor of cognitive science and design at UC San Diego, and I recently wrote posts on Radar about my experiences coding with and speaking to generative AI tools like ChatGPT. In this post I want to talk about using generative AI to extend one of my academic software projects—the Python Tutor tool for learning programming—with an AI chat tutor. We often hear about GenAI being used in large-scale commercial settings, but we don’t hear as much about smaller-scale not-for-profit projects. Thus, this post serves as a case study of adding generative AI into a personal project where I didn’t have much time, resources, or expertise at my disposal. Working on this project got me really excited about being here at this moment right as powerful GenAI tools are starting to become more accessible to nonexperts like myself.


For some context, over the past 15 years I’ve been operating Python Tutor (https://pythontutor.com/), a free online tool that tens of millions of people around the world have used to write, run, and visually debug their code (first in Python and now also in Java, C, C++, and JavaScript). Python Tutor is mainly used by students to understand and debug their homework assignment code step-by-step by seeing its call stack and data structures. Think of it as a virtual teacher who draws diagrams to show runtime state on a whiteboard. It’s best suited for small pieces of self-contained code that students commonly encounter in computer science classes or online coding tutorials.

Here’s an example of using Python Tutor to step through a recursive function that builds up a linked list of Python tuples. At the current step, the visualization shows two recursive calls to the listSum function and various pointers to list nodes. You can move the slider forward and backward to see how this code runs step-by-step:

AI Chat for Python Tutor’s Code Visualizer

Way back in 2009 when I was a grad student, I envisioned creating Python Tutor to be an automated tutor that could help students with programming questions (which is why I chose that project name). But the problem was that AI wasn’t nearly good enough back then to emulate a human tutor. Some AI researchers were publishing papers in the field of intelligent tutoring systems, but there were no widely accessible software libraries or APIs that could be used to make an AI tutor. So instead I spent all those years working on a versatile code visualizer that could be *used* by human tutors to explain code execution.

Fast-forward 15 years to 2024, and generative AI tools like ChatGPT, Claude, and many others based on LLMs (large language models) are now really good at holding human-level conversations, especially about technical topics related to programming. In particular, they’re great at generating and explaining small pieces of self-contained code (e.g., under 100 lines), which is exactly the target use case for Python Tutor. So with this technology in hand, I used these LLMs to add AI-based chat to Python Tutor. Here’s a quick demo of what it does.

First I designed the user interface to be as simple as possible. It’s just a chat box below the user’s code and visualization:

There’s a dropdown menu of templates to get you started, but you can type in any question you want. When you click “Send,” the AI tutor will send your code, current visualization state (e.g., call stack and data structures), terminal text output, and question to an LLM, which will respond with something like:

Note how the LLM can “see” your current code and visualization, so it can explain to you what’s going on here. This emulates what an expert human tutor would say. You can then continue chatting back-and-forth like you would with a human.

In addition to explaining code, another common use case for this AI tutor is helping students get unstuck when they encounter a compiler or runtime error, which can be very frustrating for beginners. Here’s an index out-of-bounds error in Python:

Whenever there’s an error, the tool automatically populates your chat box with “Help me fix this error,” but you can pick a different question from the dropdown (shown expanded above). When you hit “Send” here, the AI tutor responds with something like:

Note that when the AI generates code examples, there’s a “Visualize Me” button underneath each one so that you can directly visualize it in Python Tutor. This allows you to visually step through its execution and ask the AI follow-up questions about it.

Besides asking specific questions about your code, you can also ask general programming questions or even career-related questions like how to prepare for a technical coding interview. For instance:

… and it will generate code examples that you can visualize without leaving the Python Tutor website.

Benefits over Directly Using ChatGPT

The obvious question here is: What are the benefits of using AI chat within Python Tutor rather than pasting your code and question into ChatGPT? I think there are a few main benefits, especially for Python Tutor’s target audience of beginners who are just starting to learn to code:

1) Convenience – Millions of students are already writing, compiling, running, and visually debugging code within Python Tutor, so it feels very natural for them to also ask questions without leaving the site. If instead they need to select their code from a text editor or IDE, copy it into another site like ChatGPT, and then maybe also copy their error message, terminal output, and describe what is going on at runtime (e.g., values of data structures), that’s a much more cumbersome user experience. Some modern IDEs do have AI chat built in, but those require expertise to set up since they’re meant for professional software developers. In contrast, the main appeal of Python Tutor for beginners has always been its ease of access: Anyone can go to pythontutor.com and start coding right away without installing software or creating a user account.

2) Beginner-friendly LLM prompts – Next, even if someone were to go through the trouble of copy-pasting their code, error message, terminal output, and runtime state into ChatGPT, I’ve found that beginners aren’t good at coming up with prompts (i.e., written instructions) that direct LLMs to produce easily understandable responses. Python Tutor’s AI chat addresses this problem by augmenting chats with a system prompt like the following to emphasize directness, conciseness, and beginner-friendliness:

You are an expert programming teacher and I am a student asking you for help with ${LANGUAGE}.
– Be concise and direct. Keep your response under 300 words if possible.
– Write at the level that a beginner student in an introductory programming class can understand.
– If you need to edit my code, make as few changes as needed and preserve as much of my original code as possible. Add code comments to explain your changes.
– Any code you write should be self-contained and runnable without importing external libraries.
– Use GitHub Flavored Markdown.

It also formats the user’s code, error message, relevant line numbers, and runtime state in a well-structured way for LLMs to ingest. Lastly, it provides a dropdown menu of common questions and requests like “What does this error message mean?” and “Explain what this code does line-by-line.” so beginners can start crafting a question right away without staring at a blank chat box. All of this behind-the-scenes prompt templating helps users to avoid common problems with directly using ChatGPT, such as it generating explanations that are too wordy, jargon-filled, and overwhelming for beginners.
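As a rough sketch, this kind of behind-the-scenes prompt templating might look like the following. The function name, message layout, and field choices here are my own illustrative assumptions, not Python Tutor’s actual code:

```python
# Hypothetical sketch of beginner-friendly prompt templating for an LLM chat
# backend. All names here are illustrative assumptions.

SYSTEM_PROMPT_TEMPLATE = (
    "You are an expert programming teacher and I am a student asking you "
    "for help with {language}.\n"
    "- Be concise and direct. Keep your response under 300 words if possible.\n"
    "- Write at the level that a beginner student in an introductory "
    "programming class can understand."
)

def build_messages(language, code, error_message, terminal_output,
                   runtime_state, question):
    """Assemble an OpenAI-style messages list bundling the user's context."""
    context_parts = [f"My {language} code:\n```\n{code}\n```"]
    if error_message:
        context_parts.append(f"Error message:\n{error_message}")
    if terminal_output:
        context_parts.append(f"Terminal output:\n{terminal_output}")
    if runtime_state:
        context_parts.append(f"Current runtime state:\n{runtime_state}")
    context_parts.append(f"My question: {question}")
    return [
        {"role": "system",
         "content": SYSTEM_PROMPT_TEMPLATE.format(language=language)},
        {"role": "user", "content": "\n\n".join(context_parts)},
    ]
```

Bundling the system prompt with the user’s code, error, and runtime state into a single well-structured message is what lets a beginner just type a one-line question and still get a grounded answer.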

3) Running your code instead of just “looking” at it – Lastly, if you paste your code and question into ChatGPT, it “inspects” your code by reading over it like a human tutor would do. But it doesn’t actually run your code, so it doesn’t know what function calls, variables, and data structures really are during execution. While modern LLMs are good at guessing what code does by “looking” at it, there’s no substitute for running code on a real computer. In contrast, Python Tutor runs your code so that when you ask AI chat about what’s going on, it sends the actual values of the call stack, data structures, and terminal output to the LLM, which again hopefully results in more helpful responses.

Using Generative AI to Build Generative AI

Now that you’ve seen how Python Tutor’s AI chat works, you might be wondering: Did I use generative AI to help me build this GenAI feature? Yes and no. GenAI helped me most when I was getting started, but as I got deeper in I found less of a use for it.

Using Generative AI to Create a Mock-Up User Interface

My approach was to first build a stand-alone web-based LLM chat app and later integrate it into Python Tutor’s codebase. In November 2024, I bought a Claude Pro subscription since I heard good buzz about its code generation capabilities. I began by working with Claude to generate a mock-up user interface for an LLM chat app with familiar features like a user input box, text bubbles for both the LLM and human user’s chats, HTML formatting with Markdown, syntax-highlighted code blocks, and streaming the LLM’s response incrementally instead of making the user wait until it finished. None of this was innovative; it’s what everyone expects from using an LLM chat interface like ChatGPT.

I liked working with Claude to build this mock-up because it generated live runnable versions of HTML, CSS, and JavaScript code so I could interact with it in the browser without copying the code into my own project. (Simon Willison wrote a great post on this Claude Artifacts feature.) However, the main downside is that whenever I requested even a small code tweak, it would take up to a minute or so to regenerate all the project code (and sometimes annoyingly leave parts as incomplete […] segments, which made the code not run). If I had instead used an AI-powered IDE like Cursor or Windsurf, then I would’ve been able to ask for instant incremental edits. But I didn’t want to bother setting up more complex tooling, and Claude was good enough for getting my web frontend started.

A False Start by Locally Hosting an LLM

Now onto the backend. I originally started this project after playing with Ollama on my laptop, which is an app that allowed me to run LLMs locally for free without having to pay a cloud provider. A few months earlier (September 2024) Llama 3.2 had come out, which featured smaller models like 1B and 3B (1 and 3 billion parameters, respectively). These are much less powerful than state-of-the-art models, which are 100 to 1,000 times bigger at the time of writing. I had no hope of running larger models locally (e.g., Llama 405B), but these smaller 1B and 3B models ran fine on my laptop, so they seemed promising.

Note that the last time I tried running an LLM locally was GPT-2 (yes, 2!) back in 2021, and it was TERRIBLE: a pain to set up by installing a bunch of Python dependencies, super slow to run, and producing nonsensical results. So for years I didn’t think it was feasible to self-host my own LLM for Python Tutor. And I didn’t want to pay to use a cloud API like ChatGPT or Claude since Python Tutor is a not-for-profit project on a shoestring budget; I couldn’t afford to provide a free AI tutor for over 10,000 daily active users while eating all the expensive API costs myself.

But now, three years later, the combination of smaller LLMs and Ollama’s ease of use convinced me that the time was right for me to self-host my own LLM for Python Tutor. So I used Claude and ChatGPT to help me write some boilerplate code to connect my prototype web chat frontend with a Node.js backend that called Ollama to run Llama 1B/3B locally. Once I got that demo working on my laptop, my goal was to host it on a few university Linux servers that I had access to.

But barely one week in, I got bad news that ended up being a huge blessing in disguise. Our university IT folks told me that I wouldn’t be able to access the few Linux servers with enough CPUs and RAM needed to run Ollama, so I had to scrap my original plans for self-hosting. Note that the kind of low-cost server I wanted to deploy on didn’t have GPUs, so they ran Ollama much more slowly on their CPUs. But in my initial tests a small model like Llama 3.2 3B still ran fine for a few concurrent requests, producing a response within 45 seconds for up to four concurrent users. This isn’t “good” by any measure, but it’s the best I could do without paying for a cloud LLM API, which I was afraid to do given Python Tutor’s sizable userbase and small budget. I figured if I had, say, four replica servers, then I could serve up to 16 concurrent users within 45 seconds, or maybe 8 concurrent users within 20 seconds (rough estimates). That wouldn’t be the best user experience, but again Python Tutor is free for users, so their expectations can’t be sky-high. My plan was to write my own load-balancing code to send incoming requests to the lowest-load server, and queuing code so that if there were more concurrent users trying to connect than a server had capacity for, it would queue them up to avoid crashes. Then I would need to write all the sysadmin/DevOps code to monitor these servers, keep them up-to-date, and reboot them if they failed. This was all a daunting prospect to code up and test robustly, especially because I’m not a professional software developer. But to my relief, now I didn’t have to do any of that grind since the university server plan was a no-go.
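For illustration, the lowest-load routing and queuing I was planning (and thankfully never had to build) might have looked roughly like this toy in-memory version. The class and method names are my own assumptions:

```python
from collections import deque

# Toy sketch of lowest-load routing with a queue for overflow requests.
# A real deployment would also need health checks, retries, and persistence.

class LoadBalancer:
    def __init__(self, server_names, capacity_per_server):
        self.capacity = capacity_per_server
        self.load = {name: 0 for name in server_names}
        self.waiting = deque()  # requests queued when every server is full

    def dispatch(self, request_id):
        """Route to the lowest-load server, or queue and return None."""
        server = min(self.load, key=self.load.get)
        if self.load[server] < self.capacity:
            self.load[server] += 1
            return server
        self.waiting.append(request_id)
        return None

    def finish(self, server):
        """Mark a request done; hand the freed slot to a queued request."""
        self.load[server] -= 1
        if self.waiting:
            next_id = self.waiting.popleft()
            self.load[server] += 1
            return next_id, server
        return None
```

Even this toy version hints at the operational burden: the real thing would need monitoring, crash recovery, and careful concurrency handling, which is exactly the grind the cloud API spared me from.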

Switching to the OpenRouter Cloud API

So what did I end up using instead? Serendipitously, around this time someone pointed me to OpenRouter, which is an API that allows me to write code once and access a variety of paid LLMs by simply changing the LLM name in a configuration string. I signed up, got an API key, and started making queries to Llama 3B in the cloud within minutes. I was shocked by how easy this code was to set up! So I quickly wrapped it in a server backend that streams the LLM’s response text in real time to my frontend using SSE (server-sent events), which displays it in the mock-up chat UI. Here’s the essence of my Python backend code:

import openai  # OpenRouter uses the OpenAI API
client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=<your API key>
)
completion = client.chat.completions.create(
    model=<name of LLM, such as Llama 3.2 3B>,
    messages=<your query to the LLM>,
    stream=True
)
for chunk in completion:
    text = chunk.choices[0].delta.content
    <stream text to web frontend using Server-Sent Events>
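
The last step of that loop, streaming text to the frontend, can be sketched as follows. This is a minimal illustration of the SSE wire format that the browser’s EventSource API expects, not my actual server code:

```python
import json

# Minimal sketch of server-sent events (SSE) framing: each streamed LLM text
# chunk becomes one "data:" frame. This shows only the wire format, not a
# full web server.

def sse_frame(text_chunk):
    """Encode one chunk of LLM output as an SSE frame."""
    # JSON-encode so newlines inside the chunk can't break SSE framing.
    return f"data: {json.dumps({'text': text_chunk})}\n\n"

def stream_response(chunks):
    """Yield SSE frames for each chunk, then an end-of-stream marker."""
    for chunk in chunks:
        if chunk:  # streamed deltas can be None or empty
            yield sse_frame(chunk)
    yield "data: [DONE]\n\n"
```

In a real backend these frames would be yielded from a streaming HTTP response handler; on the frontend, an EventSource listener appends each decoded chunk to the chat bubble as it arrives.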

OpenRouter does cost money, but I was willing to give it a shot since the prices for Llama 3B looked much more reasonable than state-of-the-art models like ChatGPT or Claude. At the time of writing, 3B is about $0.04 USD per million tokens, and a state-of-the-art LLM costs up to 500x as much (ChatGPT-4o is $20 and Claude 3.7 Sonnet is $18). I would be scared to use ChatGPT or Claude at those prices, but I felt comfortable with the much cheaper Llama 3B. What also gave me comfort was knowing I wouldn’t wake up with a giant bill if there were a sudden spike in usage; OpenRouter lets me put in a fixed amount of money, and if that runs out my API calls simply fail instead of charging my credit card more.

For some extra peace of mind I implemented my own rate limits: 1) Each user’s input and full chat conversations are limited to a certain length to keep costs under control (and to reduce hallucinations since smaller LLMs tend to go “off the rails” as conversations grow longer); 2) Each user can send only one chat per minute, which again prevents overuse. Hopefully this isn’t a big problem for Python Tutor users since they need at least a minute to read the LLM’s response, try out suggested code fixes, then ask a follow-up question.
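As a sketch, those two limits might be implemented like this. The constants and class name are illustrative choices of mine, not the site’s actual values:

```python
import time

# Hypothetical sketch of the two rate limits described above: cap the
# conversation length and allow each user at most one chat per minute.

MAX_CONVERSATION_CHARS = 8000   # illustrative cap, not the real limit
MIN_SECONDS_BETWEEN_CHATS = 60

class RateLimiter:
    def __init__(self, clock=time.time):
        self.clock = clock           # injectable clock makes this testable
        self.last_chat = {}          # user_id -> timestamp of last chat

    def allow(self, user_id, conversation_text):
        """Return True if this chat may proceed under both limits."""
        if len(conversation_text) > MAX_CONVERSATION_CHARS:
            return False
        now = self.clock()
        last = self.last_chat.get(user_id)
        if last is not None and now - last < MIN_SECONDS_BETWEEN_CHATS:
            return False
        self.last_chat[user_id] = now
        return True
```

The length cap does double duty: it bounds the per-query token cost and keeps conversations short enough that a small model is less likely to drift.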

Using OpenRouter’s cloud API instead of self-hosting on my university’s servers turned out to be so much better since: 1) Python Tutor users can get responses within only a few seconds instead of waiting 30-45 seconds; 2) I didn’t need to do any sysadmin/DevOps work to maintain my servers, or to write my own load balancing or queuing code to interface with Ollama; 3) I can easily try different LLMs by changing a configuration string.

GenAI as a Thought Partner and On-Demand Teacher

After getting the “happy path” working (i.e., when OpenRouter API calls succeed), I spent a bunch of time thinking about error conditions and making sure my code handled them well since I wanted to provide a good user experience. Here I used ChatGPT and Claude as a thought partner by having GenAI help me come up with edge cases that I hadn’t originally considered. I then created a debugging UI panel with a dozen buttons below the chat box that I could press to simulate specific errors in order to test how well my app handled those cases:

After getting my stand-alone LLM chat app working robustly in error cases, it was time to integrate it into the main Python Tutor codebase. This process took a lot of time and elbow grease, but it was straightforward since I made sure to have my stand-alone app use the same versions of older JavaScript libraries that Python Tutor was using. This meant that at the start of my project I had to instruct Claude to generate mock-up frontend code using those older libraries; otherwise by default it would use modern JavaScript frameworks like React or Svelte that would not integrate well with Python Tutor, which is written using 2010-era jQuery and friends.

At this point I found myself not really using generative AI day-to-day since I was working within the comfort zone of my own codebase. GenAI was useful at the start to help me figure out the “unknown unknowns.” But now that the problem was well-scoped I felt much more comfortable writing every line of code myself. My daily grind from this point onward involved a lot of UI/UX polishing to make a smooth user experience. And I found it easier to directly write code rather than think about how to instruct GenAI to code it for me. Also, I wanted to understand every line of code that went into my codebase since I knew that every line would need to be maintained perhaps years into the future. So even if I could have used GenAI to code faster in the short term, that may come back to haunt me later in the form of subtle bugs that arise because I didn’t fully understand the implications of AI-generated code.

That said, I still found GenAI useful as a replacement for Google or Stack Overflow sorts of questions like “How do I write X in modern JavaScript?” It’s an incredible resource for learning technical details on the fly, and I sometimes adapted the example code in AI responses into my codebase. But at least for this project, I didn’t feel comfortable having GenAI “do the driving” by generating large swaths of code that I’d copy-paste verbatim.

Finishing Touches and Launching

I wanted to launch by the new year, so as November rolled into December I was making steady progress getting the user experience more polished. There were a million little details to work through, but that’s the case with any nontrivial software project. I didn’t have the resources to evaluate how well smaller LLMs perform on real questions that users might ask on the Python Tutor website, but from informal testing I was dismayed (but not surprised) at how often the 1B and 3B models produced incorrect explanations. I tried upgrading to a Llama 8B model, and it was still not amazing. I held out hope that tweaking my system prompt would improve performance. I didn’t spend a ton of time on it, but my initial belief was that no amount of tweaking could make up for the fact that a smaller model is just less capable, like a dog brain compared to a human brain.

Fortunately in late December, only two weeks before launch, Meta released a new Llama 3.3 70B model. I was running out of time, so I took the easy way out and switched my OpenRouter configuration to use it. My AI tutor’s responses immediately got better and made fewer mistakes, even with my original system prompt. I was nervous about the 10x price increase from 3B to 70B ($0.04 to $0.42 per million tokens) but gave it a shot anyway.

Parting Thoughts and Lessons Learned

Fast-forward to the present. It’s been two months since launch, and costs are reasonable so far. With my strict rate limits in place, Python Tutor users are making around 2,000 LLM queries per day, which costs less than a dollar each day using Llama 3.3 70B. And I’m hopeful that I can switch to more powerful models as their prices drop over time. In sum, it’s super satisfying to see this AI chat feature live on the site after dreaming about it for nearly 15 years since I first created Python Tutor long ago. I love how cloud APIs and low-cost LLMs have made generative AI accessible to nonexperts like myself.

Here are some takeaways for those who want to play with GenAI in their own apps:

  • I highly recommend using a cloud API provider like OpenRouter rather than self-hosting LLMs on your own VMs or (even worse) buying a physical machine with GPUs. It’s infinitely cheaper and more convenient to use the cloud here, especially for personal-scale projects. Even with thousands of queries per day, Python Tutor’s AI costs are tiny compared to paying for VMs or physical machines.
  • Waiting helped! It’s good to not be on the bleeding edge all the time. If I had attempted to do this project in 2021 during the early days of the OpenAI GPT-3 API like early adopters did, I would’ve faced a lot of pain working around rough edges in fast-changing APIs; easy-to-use instruction-tuned chat models didn’t even exist back then! Also, there would not be any online docs or tutorials about best practices, and (very meta!) LLMs back then would not know how to help me code using those APIs since the necessary docs weren’t available for them to train on. By simply waiting a few years, I was able to work with high-quality stable cloud APIs and get useful technical help from Claude and ChatGPT while coding my app.
  • It’s fun to play with LLM APIs rather than using the web interfaces like most people do. By writing code with these APIs you can intuitively “feel” what works well and what doesn’t. And since these are ordinary web APIs, you can integrate them into projects written in whatever programming language your project is already using.
  • I’ve found that a short, direct, and simple system prompt with a larger LLM will beat elaborate system prompts with a smaller LLM. Shorter system prompts also mean that each query costs you less money (since they must be included in every query).
  • Don’t worry about evaluating output quality if you don’t have the resources to do so. Come up with a few handcrafted tests and run them as you’re developing; in my case it was tricky pieces of code that I wanted to ask Python Tutor’s AI chat to help me fix. If you stress too much about optimizing LLM performance, then you’ll never ship anything! And if you find yourself yearning for better quality, upgrade to a larger LLM first rather than tediously tweaking your prompt.
  • It’s very hard to estimate how much running an LLM will cost in production since costs are calculated per million input/output tokens, which isn’t intuitive to reason about. The best way to estimate is to run some test queries, get a sense of how wordy the LLM’s responses are, then look at your account dashboard to see how much each query cost you. For instance, does a typical query cost 1/10 cent, 1 cent, or multiple cents? No way to find out unless you try. My hunch is that it probably costs less than you imagine, and you can always implement rate limiting or switch to a lower-cost model later if cost becomes a concern.
  • Related to the above, if you’re making a prototype or something where only a small number of people will use it at first, then definitely use the best state-of-the-art LLM to show off the most impressive results. Price doesn’t matter much since you won’t be issuing that many queries. But if your app has a fair number of users like Python Tutor does, then pick a smaller model that still performs well for its price. For me it seems like Llama 3.3 70B strikes that balance in early 2025. But as new models come onto the scene, I’ll reevaluate those price-to-performance trade-offs.
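To make per-query costs concrete, here’s a tiny back-of-the-envelope estimator using the $0.42 per million tokens figure quoted earlier for Llama 3.3 70B. It assumes one flat price per token; real providers often price input and output tokens separately, so treat this as a rough sketch:

```python
# Back-of-the-envelope LLM cost estimator. Assumes a single flat price per
# million tokens (the $0.42/M figure for Llama 3.3 70B quoted above); real
# providers often price input and output tokens at different rates.

PRICE_PER_MILLION_TOKENS = 0.42  # USD, Llama 3.3 70B via OpenRouter (early 2025)

def query_cost(input_tokens, output_tokens,
               price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimated USD cost of one query."""
    return (input_tokens + output_tokens) / 1_000_000 * price_per_million

def daily_cost(queries_per_day, avg_tokens_per_query,
               price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimated USD cost per day at a given query volume."""
    return queries_per_day * query_cost(avg_tokens_per_query, 0,
                                        price_per_million)
```

For example, 2,000 queries per day at roughly 1,000 total tokens each comes out to about $0.84 per day, which lines up with the “less than a dollar each day” figure above.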