
This is the companion page for the Local LLMs workshop run at the College of Staten Island. Everything you need to follow along is on the workshop USB stick. This page covers the same material the deck does, plus the step-by-step USB walkthrough so you can repeat the setup later.
By the end of the workshop you will have a chat app (AnythingLLM) talking to a model server (Ollama) running two open-weight models (Gemma 4 E2B and Qwen 3.5 0.8B) entirely on your own laptop — no internet required after setup.

Who is running this

Ethan Castro

Learning Engineer at Playlab.ai (employee #3). Full-time data science student at Baruch. Former neuro + bio research assistant at Brooklyn College and SUNY Downstate.

Hussam Ali

Full-time Electrical and Computer Engineering student at CSI. Former TSMC Process Engineering intern. Experiments across every layer of the AI stack.

Why this matters

Six reasons we are running this workshop on this campus, in this room, in this decade.
Lever | Why it matters
Empire AI | $500M+ committed by New York State. CUNY is one of seven founding institutions. This is happening with or without us.
CUNY HPCC | Literally in this building. Any CUNY undergrad doing research can request an account — it is also the on-ramp to Empire AI compute.
Career | Applied AI engineer median total comp is $245K. ML infrastructure at staff level is $250-400K.
Data privacy | Every regulated industry — banking, healthcare, law, education — is converging on the same conclusion: data can’t leave the perimeter.
NYC | The Bay Area treats AI like a regional industry. NYC has the talent, the schools, and the capital.
CUNY | CUNY is the largest urban public university in the country. The next wave of AI builders should look like the city they come from.

Cloud LLMs vs. local LLMs

Cloud LLMs

AI models hosted on remote servers by providers like OpenAI, Anthropic, Google, and Microsoft. You send a prompt over the internet; their GPUs run the model; the answer comes back.
Examples: ChatGPT, Claude, Gemini, Copilot.
Hyperscalers (Meta, Microsoft, Google, AWS) are projected to spend $700B+ on AI cloud infrastructure this year.

Local LLMs

AI models downloaded and executed directly on your own computer, laptop, or private server — no cloud round-trip.
Inference engines: vLLM, SGLang, llama.cpp, Ollama, MLX, LM Studio.
Open-weight model families: Llama, Qwen, Gemma, DeepSeek, GLM, Kimi, Nemotron.

Benefits of running models locally

Local models run directly on your personal device. Prompts, notes, code, research, prescriptions, PII, or confidential files do not need to be sent to a company’s cloud server. The model never sees anything you don’t hand it.
Once a model is downloaded, it works without WiFi. Useful in classrooms, labs, on a plane, in low-connectivity areas, and in secure environments where outbound traffic is restricted.
Instead of paying per prompt or for a subscription, you run the model on hardware you already own. That makes experimentation accessible to students and to small teams that can’t expense API spend.
No cloud requests. No round-trips to a data center. Local LLMs use your laptop’s existing power budget instead of remote facilities that need large amounts of electricity, cooling, and water.
You can tune the model’s behavior, point it at your own notes or documents, and build specialized workflows for school, research, coding, or engineering projects.

Open source vs. closed source

Open source / open weight

  • Model weights can be downloaded.
  • Can run locally on your own device or server.
  • More privacy and control.
  • Community can test, fine-tune, and build on top of it.
  • Examples: Llama, Qwen, DeepSeek, Gemma, Mistral.

Closed source / proprietary

  • Accessed through an app or API only.
  • Weights are not public.
  • Easier to use; less control.
  • Data handling and cost depend on the provider.
  • Examples: ChatGPT, Claude, Gemini.

Big models vs. small models

Large (30B – 2T+ parameters)

  • Better general reasoning.
  • Handles more complex tasks.
  • Usually needs cloud GPUs or big servers.
  • More expensive to run.
  • Good for: coding, research, planning, agents.

Small (under 20B parameters)

  • Faster and cheaper to run.
  • Runs on laptops, phones, or modest local servers.
  • Better when focused on one task.
  • Easier to customize or fine-tune.
  • Good for: tutoring, classification, privacy-sensitive tasks, simple assistants.

The current landscape

Two snapshots from Artificial Analysis Intelligence Index v4.0:
  • Leading models by country. The current frontier is split between the United States (Anthropic, OpenAI, Google, Meta) and China (Kimi, MiMo, Qwen, DeepSeek, GLM, MiniMax), with single entries from France (Muse Spark), South Korea, and the UAE.
  • Open weights vs. proprietary. Many of the frontier-quality models scoring in the 50-57 range are now open weight — Kimi K2.6, MiMo, Qwen 3.6, DeepSeek V4 Pro, GLM 5.1, MiniMax M2.7. The cost-of-entry to a strong local model has dropped dramatically.
The Artificial Analysis Intelligence Index v4.0 combines 10 benchmarks: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, and CritPt.

Step-by-step setup

You need two free desktop apps and two open-weight models. Follow these five steps in order. If you’re at the workshop, the USB has everything preloaded — see the USB shortcut below.
1

Download and install Ollama

Open ollama.com/download and grab the installer for your OS (Mac, Windows, or Linux). Run it.
Ollama is the local model server — it runs in the background and exposes a local API that other apps can talk to. After install, you should see the Ollama icon in your menu bar (Mac) or system tray (Windows).
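If you want to sanity-check the install before moving on, you can ask the local server for its version and model list from a terminal. This assumes Ollama’s default local port, 11434, which is what a fresh install uses:
ollama --version
curl http://localhost:11434/api/tags   # lists installed models (empty until step 3)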
2

Download and install AnythingLLM

Open anythingllm.com/desktop and grab the desktop app for your OS. Run the installer.
AnythingLLM is the chat front-end — the part that looks like ChatGPT but talks to your local Ollama instead of OpenAI.
3

Pull the two workshop models

Open a terminal (macOS: Terminal.app; Windows: PowerShell or Command Prompt) and run:
ollama pull gemma4:e2b
ollama pull qwen3.5:0.8b
gemma4:e2b is the main workshop model. qwen3.5:0.8b is the smaller alternate. Both will download from ollama.com/library in the background — total around 4-5 GB.
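Once both pulls finish, you can confirm they are in place and give Gemma a one-line smoke test from the same terminal (the prompt text is just an example):
ollama list
ollama run gemma4:e2b "Say hello in one sentence."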
4

Open AnythingLLM and connect to Ollama

Launch AnythingLLM. In the onboarding screens:
  • LLM provider: Ollama
  • Model: gemma4:e2b
  • Click through the remaining onboarding screens (workspace name, telemetry choice, etc.).
5

Start chatting

You’re now running an AI model entirely on your own laptop. No cloud, no API key, no internet required after this point.
To swap models later, open AnythingLLM’s settings → LLM Preference → pick qwen3.5:0.8b. It only loads into memory when selected.
Want to skip Ollama’s CLI? AnythingLLM can pull models for you. Once Ollama is installed, AnythingLLM will detect it and let you pull gemma4:e2b from inside its UI.
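AnythingLLM is just one client: anything on your laptop can talk to the same local server. As a rough sketch of what that looks like (again assuming Ollama’s default port, 11434), here is a single chat request sent straight from the terminal:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain open-weight models in two sentences."}],
  "stream": false
}'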

Workshop-day USB shortcut

If you’re in the room with us, you don’t need to download anything — the USB stick has all of the above preloaded. Just plug it in and run one file:
1

Open the USB and run START-MAC.command

Double-click START-MAC.command.
If macOS blocks it: right-click → Open → confirm Open in the dialog.
2

Install AnythingLLM and Ollama

Installers from installers/ will open. Drag each app icon into Applications.
Switch back to the Terminal window and press Enter between installs.
3

Wait for the model to extract

The launcher extracts models.tar into Ollama’s model store. AnythingLLM launches automatically when it’s done.
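If the launcher fails partway through, the extraction can be done by hand. This is a minimal sketch of that step, assuming models.tar was packed from an Ollama model store (a models/ folder containing blobs/ and manifests/) and that the stick mounts at /Volumes/WORKSHOP (substitute your USB’s actual name):
mkdir -p ~/.ollama
tar -xf "/Volumes/WORKSHOP/models.tar" -C ~/.ollama
ollama list   # both workshop models should now appear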
4

Configure AnythingLLM

  • LLM provider: Ollama
  • Model: gemma4:e2b
  • Click through the remaining onboarding screens.

What is on the USB

File / folder | What it does
0-START-HERE.txt | Plain-English fallback instructions. Open this first if anything is unclear.
START-MAC.command | One file Mac users double-click — opens installers, extracts models, starts Ollama, launches AnythingLLM.
START-WINDOWS.bat | One file Windows users double-click — runs the PowerShell setup with the right execution policy.
installers/ | Offline installers for AnythingLLM (anythingllm.com/desktop) and Ollama (ollama.com/download) so the room doesn’t fight WiFi.
models.tar | Preloaded Ollama model store — gemma4:e2b and qwen3.5:0.8b so you don’t pull from the internet.
References to Search Through/ | Markdown files for the agent to search — gives you something real to investigate immediately.
WORKSHOP-GUIDE.html | The visual overview of everything on the stick. Open in any browser.

Models on the USB

gemma4:e2b

The main workshop model. Better answers, multimodal demos, agent tasks.
  • Quantization: Q4_K_M
  • Loaded into RAM on demand
  • The default choice for all workshop activities

qwen3.5:0.8b

The smaller alternate model. Useful for showing speed-vs-quality tradeoffs.
  • 0.8B parameters — runs comfortably on modest laptops
  • Loaded only when selected
  • Try it after you’ve used Gemma so the difference is obvious

The agent activity — search local files

The headline hands-on moment of the workshop. Instead of explaining vector databases first, you point the agent at a folder and watch it search.
1

Open AnythingLLM with Gemma loaded

Continue from the setup steps above: Ollama running in the background, Gemma selected as the model.
2

Open the prompt sheet

References to Search Through/FILE-SEARCH-AGENT-PROMPTS.md — this has copy-paste prompts you can run.
3

Point the agent at the references folder

Replace PATH_TO_THIS_FOLDER with the actual path on your machine, then send:
@agent Search PATH_TO_THIS_FOLDER for Gemma 4 and qwen3.5. Read the most
relevant files and compare the two models for this workshop.
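Not sure what the actual path is? You can print it from a terminal; the volume name WORKSHOP below is a placeholder for whatever your USB stick is called (or wherever you copied the folder):
cd "/Volumes/WORKSHOP/References to Search Through"
pwd   # prints the absolute path to paste into the prompt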
4

Ask for an output

The agent can summarize, compare, cite filenames, and create new files from what it found. Try:
  • “Write a one-page summary of how Empire AI access works.”
  • “List the SLURM commands mentioned in the HPCC docs.”
  • “Compare Gemma 4 E2B and Qwen 3.5 0.8B head to head.”

What’s in the references folder

HPCC questions

Ask about CUNY HPCC accounts, SLURM, modules, storage, job submission, and first steps on the cluster.

Empire AI questions

Search Empire AI material for governance, research areas, active projects, and mission context.

Model questions

Compare Gemma, Qwen, and other small open-source models using the bundled benchmark references.

Live demo tasks

Two prompts we run live with the audience to show off multimodal behavior on a tiny local model:

Task 1

“Show me a photo of the College of Staten Island.”
Demonstrates how a multimodal local model handles a request for visual content it doesn’t have, and how it explains its own limits.

Task 2

“Show me a photo of one of the workshop hosts.”
Same exercise with a person — pushes on whether the model has the relevant identity in its training data, and reinforces that local models aren’t omniscient.

Compatibility

Device | Status | Note
Apple Silicon Mac (M1+) | Ready | Uses the bundled Apple Silicon AnythingLLM and Ollama installers.
Windows 10 / 11 x64 | Ready | Uses the bundled Windows installers and setup script.
Intel Mac | Partial | AnythingLLM Intel build is on the stick. The current Ollama DMG is Apple Silicon only, so Intel Mac users may need Ollama already installed.
All devices | Disk space | Use an exFAT-formatted USB, and leave at least 12-15 GB free on the laptop for model extraction.
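To check how much space you actually have before extracting (macOS or Linux; on Windows, look at the drive in File Explorer):
df -h ~   # the "Avail" column for your home volume should show at least 12-15 GB free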

After the workshop

If you keep using local models, the natural next steps are:
  • Run a larger model if your laptop has the RAM. Try gemma3:12b or qwen3:14b from ollama pull (see the sketch after this list).
  • Move to a real GPU via the CSI HPCC or, for bigger work, Empire AI.
  • Wire local models into your IDE — Continue, Cursor, and Aider all support an Ollama endpoint.
  • Read the open-weight model cards on Hugging Face before downloading new models — licenses vary widely.
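For that first step, a minimal sketch, assuming your laptop has 16 GB of RAM or more (the Q4 weights of a 12B-class model alone take roughly 8-9 GB):
ollama pull gemma3:12b
ollama run gemma3:12b "Summarize what SLURM is in two sentences."
ollama ps   # shows which models are loaded and how much memory they are using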
Local LLMs are part of a broader open-source ecosystem. Inference engines, model creators, fine-tuners, and tooling teams all contribute — the workshop is a starting point, not the destination.