Open Source Projects¶

Open source projects relevant to the topics covered in Blueprint for an AI-First Company. Organized by category with descriptions and GitHub URLs.

Frameworks¶

Application frameworks for building LLM-powered systems. Central to Chapter 4 (Infrastructure), Chapter 5 (Building with AI), and Chapter 6 (Agent Architecture).

Project	Description	GitHub
LangChain	Framework for building LLM applications with composable chains, agents, tools, and memory. The most widely adopted LLM orchestration framework.	https://github.com/langchain-ai/langchain
LlamaIndex	Data framework for connecting LLMs to external data sources. Specializes in RAG pipelines, document indexing, and structured data retrieval. RAG adoption jumped from 31% to 51% in one year.	https://github.com/run-llama/llama_index
AutoGen	Microsoft's framework for building multi-agent conversational systems. Agents can collaborate, debate, and execute code autonomously.	https://github.com/microsoft/autogen
CrewAI	Framework for orchestrating multi-agent AI systems with role-based agents, task delegation, and collaborative workflows.	https://github.com/crewAIInc/crewAI
Semantic Kernel	Microsoft's SDK for integrating LLMs into applications. Supports plugins, planners, and memory across C#, Python, and Java.	https://github.com/microsoft/semantic-kernel
Haystack	End-to-end NLP framework by deepset for building search systems, RAG pipelines, and question answering applications.	https://github.com/deepset-ai/haystack
DSPy	Framework from Stanford NLP for programming (not prompting) language models. Optimizes prompts and weights algorithmically.	https://github.com/stanfordnlp/dspy

Open-Weight Models¶

Model families that can be self-hosted, fine-tuned, and deployed on your own infrastructure. Discussed in Chapter 3 (The AI Landscape) and the Foundation Models framework. European banks run open models for regulatory compliance; Shopify runs 40-60M daily inferences on fine-tuned open models.

Project	Description	GitHub / Model Hub
Llama	Meta's open-weight model family. Most widely adopted open model for enterprise deployments. Used by Shopify, European banks, and healthcare organizations for on-premises workloads.	https://github.com/meta-llama/llama
Mistral	European AI lab's open-weight models. Fastest time-to-first-token (0.30s). Strong performance relative to model size.	https://github.com/mistralai/mistral-inference
Phi	Microsoft's small language model family. Designed for resource-constrained environments and edge deployment with strong performance per parameter.	https://github.com/microsoft/phi-3
Gemma	Google DeepMind's open model family. Available in multiple sizes for research and commercial use.	https://github.com/google-deepmind/gemma
Qwen	Alibaba's open-weight model family with strong multilingual capabilities and multiple model sizes.	https://github.com/QwenLM/Qwen

Serving & Inference Tools¶

Tools for running models efficiently in production. Addresses the infrastructure challenges discussed in Chapter 4 and the cost considerations in the Foundation Models framework.

Project	Description	GitHub
vLLM	High-throughput LLM serving engine with PagedAttention for efficient memory management. The standard for production LLM serving.	https://github.com/vllm-project/vllm
Ollama	Run LLMs locally with a simple CLI. Supports Llama, Mistral, Phi, Gemma, and other open models out of the box.	https://github.com/ollama/ollama
LM Studio	Desktop application for discovering, downloading, and running local LLMs with a chat interface. Available for Mac, Windows, and Linux.	https://github.com/lmstudio-ai/lms
Text Generation Inference (TGI)	Hugging Face's production LLM serving solution. Optimized for throughput with continuous batching and tensor parallelism.	https://github.com/huggingface/text-generation-inference
llama.cpp	C/C++ port of Meta's Llama model for efficient CPU and GPU inference. Enables model quantization for running on consumer hardware.	https://github.com/ggerganov/llama.cpp
LocalAI	Drop-in OpenAI API replacement for running models locally. Supports multiple model formats and backends.	https://github.com/mudler/LocalAI

Evaluation & Benchmarking¶

Tools for evaluating model quality and comparing performance. The book emphasizes that benchmarks are "marketing tools masquerading as objective measures" -- these tools help you evaluate on your own data.

Project	Description	GitHub / URL
LMSYS Chatbot Arena	Live human preference evaluation platform where users compare model outputs head-to-head. The book notes Elo differences under 50 points are "basically a toss-up."	https://github.com/lm-sys/FastChat
Open LLM Leaderboard	Hugging Face's automated benchmark for open-weight models across multiple evaluation tasks.	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
HELM (Holistic Evaluation of Language Models)	Stanford's comprehensive evaluation framework testing models across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.	https://github.com/stanford-crfm/helm
Evals	OpenAI's framework for evaluating LLMs with custom evaluation tasks. Extensible for domain-specific testing.	https://github.com/openai/evals
DeepEval	Open source LLM evaluation framework for unit testing LLM outputs with metrics for hallucination, relevance, faithfulness, and toxicity.	https://github.com/confident-ai/deepeval
promptfoo	Open source tool for testing and evaluating LLM outputs across prompts, models, and parameters. Supports automated CI/CD evaluation.	https://github.com/promptfoo/promptfoo

Security & Safety¶

Tools for protecting AI systems against the risks described in Chapter 11 and the 7 AI Risks and Mitigations framework.

Project	Description	GitHub
Garak	LLM vulnerability scanner that probes models for hallucination, data leakage, prompt injection, toxicity, and jailbreaking. Named after the Star Trek character.	https://github.com/NVIDIA/garak
Rebuff	Self-hardening prompt injection detection framework. Uses multi-layered defense including heuristics, LLM analysis, and canary tokens.	https://github.com/protectai/rebuff
LLM Guard	Input/output validation framework for LLMs. Detects prompt injections, scans for PII, checks toxicity, and enforces content policies.	https://github.com/protectai/llm-guard
AI Fairness 360	IBM's toolkit for detecting and mitigating bias in AI models. Referenced in the book's discussion of algorithmic fairness.	https://github.com/Trusted-AI/AIF360
Fairlearn	Microsoft's toolkit for assessing and improving fairness of AI systems. Supports multiple fairness metrics and mitigation algorithms.	https://github.com/fairlearn/fairlearn
NeMo Guardrails	NVIDIA's toolkit for adding programmable guardrails to LLM applications. Controls topics, prevents jailbreaking, and enforces safety policies.	https://github.com/NVIDIA/NeMo-Guardrails

Observability & Monitoring¶

Tools for monitoring AI systems in production. Addresses the observability gap described in the 5 Infrastructure Mistakes framework -- only 51% of organizations can evaluate AI ROI.

Project	Description	GitHub
Phoenix	Arize's open source ML observability tool for monitoring model performance, detecting drift, and debugging LLM applications.	https://github.com/Arize-AI/phoenix
OpenLLMetry	Open source observability framework for LLM applications built on OpenTelemetry. Traces LLM calls, chains, and agent actions.	https://github.com/traceloop/openllmetry
Langfuse	Open source LLM engineering platform for tracing, evaluation, prompt management, and cost tracking.	https://github.com/langfuse/langfuse
whylogs	WhyLabs' open source library for logging and monitoring data and model quality with statistical profiling.	https://github.com/whylabs/whylogs

Data & RAG Infrastructure¶

Tools for building the data pipelines and RAG systems that power AI-first products. Supports the Data Flywheel framework.

Project	Description	GitHub
Chroma	Open source embedding database for AI-native applications. Lightweight and designed for rapid prototyping and production use.	https://github.com/chroma-core/chroma
Weaviate	Open source vector database with built-in vectorization and hybrid search.	https://github.com/weaviate/weaviate
Qdrant	Open source vector similarity search engine with extended filtering and distributed deployment.	https://github.com/qdrant/qdrant
pgvector	PostgreSQL extension for vector similarity search. Store embeddings alongside relational data.	https://github.com/pgvector/pgvector
Unstructured	Open source library for preprocessing and extracting content from documents (PDFs, HTML, images, etc.) for RAG pipelines.	https://github.com/Unstructured-IO/unstructured

Foundation Models Landscape -- Context for open vs. closed model decisions
Build vs Buy Calculus -- When open source tools make sense vs. vendor solutions
5 Infrastructure Mistakes -- Infrastructure patterns these tools help address
7 AI Risks and Mitigations -- Risk categories the security tools help mitigate
7 Failure Modes of Agents -- Agent failure patterns the observability tools help detect