Hi, I’m George

Currently: exploring training and sample efficiency in LLM pre-training, building small coding LLMs, and working on finance LLM agents.

Previously, I worked at Together AI, running distributed training on hundreds of GPUs and building evals. At Snap, I developed multimodal LLMs that reached 100M+ users, pre-trained diffusion models, built AI applications end to end, and contributed to research. I work across text, image, and speech.

Highlights

  • Large-scale fine-tuning and evaluation of open-source LLMs (1B–405B), with a focus on training efficiency and model quality.
  • Built distributed training platforms (60+ LLMs, 600+ GPUs) and LLM-as-a-judge evaluation pipelines (Flyte-orchestrated).
  • Long-context fine-tuning: custom sequence parallelism with flash_attn_varlen for 131k-context (LLaMA 3.1 70B) and 16k-context (LLaMA 3.1 405B).

See also: Publications

Introduction to parallelism in PyTorch

Training large models inevitably requires a solid understanding of parallelism techniques. In this post, I’ll give a practical, in-depth overview of the most common approaches — DDP, FSDP, and TP — and how they’re actually used in real PyTorch training setups. This article was inspired by the excellent “How to Scale Your Model” blog series. While that series is clear and insightful, I felt it was missing some hands-on perspective and real-world lessons from someone who has trained models in the wild. ...

October 31, 2025 · 21 min · 4269 words · George Grigorev

Tokenization from first principles

Byte-level BPE from first principles: what matters for speed and quality, how to implement it cleanly, and why a SuperBPE variant can lift sample efficiency.

October 7, 2025 · 16 min · 3224 words · George Grigorev