Hi, I’m George

Currently: exploring training and sample efficiency in LLM pre-training, building small coding LLMs, and working on finance LLM agents.

Previously, I worked at Together AI, running distributed training on hundreds of GPUs and building evals. At Snap, I developed multimodal LLMs that reached 100M+ users, pre-trained diffusion models, built AI applications end to end, and contributed to research. I work across text, image, and speech.

Highlights

  • Large-scale fine-tuning and evaluation of open-source LLMs (1B–405B), with a focus on training efficiency and model quality.
  • Built distributed training platforms (60+ LLMs, 600+ GPUs) and LLM-as-a-judge evaluation pipelines (Flyte-orchestrated).
  • Long-context fine-tuning: custom sequence parallelism with flash_attn_varlen for 131k-context (LLaMA 3.1 70B) and 16k-context (LLaMA 3.1 405B).

See also: Publications

Introduction to parallelism in PyTorch

Training large models inevitably requires a solid understanding of parallelism techniques. In this post, I’ll give a practical, in-depth overview of the most common approaches — DDP, FSDP, and TP — and how they’re actually used in real PyTorch training setups. This article was inspired by the excellent “How to Scale Your Model” blog series. While that series is clear and insightful, I felt it was missing some hands-on perspective and real-world lessons from someone who has trained models in the wild. ...

October 31, 2025 · 21 min · 4269 words · George Grigorev

Tokenization from first principles

Byte-level BPE from first principles: what matters for speed and quality, how to implement it cleanly, and why a SuperBPE variant can lift sample efficiency.

October 7, 2025 · 16 min · 3224 words · George Grigorev