Speeding up LLM inference with parallelism