The inputs and outputs are actually static tensors, allocated before any CUDA graph is even created.
As for why the pool is useful:
Say you capture a CUDA graph A: it needs memory to execute, to store activations or workspace tensors for the kernels launched in graph A. This allocated memory is owned by graph A.
Then, if you create another graph B, it will also allocate some memory for its execution. Because CUDA can't be sure you won't run graph A and B at the same time, the memory allocated for graph A and B can't be the same: you would risk data corruption. But if you can guarantee graph A and B won't run at the same time, then there is no reason not to allocate the same memory to graph A and B. That memory shared between graphs A and B is the memory pool.
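To put numbers on that, here is a toy accounting sketch. It is not a real CUDA or framework API: the function names and the sizes (in MiB) are made up for illustration, and the shared case assumes the ideal situation where the pool ends up exactly as large as the biggest graph.

```python
# Toy accounting model, not a real CUDA API. Sizes are hypothetical, in MiB.

def reserved_with_private_memory(graph_sizes):
    """Each graph owns its own memory, so the reservations add up."""
    return sum(graph_sizes.values())


def reserved_with_shared_pool(graph_sizes):
    """Graphs never run at the same time, so they can all reuse one pool.

    Ideal case: the pool only needs to be as large as the biggest graph.
    """
    return max(graph_sizes.values())


graph_sizes = {"A": 300, "B": 200}
print(reserved_with_private_memory(graph_sizes))  # 500
print(reserved_with_shared_pool(graph_sizes))     # 300
```

The saving grows with the number of graphs, since the private case scales with the sum of the sizes while the shared case scales with the maximum.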
And as a side note, if you know which graph is going to need the most memory, you should capture it first. That way, the graph pool has its maximum size right away, and any graph you capture afterwards can always fit inside the pool. Whereas if you capture a graph that requires a small amount of memory and then try to capture a graph that requires more, the second graph can't fit in what the pool already holds, so the pool has to grow and you run the risk of memory fragmentation.
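The effect of capture order can be sketched with a toy model. This is a deliberate simplification, not how any real allocator is implemented: each graph is treated as a single allocation, a graph reuses an existing block if one is large enough, otherwise the pool grows by a new block, and blocks are never resized or merged. The function name and sizes (in MiB) are hypothetical.

```python
# Toy model of a shared graph pool, not a real allocator. Sizes are in MiB.

def pool_size_after_captures(capture_order):
    """Return the total pool size after capturing graphs in the given order."""
    blocks = []  # sizes of the blocks currently held by the pool
    for need in capture_order:
        # Graphs never run concurrently, so every existing block is free
        # to be reused by the graph being captured now.
        if not any(block >= need for block in blocks):
            # Nothing in the pool is big enough: the pool has to grow.
            blocks.append(need)
    return sum(blocks)


print(pool_size_after_captures([300, 200, 100]))  # 300
print(pool_size_after_captures([100, 200, 300]))  # 600
```

In this model, capturing the largest graph first keeps the pool at 300 MiB, while capturing in increasing order leaves smaller blocks behind that the larger graphs can't use.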