This has come up often enough on the Discord to justify a comprehensive post on the matter.
People like to say that PyGame is slow, and that you should use a fast programming language like C++, and a “real” engine like Unity3D or UE4. People tell that to beginners too, but beginners can write slow code in any language, in any game engine, yet they quickly start to parrot "PyGame is slow" without understanding why that is and when performance matters.
Something like “Python is slow” or “PyGame is slow” only helps if you understand what makes other game engines fast.
If your game is running fast enough, if you get a stable 60 FPS with some CPU cycles to spare, then you shouldn't waste your time optimising. PyGame might be slower than UE4, but it is fast enough often enough, and I happily trade some speed for the convenience of writing Python. Especially in web development, where correctness and development time are more important than speed, people happily make the same trade-off for business-critical software.
If you want to write a clone of Pong, or Tetris, or Flappy Bird, then PyGame is fast enough. If you are making a point-and-click adventure, or a top-down RPG, or a Sokoban-like game, or a narrative game, then PyGame is also fast enough, but you may want to consider using AGS, RPG Maker, PuzzleScript, or Twine.
Python is a dynamic, interpreted language. That has many advantages for rapid development. You can get the type of a value at run time, you can always add a new instance variable to an object, you can use dir() to list the variable names in a module, and you can type an expression into the Python shell and evaluate it interactively.
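As a small illustration of the dynamic features listed above, here is a minimal sketch. The Player class and its attributes are made up for this example.

```python
import math


class Player:
    """A hypothetical game object, used only for illustration."""

    def __init__(self, name):
        self.name = name


player = Player("hero")

# Get the type of a value at run time.
print(type(player).__name__)   # Player
print(type(3.5).__name__)      # float

# Add a new instance variable to an existing object.
player.hit_points = 10
print(player.hit_points)       # 10

# List the names defined in a module.
print("sqrt" in dir(math))     # True
```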
The CPython interpreter (the default Python implementation you can download from python.org) achieves this level of dynamism in a rather straightforward way, but at the cost of run-time performance. When a simple line like x += 1 is executed in Python, the interpreter needs to look up x in some dictionary of variables and values, increment the refcount of the value of x, see if x is a number or if it needs to invoke the __add__(self, other) method, add the numbers, increment the refcount of the result, decrement the refcount of the old value of x twice, and assign the new value to x in the dictionary of local variables.
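You can look at some of this work yourself with the dis module from the standard library. The sketch below disassembles a tiny function and shows that += can end up calling user code. The increment function and the Counter class are hypothetical names for this example, and the exact instruction names depend on your Python version. Note that inside a function CPython keeps local variables in numbered slots, so the dictionary lookup described above applies to names at module level.

```python
import dis


def increment(x):
    x += 1
    return x


class Counter:
    """Hypothetical class showing that += can dispatch to user code."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Called for `counter += 1` because no __iadd__ is defined.
        return Counter(self.value + other)


# Print the bytecode that CPython executes for the function body.
# The exact instruction names differ between Python versions.
dis.dis(increment)

instructions = list(dis.get_instructions(increment))
print(len(instructions), "bytecode instructions")

counter = Counter(41)
counter += 1
print(counter.value)  # 42
```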
In C, if x is an int, code like x++; would in all likelihood compile into a single instruction, and if x is already loaded into a register, and the pipelining of the CPU works, it might be executed within a single CPU cycle.
Of course, the compiled binary has no idea that the variable is called x and that its type is int.
All this makes pure Python between 10 and 100 times slower than pure C, but in practice, this difference is not quite as big. Many Python functions like list.sort() or re.search() are implemented in C, so even if the function call overhead is 10 times slower, the time spent in the body of the function would be roughly the same.
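A rough way to see this on your own machine is to time a loop written in Python against a built-in that does the same work in C. This is only a sketch: python_sum is a name invented for this example, and the numbers you get will depend on your hardware and Python version.

```python
import timeit


def python_sum(values):
    """Add up the values with a loop that runs in the interpreter."""
    total = 0
    for value in values:
        total += value
    return total


numbers = list(range(100_000))

# Both calls compute the same result ...
assert python_sum(numbers) == sum(numbers)

# ... but the built-in spends its time in C rather than in bytecode.
loop_time = timeit.timeit(lambda: python_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)
print(f"Python loop:  {loop_time:.4f} s")
print(f"built-in sum: {builtin_time:.4f} s")
```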
PyGame is implemented in C, and uses the cross-platform C library SDL2 (which achieves its cross-platformness by implementing platform-specific functionality again and again for each supported processor architecture, operating system or game console). The performance-critical parts of PyGame are not implemented in Python. This makes PyGame reasonably fast.
PyGame doesn't use the GPU for drawing. Instead of textures, vertices, and polygons on the GPU, PyGame uses 2D pixel buffers (called “surfaces” in SDL jargon), which are stored in RAM (main memory). You can either draw the contents of a surface into another surface or onto the screen (which is called “blitting”), rotate and scale surfaces with the pygame.transform module, or draw shapes into surfaces with the pygame.draw module.
Blitting and drawing operations in PyGame are reasonably fast, because they are implemented in C, but they use the CPU and manipulate RAM. That means that every drawing and blitting operation has some Python function call overhead, but takes an amount of time that is roughly proportional to the number of pixels drawn.
The higher your resolution, the more pixels are drawn. Theoretically you can blit 100 8x8 surfaces, which makes for 6400 pixels, faster than you can draw 10 32x32 surfaces, which would amount to 10240 pixels. In practice, the overhead of Python will take its toll when you blit hundreds of small surfaces, and when you set individual pixels of a surface, the overhead of Python interpretation will make this operation 100x slower than the equivalent C code.
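The arithmetic above can be turned into a toy cost model: a fixed cost per blit call plus a cost per pixel. The two constants below are hypothetical values in arbitrary units, chosen only to show how the per-call overhead can flip the result. They are not measurements.

```python
def pixels_drawn(count, width, height):
    """Total number of pixels touched when blitting `count` surfaces."""
    return count * width * height


def estimated_cost(count, width, height, call_overhead, per_pixel):
    """Toy model: fixed cost per blit call plus a cost per pixel."""
    return count * call_overhead + pixels_drawn(count, width, height) * per_pixel


small = pixels_drawn(100, 8, 8)    # 6400 pixels
large = pixels_drawn(10, 32, 32)   # 10240 pixels
print(small, large)

# Hypothetical costs in arbitrary units, not measurements.
CALL_OVERHEAD = 50.0
PER_PIXEL = 1.0

print(estimated_cost(100, 8, 8, CALL_OVERHEAD, PER_PIXEL))    # 11400.0
print(estimated_cost(10, 32, 32, CALL_OVERHEAD, PER_PIXEL))   # 10740.0
```

With these made-up constants, the 100 small blits touch fewer pixels but still come out more expensive, because they pay the call overhead 100 times instead of 10 times.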
When rendering using a graphics card, it's not just faster because the graphics card is optimised for pushing pixels around and has a higher fill rate, but because your code running on the CPU can do other things while the GPU is doing the rendering. As long as your GPU is not maxed out, using it is "infinitely" faster than the CPU - at least in terms of CPU time used. Some GPU operations are as slow as or slower than software rendering. Loading a surface from RAM into a texture in graphics memory (VRAM) is limited by both the speed of the RAM and the system bus (your graphics card is probably connected via PCIe), but it's probably still faster than loading an image file from a hard disk (not so sure about loading a file from NVMe SSD). As long as all the textures have been loaded into the GPU (not just RAM) when you loaded the level, drawing them on the GPU every frame will be much, much faster than the equivalent software-based rendering.
Unfortunately, you can't just use a graphics card as a drop-in replacement for software rendering in PyGame for two connected reasons: First, there is some overhead involved in talking to the graphics card, so replacing every blit with a draw call to the graphics card might be slower, and second, you can blit surfaces not just onto the screen, but onto other surfaces, and you can draw onto every surface, and read back the pixels. To achieve 100% backward compatibility for rendering all existing PyGame games with hardware acceleration, you would have to copy pixels back and forth between main memory and VRAM, and have the CPU wait for the GPU to finish drawing before reading back pixels. That kind of overhead would be way worse than Python function call overhead on small drawing operations, comparable to the overhead of making queries to your SQL database in a for loop instead of using a JOIN statement. Not all games in PyGame take advantage of all features, so it's possible to rewrite most but not all PyGame games to take advantage of the GPU easily.
On the bright side, many PyGame games are using only simple graphics with a low resolution, so rendering is usually not the bottleneck in the first place.
One benefit of software rendering is that drawing works similarly to the way it did back in BASIC, and that makes PyGame a great teaching tool. Rendering is conceptually simple, and fully exposed to the user. With hardware-accelerated graphics, either a lot of complexity would be exposed to the user, or a lot of complexity would be hidden away behind a game engine. This brings us to the next point:
PyGame is a library, not an engine. As a pythonic wrapper around SDL, it occupies a similar niche to Löve2D, libGDX, XNA/FNA/MonoGame, SFML, and raylib.
In PyGame, you are free to design your game loop however you like, and there is no concept of "game objects", levels, or components. You are free to draw whatever you want to the screen, and use your own algorithms and data structures.
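To make this concrete, here is one possible game loop, a minimal sketch rather than the one true way to do it. It uses a fixed time step for the update and lets pygame.time.Clock cap the frame rate. The names step_simulation and run, the window size, and the speed of the square are all choices made for this example.

```python
def step_simulation(position, velocity, accumulator, frame_time, dt=1 / 60):
    """Advance a 1D position in fixed steps of `dt` seconds.

    Returns the new position and the leftover time that did not fill
    a whole step.
    """
    accumulator += frame_time
    while accumulator >= dt:
        position += velocity * dt
        accumulator -= dt
    return position, accumulator


def run():
    """Open a window and move a square across it until the window is closed."""
    import pygame  # imported here so step_simulation() works without PyGame

    pygame.init()
    screen = pygame.display.set_mode((640, 480))
    clock = pygame.time.Clock()
    x, accumulator = 0.0, 0.0
    running = True
    while running:
        # 1. Handle input.
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False
        # 2. Update the game state. tick() caps the loop at 60 FPS
        #    and returns the elapsed time in milliseconds.
        frame_time = clock.tick(60) / 1000.0
        x, accumulator = step_simulation(x, 120.0, accumulator, frame_time)
        # 3. Draw.
        screen.fill((0, 0, 0))
        pygame.draw.rect(screen, (255, 255, 255), (int(x) % 640, 220, 40, 40))
        pygame.display.flip()
    pygame.quit()
```

Call run() to start the loop. Nothing forces this structure on you: you could just as well update with a variable time step, or redraw only when something has changed.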
A game engine usually controls the entry point of the program, and loads level data and game-objects into its own data structures. The gameplay code written by the game designer is invoked during loading, when game objects are updated, or during collisions, but the game loop of the engine cannot be changed. Furthermore, the engine is structured so gameplay code doesn't need to do a lot of complex computation: Rendering, physics, and collision detection are already handled by the engine, and gameplay code only needs to set parameters on game objects, create or destroy them, which is usually quite fast - or rather, if the game engine is fast enough, the game will probably run fast enough.
In contrast to this, libraries like PyGame require the game designer to implement functionality like culling and collision detection (if desired), and it is easy enough to get them slightly wrong, or to accidentally implement them with quadratic run-time. There are more ways to shoot yourself in the foot: you could make too many draw calls, read a file from disk every frame, or stop updating the game while you wait for network input. All of these problems can be avoided when you know what you are doing, and all of them can just as easily crop up when "rolling your own engine" in a fast language like C++. The next problem, the interpreter overhead per game object, is unavoidable when using Python and PyGame, but it is not as impactful as choosing algorithms and data structures with good worst-case complexity.
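To illustrate the accidental quadratic run-time mentioned above, here is a sketch of collision detection done two ways: testing every pair, and testing only rectangles that share a cell of a uniform grid. The function names, the (x, y, w, h) tuple layout, and the cell size of 64 are assumptions made for this example. A grid is only one of several possible broad-phase techniques.

```python
from collections import defaultdict
from itertools import combinations


def overlaps(a, b):
    """True if two axis-aligned rectangles (x, y, w, h) overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def collisions_naive(rects):
    """Test every pair: n * (n - 1) / 2 overlap checks."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(rects), 2)
            if overlaps(a, b)}


def collisions_grid(rects, cell_size=64):
    """Only test rectangles that share a grid cell."""
    grid = defaultdict(list)
    for index, (x, y, w, h) in enumerate(rects):
        # Register the rectangle in every cell it touches.
        for cx in range(int(x // cell_size), int((x + w) // cell_size) + 1):
            for cy in range(int(y // cell_size), int((y + h) // cell_size) + 1):
                grid[cx, cy].append(index)
    pairs = set()
    for indices in grid.values():
        for i, j in combinations(indices, 2):
            if overlaps(rects[i], rects[j]):
                pairs.add((i, j))
    return pairs


rects = [(0, 0, 10, 10), (5, 5, 10, 10), (200, 200, 10, 10),
         (60, 60, 10, 10), (65, 65, 10, 10)]
print(sorted(collisions_naive(rects)))  # [(0, 1), (3, 4)]
print(sorted(collisions_grid(rects)))   # [(0, 1), (3, 4)]
```

Both functions find the same pairs. The grid version does far fewer overlap checks when objects are spread out over the level, but it degrades towards the naive version when everything is crowded into a few cells.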
In a game engine, the core functionality is usually implemented in C or C++, with only the game logic written in a scripting language like Lua or a managed language like C#. This way, the "hot" code paths of rendering and collision detection do not only use algorithms with good worst-case behaviour, they also have very little overhead. Many modern engines even lay out the game objects in a contiguous block of memory to avoid following pointers, speed up memory access, and allow fast iteration over all game objects.
If this code was written in pure Python, even if the Python-based main loop called fast blitting routines implemented in assembly language, or if rendering was based on OpenGL, iterating over all game objects would involve Python lists of Python objects, following pointers to each object, and virtual method dispatch. There would be a much greater overhead per object.
It is possible to implement the core of an engine in efficient C, with game objects laid out in memory for fast access, and to expose this data structure to Python scripts through some kind of façade or proxy that translates Python method calls into manipulations of these flat data structures. This is of course a performance trade-off, trading slightly worse performance of custom scripts and a fixed main loop for much better performance of the engine core, but it might be worth it, especially if scripts leave the heavy lifting to the engine core and just change values of game objects. Fast native-code game engines that can be scripted with Python already exist.
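The proxy idea can be sketched in pure Python, with array.array standing in for the flat memory that a native engine core would own. The World and EntityProxy classes are hypothetical names for this example, and in a real engine the loop in move_all would run in C rather than in the interpreter.

```python
from array import array


class World:
    """Stores all positions in two flat arrays instead of one object per entity."""

    def __init__(self):
        self.xs = array("d")
        self.ys = array("d")

    def spawn(self, x, y):
        self.xs.append(x)
        self.ys.append(y)
        return EntityProxy(self, len(self.xs) - 1)

    def move_all(self, dx, dy):
        # In a real engine this loop would run in native code.
        for i in range(len(self.xs)):
            self.xs[i] += dx
            self.ys[i] += dy


class EntityProxy:
    """Looks like a game object, but only stores an index into the World."""

    __slots__ = ("_world", "_index")

    def __init__(self, world, index):
        self._world = world
        self._index = index

    @property
    def x(self):
        return self._world.xs[self._index]

    @x.setter
    def x(self, value):
        self._world.xs[self._index] = value

    @property
    def y(self):
        return self._world.ys[self._index]

    @y.setter
    def y(self, value):
        self._world.ys[self._index] = value


world = World()
player = world.spawn(1.0, 2.0)
enemy = world.spawn(10.0, 20.0)

player.x += 5.0            # script code talks to the proxy ...
world.move_all(1.0, 1.0)   # ... while the core works on the flat arrays

print(player.x, player.y)  # 7.0 3.0
print(enemy.x, enemy.y)    # 11.0 21.0
```

Every attribute access on the proxy costs an extra indirection, which is the "slightly worse performance of custom scripts" side of the trade-off.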
More On Python Performance
PyGame has some fast functions implemented in C, but unfortunately, performance is like dieting: It only adds up. Just like you can't "even out" the calories of a meal if you eat three salads after you had three bars of chocolate, you can't really improve performance by calling more PyGame functions. Sometimes, the slight overhead of the Python/C API has a higher impact on overall performance than the time spent actually doing the thing.
The global interpreter lock or GIL is another annoying implementation detail of CPython. It ensures that C extension modules are running single-threaded by default, and Python code is always running single-threaded. You can execute Python code while a C extension waits for input or performs a long-running computation in another thread, but only if the C extension is programmed to allow that. And it had better not access any Python objects while another thread is running Python code, or your code will crash.
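You can observe the effect of the GIL with a small experiment: run the same CPU-bound Python function twice in a row, then in two threads at once. This is only a sketch with made-up function names. On a standard CPython build the two timings tend to be similar, but results vary by machine, and the experimental free-threaded builds of newer CPython versions behave differently.

```python
import threading
import time


def count_down(n):
    """CPU-bound busy work in pure Python."""
    while n > 0:
        n -= 1


def run_sequential(n, jobs):
    """Run the jobs one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, jobs):
    """Run each job in its own thread and return the elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


N = 2_000_000
print(f"sequential: {run_sequential(N, 2):.3f} s")
print(f"threaded:   {run_threaded(N, 2):.3f} s")
```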
If you use an engine that can be scripted with Python, only one script can be executed at a time, but if your engine core is all C, you can at least make the performance-critical parts multi-threaded.
PyGame can perform some costly operations like blitting or pygame.transform.scale() while another thread is running Python code, but in general, multithreading does not improve the performance of PyGame all that much.
Python 3 has introduced the asyncio module to make it slightly easier to write network servers that concurrently perform I/O with many clients (which was something you could already do with threads) and the numpy module can use a multi-threaded algorithm to compute the eigenvalues of large matrices. Neither can be used to make PyGame utilise another CPU core.
There is another Python implementation, called PyPy, consisting of a Python VM and a just-in-time compiler written in RPython, a restricted subset of Python. PyPy often runs pure Python code much faster than CPython does. It's not quite as fast as C, but it can come close in tight loops. PyPy does have a GIL of its own, though, so multi-threaded Python code does not get spread across multiple CPU cores there either.
Unfortunately, Python modules that are implemented in C assume the GIL is in place and use the old CPython API, and so it happens that PyGame talks to PyPy via some kind of proxy or facade that translates between the old Python API and the fast compiled PyPy code. As a result, the overhead of writing your main loop on PyPy is much lower, but the function call overhead when calling PyGame is much, much higher, so that many games will actually run slower on PyPy than on CPython. Smart people are already working on a new API to make C extension modules run fast on different Python implementations, so this might change in the future.