Q : “What could be the reason for this inferior performance?”
Welcome to the real-world of how computing devices actually work.
Your code makes a few performance sins, so let’s dig them up, ok?
The code makes 500
.map()-ed calls of an
object() function and pays immense upfront costs of spawning n-copies of the whole python session into replicated processes (so as to escape from the monopolistic GIL-lock ownership re-
[SERIAL]-isation, that would appear otherwise – read other great posts on this subject, if not aware about GIL-details ), yet the actual work, that is being delegated to such expensive “remote”-d processes is to just run a
.minimize()-method, driven by a square of a difference. In other words, all the
.cpu_count()-times replicated ( memory-IO allocated (thus swapping, if having headbanged into any of the physical RAM-size and O/S memory-manager ceilings ) + copied the full range of the main-python interpreter process memory-image … all data-structures included … – yes, that is what happens to the win O/S, somehow least devastating costs on Linux O/S-es ).
So quite an expensive product of calling just an innocent
pool.map() SLOC, isn’t it?
Next comes the SER/DES + communication costs when passing parameters there and back. This bears a few
kB here for the payloads on a way “there” and a few
B on the way of the results going “back”, so you happily do not sense much of this kind of pain, that may hurt your code in some other, less happy use-cases, yet you still do pay it… yes, again by some additional add-on overhead times, each time of the 500
Next comes the worse part. Having requested
.cpu_count()-many processes to stand in the O/S process-scheduler queue, so as to get some time to grab a CPU-core and execute (for some O/S process-scheduler granted time, before being forcibly moved out-of-the-CPU-core, so as to let some other O/S-assigned process to move in & execute – this is how process-scheduling works), you might already feel the smoke, that this comes at an additional cost of add-on overhead times, consumed by the (heavy) lifting, once many processes stay waiting in the queue for their respective turn.
Next comes the worst penalty – you easily, almost granted, start to pay
~ 300x ~ higher memory-access costs, as upon your scheduled process re-entry to the CPU-core has lost the chance to reuse the fastest computing resource, the
L1_data on-core cache, first by having any such fast-2b-re-used data overwritten by some other, stored there by a processes, that has used this CPU-core before your process gets its next CPU-core-share to continue, after a previous process has been forcedly removed by the O/S scheduler, leaving but LRU-cache data you never need to reuse & so your process will pay extra costs to re-fetch your data again (paying costs of not less than
~100 ~380 \[ns\] to re-fetch data from main RAM, … if memory-I/O channels permit to be served without further waiting for a free channel … ). This will most probably happen per each O/S-process-scheduler process move-out/move-in cycle, as in most cases you do not even get to the same CPU-core as you have been camped last time (so even no chance to speculate on some residuals, that might “remain” in the
L1_data on-core cache from your “last round” the process was assigned to some of the available CPU-cores.
Last but not least – contemporary CPU perform another level of process-2-core “optimisation” strategy, so as to reduce the risk of hitting the thermal-ceiling. So, the processes move “around” even more often, so as to allow them to work on colder CPU-cores and leaving “hot”-cores cold down. That strategy works fine for light workloads, where a few computing-intensive processes may enjoy thermal-motivated jumping from a hot CPU-core to another, colder CPU-core, because otherwise, if it were left on the hot CPU-core, the hot silicon-fabric will slow the frequency of such CPU-core, so as to permit the thermal-ceiling not be overcome – and yes, you get it – your processing will get slower as at the reduced CPU-frequency you get less CPU-clocks and less computing will take place, until this hot-CPU-core gets colder (which is a sort of oxymoron for heavy-computing jobs, isn’t it?). If this were for a few processes on a multicore-CPU, such thermal-strategy may seem attractive to show high-GHz-clocking and thermal-jumps from hot CPU-core to colder CPU-core, but that -for obvious reasons…- stops working if you
.map()-processes to “cover” all CPU-cores (not mentioning all other O/S-orchestrated processes in the queue), so the only result will be, that all-hot CPU-cores will slow down the CPU-frequency and will work in slow-motion so as to withstand the thermal-ceiling limitations.
No. This is how the contemporary silicon toys work. Speculative strategies work nice, but only for a few and rather light-weight workloads. Then you start suffering from the reality of all the constraints that the laws of physics drive (which were until that hidden by the over-hyped marketing slogans, valid only for vast-excess of (cold) resources and light-weight computing / memory-I/O traffic patterns).
More reads on this + an analysis on modern criticism of the original ( overhead-naive ) Amdahl’s Law argument is here.
So, welcome to the reality of computing :o)
CLICK HERE to find out more related problems solutions.