bpo-46823: Implement LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE #31484
Conversation
pyperformance (gcc with --enable-optimizations --with-lto):
With a very nice microbenchmark:

from pyperf import Runner, perf_counter
from itertools import repeat

class A:
    pass

def bench(loops):
    a = A()
    a.x = 42
    it = repeat(None, loops)
    t0 = perf_counter()
    for x in it:
        a.x; a.x; a.x; a.x; a.x
        a.x; a.x; a.x; a.x; a.x
        a.x; a.x; a.x; a.x; a.x
        a.x; a.x; a.x; a.x; a.x
        a.x; a.x; a.x; a.x; a.x
    return perf_counter() - t0

runner = Runner()
runner.bench_time_func("a.x", bench, inner_loops=25)
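For context on why this fusion pays off: each `a.x` in the loop compiles to a `LOAD_FAST` immediately followed by a `LOAD_ATTR`, which is the pair the new superinstruction replaces. A quick (illustrative, not part of the PR) check with the stdlib `dis` module:

```python
import dis

def f(a):
    return a.x

# Collect the opcode names for the function body.
ops = [ins.opname for ins in dis.get_instructions(f)]

# The attribute load is immediately preceded by the local-variable load;
# that adjacency is what makes a LOAD_FAST + LOAD_ATTR superinstruction
# worthwhile. (Opcode names vary slightly across CPython versions, so we
# only check the adjacency, not exact positions.)
paired = any(ops[i].startswith("LOAD_FAST") and ops[i + 1] == "LOAD_ATTR"
             for i in range(len(ops) - 1))
assert paired
```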
Execution counts: execution counts for all instructions
Pair counts: pair counts for the top 100 pairs
Specialization stats by family (specialization attempts for each): BINARY_SUBSCR, STORE_SUBSCR, UNPACK_SEQUENCE, FOR_ITER, STORE_ATTR, LOAD_ATTR, COMPARE_OP, LOAD_GLOBAL, BINARY_OP, LOAD_METHOD, PRECALL, CALL
Specialization effectiveness
Call stats: inlined calls and frame stats
Object stats: allocations, frees and dict materializations
Stats gathered on: 2022-02-22
🤖 New build scheduled with the buildbot fleet by @sweeneyde for commit b6e0b7b 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
I wanted to keep the superinstructions separate from the specialized instructions, but this isn't much extra code and the speedup looks solid.
Pragmatism beats purity, it seems.
A couple of minor changes, otherwise looks good.
@@ -5430,6 +5460,53 @@
MISS_WITH_CACHE(BINARY_SUBSCR)
MISS_WITH_CACHE(UNPACK_SEQUENCE)
MISS_WITH_OPARG_COUNTER(STORE_SUBSCR)

LOAD_ATTR_INSTANCE_VALUE_miss:
This shouldn't need special casing. If the LOAD_ATTR_INSTANCE_VALUE gets de-optimized, the preceding LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE is still valid, just less likely to hit.
Given how unlikely a branch to the LOAD_ATTR is, don't worry about this case.
Since the combined instruction borrows the cache from the following instruction, I was worried about the combined instruction misinterpreting some of the cache data, so that's why I was attempting to keep them in sync.

But it seems that the instructions in the LOAD_ATTR family all interpret the cache entries in the same way, so for now anyway, if there's ever a correct cache1->tp_version, then cache0->index should be correct: you can't turn something with a class that has a _PyObject_ValuesPointer into a class with __slots__ without modifying tp_version_tag. (right?)

This might not be true though if we ever specialize instance.class_attribute. In that scenario, a LOAD_ATTR_INSTANCE_VALUE cache entry could have a cache1->tp_version and a cache0->index into the values array of the instance, but then later get de-optimized and re-optimized to the hypothetical LOAD_ATTR_CLASS_ATTRIBUTE, which could (?) have the same cache1->tp_version but a cache0->index indexing into the type's values array.

It feels less fragile to me to keep the adjacent instructions in sync, so that the cache is only interpreted in one way at a time -- never as a LOAD_ATTR_INSTANCE_VALUE cache from one instruction's perspective but a LOAD_ATTR_ADAPTIVE cache from a different perspective. Is that worth worrying about?
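The invariant being discussed -- a cached index is only trusted while the type's version tag is unchanged, and any layout change bumps the tag -- can be modeled with a small Python sketch. All names here (TypeInfo, InlineCache, load_attr) are invented for illustration; this is not CPython's actual C code:

```python
class TypeInfo:
    """Stand-in for a type with a tp_version_tag-style counter."""
    def __init__(self):
        self.version_tag = 0
        self.layout = {}            # attribute name -> values-array index

    def set_layout(self, layout):
        self.layout = dict(layout)
        self.version_tag += 1       # any layout change invalidates caches

class InlineCache:
    """Stand-in for the cache0->index / cache1->tp_version pair."""
    def __init__(self):
        self.tp_version = None
        self.index = None

def load_attr(cache, tp, values, name):
    if cache.tp_version == tp.version_tag:
        return values[cache.index]  # fast path: version tag still matches
    # Miss: do the generic lookup, then respecialize the cache.
    cache.index = tp.layout[name]
    cache.tp_version = tp.version_tag
    return values[cache.index]
```

The point of the model: as long as every instruction that consults the cache checks the version tag first, a stale index can never be dereferenced, because changing the layout bumps the tag.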
Yes, it does make sense to special-case LOAD_ATTR_INSTANCE_VALUE_miss.
I didn't properly consider the case where LOAD_ATTR_INSTANCE_VALUE gets deoptimized before LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE.
LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE_miss can be simplified by letting LOAD_ATTR_INSTANCE_VALUE_miss do the work.
Something like:
LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE_miss:
{
    // This is special-cased because we have a superinstruction
    // that includes a specialized instruction.
    // We execute the first instruction, then perform a miss for
    // the second instruction as usual.

    // Do LOAD_FAST
    {
        PyObject *value = GETLOCAL(oparg);
        Py_INCREF(value);
        PUSH(value);
        NEXTOPARG();
        next_instr++;
    }
    // Now we are in the correct state for LOAD_ATTR
    goto LOAD_ATTR_INSTANCE_VALUE_miss;
}
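The shape of this suggestion -- run the superinstruction's first half, then fall through to the plain instruction's miss handler -- can be sketched in Python. The function names and frame/stack model below are invented for illustration, not CPython's implementation:

```python
def load_fast(frame_locals, oparg, stack):
    # Push the local variable at slot `oparg` onto the evaluation stack.
    stack.append(frame_locals[oparg])

def load_attr_miss(stack, name):
    # Generic, unspecialized attribute lookup on the stack top,
    # replacing the owner with the looked-up attribute value.
    owner = stack.pop()
    stack.append(getattr(owner, name))

def load_fast__load_attr_miss(frame_locals, oparg, stack, name):
    # Superinstruction miss handler: do the LOAD_FAST half ourselves,
    # then delegate to the ordinary LOAD_ATTR miss path.
    load_fast(frame_locals, oparg, stack)
    load_attr_miss(stack, name)
```

The benefit is that the superinstruction needs no miss logic of its own: after the LOAD_FAST half runs, the machine state is exactly what the plain miss handler expects.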
Python/ceval.c (outdated)

LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE_miss:
{
    // Special-cased because the standard opcode does not push
    // "owner" to the stack, but we need to push it if we DEOPT.
This seems misleading. We don't need to push the local here because the jump to LOAD_FAST does it.
Maybe: "Special-cased as this is a specialization of LOAD_FAST, but de-optimization occurs during the following LOAD_ATTR instruction."

When you're done making the requested changes, leave the comment: "I have made the requested changes; please review again".
Python/ceval.c (outdated)

{
    // This is special-cased because we have a specialization of
    // LOAD_FAST that borrows cache from the following instruction.
    STAT_INC(LOAD_FAST__LOAD_ATTR_INSTANCE_VALUE, miss);
I don't think we want to count this as a miss. It is LOAD_ATTR_INSTANCE_VALUE that misses, and we don't want to count the miss twice.

I have made the requested changes; please review again
Thanks for making the requested changes! @markshannon: please review the changes made to this pull request.

🤖 New build scheduled with the buildbot fleet by @markshannon for commit 6450e4b 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

Looks good

Failures are either timeouts or pre-existing test_gdb failures. All unrelated to this PR.
I still need to benchmark, but this avoids an INCREF/DECREF pair for "owner", and this is currently the most common opcode pair.
See faster-cpython/ideas#291
https://bugs.python.org/issue46823