-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48744: [C++][Parquet] Fix undefined behavior in memcpy with nullptr #48745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nullptr When ByteArray has len=0 and ptr=nullptr (the default-constructed state), calling std::memcpy with nullptr is undefined behavior according to the C++ standard, even when size is 0. This commit adds guards to check len > 0 before calling memcpy in: - TypedStatisticsImpl<ByteArrayType>::Copy (statistics.cc) - DictDecoderImpl<ByteArrayType>::SetDict (decoder.cc) - Test utilities (statistics_test.cc, column_writer_test.cc) Claude-Generated-By: Claude Code (cli/claude-opus-4-5=100%) Claude-Steers: 1 Claude-Permission-Prompts: 8 Claude-Escapes: 1
|
|
|
BTW, are there other similar calls in the parquet folder? I just did a quick peek and it seems that arrow/cpp/src/parquet/column_writer.cc Line 919 in fe5e0e5
|
Co-authored-by: Gang Wu <[email protected]>
|
this is getting scattered all the places, so I think we can just provide a |
Introduce a centralized SafeMemcpy function in types.h that guards against calling std::memcpy with nullptr, which is undefined behavior even when size is 0. Replace memcpy calls throughout parquet with SafeMemcpy where the size is dynamic and could be 0 (ByteArray lengths, buffer sizes, RLE sizes). Keep std::memcpy for static sizes (sizeof, FLBA type_length) that are guaranteed > 0.
|
@wgtmac can you help merging this? Thanks! |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is a good idea to wrap memcpy though the performance might not affected if it is inlined. My concern is the inconsistently mixed use of the wrapper and the raw function across the code base which confuses developers on which to choose in the future.
Should we just consider using std::copy_n to replace memcpy in the parquet subdirectory where variable-length is copied to limit the scope of this change? In principle, std::copy_n is safer and can be optimized by compilers.
WDYT @pitrou?
| /// | ||
| /// According to the C++ standard, calling std::memcpy with a nullptr is undefined | ||
| /// behavior even when size is 0. This function guards against that case. | ||
| static inline void SafeMemcpy(void* dest, const void* src, size_t size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be the wrong place to put SafeMemcpy. It seems cpp/src/arrow/util/memory_internal.h is a better home.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have arrow/util/ubsan.h and there's a handy function MakeNonNull there.
But I wouldn't be against adding a SafeMemcpy to that same file if it makes things easier.
Rationale for this change
As noted by @wgtmac in #48717 (comment), calling
std::memcpywith a nullptr source is undefined behavior according to the C++ standard, even when the size is 0.When
ByteArrayhaslen=0andptr=nullptr(the default-constructed state pertypes.h:649), the current code invokes UB.What changes are included in this PR?
Added guards to check
len > 0before callingmemcpyin:TypedStatisticsImpl<ByteArrayType>::Copyinstatistics.ccDictDecoderImpl<ByteArrayType>::SetDictindecoder.ccstatistics_test.ccandcolumn_writer_test.ccAre these changes tested?
Existing tests pass. The fix prevents potential UB that sanitizers or optimizing compilers could exploit.
Are there any user-facing changes?
No.