Attaching a Debugger to BPF #14756
Comments
Just an update - I've looked into the third alternative option under next steps a bit more and I think I'm going to try that next. I have a feeling this will end up being a bit less error-prone, and it also has the added benefit of making gdbstub better for many use cases.
Was this ever worked on or finished?
There is a work in progress by @terorie and me building on top of this. We want to make it accessible in the browser, but unfortunately that will still take some time. For now, you could follow the steps outlined here to get basic debugging functionality with gdb. Getting this up and running is a bit cumbersome as of now, but after #anza-xyz/llvm-project#38 lands in bpf-tools you would only need to compile a patched gdb and patch rbpf with the gdbstub crate to make it work. If you have a project you want to debug today, you could either build Rust with an LLVM fork as mentioned in the repo or share your program's code (if you are able/allowed to) so I may be able to build it with debug info for you. In any case, since this is kind of scattered around GitHub and I'm not sure the info provided in solana-poc-debugging-example is very clear, you could open an issue there and I can help to get it running.
@jawilk awesome tysm
A beginner question: is this related to the debugging described here? https://solana.com/docs/programs/debugging The link above shows a way to debug BPF programs that is still very limited.
Yes, this is correct. |
Problem
There is no debugger support for Solana BPF programs yet, so users are limited to print statements when debugging them. That is difficult because many Solana programs use shared memory or manual struct packing that does tricky things with pointers (pointer arithmetic, shared memory references), and those are much easier to inspect with a debugger.
Proposed Solution
Implement a GDB stub server for solana BPF. Implementing a stub server instead of a dedicated arch definition makes it easier to decouple the debugging logic from the VM itself and to debug programs running in the context of a node. It also gives a lot more flexibility for defining how stuff like single step, breakpoints, and watchpoints are implemented, and in particular we can do it however we like in the context of a running BPF instance.
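For context on what a stub server exchanges with a GDB client: the GDB Remote Serial Protocol frames each message as `$<payload>#<checksum>`, where the checksum is the sum of the payload bytes modulo 256, written as two hex digits. The sketch below is illustration only. The function names are hypothetical, it ignores payload escaping and acknowledgements, and the `gdbstub` crate takes care of this framing itself.

```rust
/// Frame a payload as a GDB Remote Serial Protocol packet: `$<payload>#<checksum>`.
/// The checksum is the sum of the payload bytes modulo 256, as two hex digits.
/// (Hypothetical helper for illustration; escaping of `$`, `#` and `}` is not handled.)
fn frame_packet(payload: &str) -> String {
    let checksum = payload.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    format!("${}#{:02x}", payload, checksum)
}

/// Parse a framed packet and return its payload if the checksum matches.
/// (Hypothetical helper for illustration.)
fn parse_packet(packet: &str) -> Option<&str> {
    let body = packet.strip_prefix('$')?;
    let (payload, checksum_hex) = body.rsplit_once('#')?;
    let expected = u8::from_str_radix(checksum_hex, 16).ok()?;
    let actual = payload.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    (expected == actual).then_some(payload)
}
```

For example, the single-step request `s` travels as `$s#73`, and an `OK` reply as `$OK#9a`.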
Proposed implementation follows three steps:
CURRENT STATUS: For this, I used the `gdbstub` crate to avoid reinventing wheels. I then defined a simple "request/reply" pattern between the two threads over a `std::sync::mpsc` channel so the gdb stub could ask the VM to do things for it, like "step a single instruction" or "set a breakpoint at some address". Currently, the debugger, if enabled, will block until a GDB client connects to it.
CURRENT STATUS: This was basically implemented as a handler for the "request/reply" pattern mentioned in the previous item, though actual tests still need to be written. It's also worth noting that these breakpoints / single steps are pretty much useless because of the next item. Initially it was conditionally compiled with the `debug` feature, but that ended up being annoying for writing tests, so instead I just have some `if let` statements gating debugger behavior in the loop. In the future, if that's too much performance overhead, it can be refactored to have less of an impact, probably using some kind of IoC injection. However, only single-step works from an actual GDB instance (the next item says why).
CURRENT STATUS: If the existing support isn't working for our needs, we'd need to extend the existing target definition. The existing BPF support doesn't seem to be in any "commonly distributed" release of gdb, and I spent 4-ish hours trying to get it to build on my Mac, but I ended up just using x86 gdb to avoid wasting time trying to compile GNU stuff on macOS Catalina. This "works" in the sense that execution stops at the right pc if I print it out in the VM while debugging a test program, but gdb doesn't understand the server's responses containing BPF register information because (obviously) x86 has different registers than BPF. Eventually we'll probably need to fork it and distribute it the same way we distribute our fork of the LLVM toolchain.
CURRENT STATUS: This is where the majority of my internship was spent, and as my internship comes to an end I unfortunately still haven't found a solution. This turned out to be a rather difficult issue that involved a lot of "code archaeology" (in the words of Matt Godbolt), manually annotating hexdumps of malformed ELF sections that couldn't be read by `llvm-dwarfdump`, and a lot of spelunking in the massive, messy codebases of the LLVM project, namely clang, lld, and llvm itself.
@jackcmay has been very helpful as far as providing suggestions and links to relevant and helpful documentation - I hope I didn't consume too much of his time. To make sure the minimum amount of information gets lost when I go back to school, I've added below a somewhat extensive summary of what I did, what I found, and the future directions I would take had I more time. Part of me thinks I might spend some time on this afterwards because I'm still curious and it's open source, but in any case feel free to tag me in a comment on this issue if anyone has further questions about this after my internship ends.
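The "request/reply" pattern between the stub thread and the VM thread described in the status notes above can be sketched roughly as follows. This is a toy under stated assumptions, not the actual rbpf integration: the `Request` and `Reply` enums, `run_vm`, `debug_session`, and the fake VM that just counts instruction indices are all hypothetical names made up for illustration. Only `std::sync::mpsc` comes from the description.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Hypothetical length of the toy program, in instructions.
const PROGRAM_LEN: u64 = 8;

/// What the debugger side may ask the VM to do (hypothetical message set).
#[derive(Debug)]
enum Request {
    Step,
    Continue,
    SetBreakpoint(u64),
    ReadPc,
    Detach,
}

/// What the VM answers (hypothetical message set).
#[derive(Debug, PartialEq)]
enum Reply {
    Ok,
    Stopped { pc: u64 },
    Pc(u64),
}

/// Toy VM loop: blocks on the request channel and answers each request in turn.
fn run_vm(requests: Receiver<Request>, replies: Sender<Reply>) {
    let mut pc: u64 = 0;
    let mut breakpoints: HashSet<u64> = HashSet::new();
    for request in requests {
        let reply = match request {
            Request::Step => {
                pc += 1;
                Reply::Stopped { pc }
            }
            Request::Continue => {
                // Run until a breakpoint is hit or the toy program ends.
                loop {
                    pc += 1;
                    if breakpoints.contains(&pc) || pc >= PROGRAM_LEN {
                        break;
                    }
                }
                Reply::Stopped { pc }
            }
            Request::SetBreakpoint(addr) => {
                breakpoints.insert(addr);
                Reply::Ok
            }
            Request::ReadPc => Reply::Pc(pc),
            Request::Detach => break,
        };
        if replies.send(reply).is_err() {
            break; // debugger side went away
        }
    }
}

/// Debugger side: send one request, wait for its reply, repeat.
fn debug_session() -> Vec<Reply> {
    let (req_tx, req_rx) = mpsc::channel();
    let (rep_tx, rep_rx) = mpsc::channel();
    let vm = thread::spawn(move || run_vm(req_rx, rep_tx));

    let mut seen = Vec::new();
    for request in [
        Request::SetBreakpoint(2),
        Request::Step,
        Request::Continue,
        Request::ReadPc,
    ] {
        req_tx.send(request).expect("vm thread hung up");
        seen.push(rep_rx.recv().expect("vm thread hung up"));
    }
    req_tx.send(Request::Detach).expect("vm thread hung up");
    vm.join().expect("vm thread panicked");
    seen
}
```

Using one channel per direction keeps the VM loop a plain blocking `for` over incoming requests, which matches the "debugger blocks until asked to do something" behavior described above.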
What I did
- Added the `-g` flag to the compilation commands here in rbpf's tests, and `ld.lld` ended up having an aneurysm, screaming this: `invalid pointer size in compunit header`. That prompted me to go learn what a linker actually does in enough detail to understand what was happening, since at this point my understanding of the compilation process was the typical "intro systems" explanation (I hadn't taken a course on compilers before): "the compiler turns your C code into many object files and the linker does some black magic to fuse them all together into a single executable or shared library". Learning that in enough detail for stuff to actually make sense took almost a week, and even then relocations still seemed kinda magical.
- Modified `cargo-build-bpf` to not strip symbols and removed the `--release` flag from `cargo`, so that I could spend some time adding print statements to LLD to see what the existing BPF relocations were doing in the context in which users would actually build their binaries, using examples from `solana-program-library`. That ended up causing `ld.lld` to straight up fail because it didn't handle the `R_BPF_NONE` relocation, which apparently clang emits. Added a simple fix for that here, but then I ended up getting roughly the same `ld.lld` tantrum as above.
- Used `readelf` and `dwarfdump` to inspect ELFs. Dumps of the shared objects that resulted from the linker were giving enormous outputs (>100k lines long) that contained the same `invalid pointer size in compunit header`, but dumps of the unlinked relocatable objects didn't, so now I was pretty sure it was an issue in the linker (though not entirely sure; see below for other possible culprits that I didn't inspect).
- Went back to `rbpf` as the simplest possible case, so I wrote a "small but not trivial" buggy test C program and started using that for all of my future investigations. Now the dumps were of a comprehensible size, and I did some more spelunking in the `lld` codebase to get an overall idea of what it was doing before I continued. The biggest thing I noticed is that almost all of solana's patches were relocation-related, so I thought it was probably a relocation support issue and looked into that specifically. While I was doing this, I realized that there are separate `llvm-readelf` and `llvm-dwarfdump` commands, whose outputs actually interpreted the DWARF sections for me, which was pretty nice.
- Found a flag, `-X +dwarfris`, that prevented cross-section relocations from occurring in DWARF sections, and when I tried that it made `ld.lld` stop screaming, so I proceeded as if that was the "more correct" way, as it limited the number of issues it could be. The debug sections went from having many `R_BPF_64_32` relocations to having three `R_BPF_64_64` relocations. But alas, GDB still couldn't read it, and when I `llvm-dwarfdump`'d it, the only significant issue was that some unexpected null bytes were prematurely terminating the `.debug_info` section. Looking at the offsets for the new `R_BPF_64_64` relocations, I was pretty sure stuff that wasn't supposed to be null was being overwritten with null bytes. To confirm this, I ended up going deeper, pulling out the DWARF spec and trying to wrap my head around what it's doing so I could eventually look at hexdumps and see exactly what is being overwritten and where.
- Once I `hexdump`'d the ELFs into text files and manually annotated the `.debug_info` and `.debug_abbrev` sections for both the shared objects and the pre-link relocatable objects (`.debug_abbrev` was the same for both), I not only found exactly what was being malformed, but I also got a much more precise understanding of how DWARF and relocations work.
- [...] `-X +dwarfris` clang flag), I spent the last few days digging around in solana/lld, adding print statements and trying to understand exactly what transformation the `R_BPF_64_64` relocations were performing. At this point I'm pretty sure the problem is that the relocation is being used as an address relocation for addresses in `.debug_info`, but it actually performs a relocation of an `lddw` instruction, which is a bit different from an address.
What I found
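To make the suspected `lddw`-versus-address mismatch concrete: eBPF's `lddw` is a 16-byte instruction whose 64-bit immediate is split across the 32-bit `imm` fields of two 8-byte slots (bytes 4..8 and 12..16), whereas an address in `.debug_info` is 8 contiguous bytes. The sketch below is only a hypothetical model of that difference. The function names are invented here, and whether solana/lld patches `R_BPF_64_64` exactly this way is the suspicion described above, not something confirmed.

```rust
/// Patch `value` as an `lddw` immediate: the low 32 bits go into the `imm`
/// field of the first 8-byte slot (bytes 4..8) and the high 32 bits into the
/// `imm` field of the second slot (bytes 12..16).
/// (Hypothetical model of an instruction-style relocation.)
fn relocate_as_lddw(buf: &mut [u8], offset: usize, value: u64) {
    let lo = (value as u32).to_le_bytes();
    let hi = ((value >> 32) as u32).to_le_bytes();
    buf[offset + 4..offset + 8].copy_from_slice(&lo);
    buf[offset + 12..offset + 16].copy_from_slice(&hi);
}

/// Patch `value` as a plain little-endian 64-bit address (8 contiguous bytes).
/// (Hypothetical model of a data-style relocation.)
fn relocate_as_address(buf: &mut [u8], offset: usize, value: u64) {
    buf[offset..offset + 8].copy_from_slice(&value.to_le_bytes());
}
```

If an 8-byte address field at `offset` is patched with `relocate_as_lddw` and the address fits in 32 bits, the first four bytes of the field are left untouched and four zero bytes land at `offset + 12`, past the end of the field. That would be consistent with null bytes showing up inside the next debugging information entry.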
- [...] `-O2`
- The `-X +dwarfris` flag removes a vast majority of the issues and makes things very simple. I'm pretty sure we should use it, but it may be the case that a fix somewhere else will remove the need for it.
- When the `-X +dwarfris` flag is included, `R_BPF_64_64` relocations are being applied to relocate addresses, not `lddw` instructions, which would cause the issue where a relocation in the first debugging information entry writes null bytes over part of the second debugging information entry in `.debug_info`. If this is the way forward, the mistake could very well be `clang` emitting `R_BPF_64_64` relocations when they should have been something else, though I'm not entirely sure what it should be or even if such a relocation type is defined yet in the BPF ABI. But it could also still be an issue in the linker, where there are other cases to consider when performing / interpreting an `R_BPF_64_64` relocation.
Next steps
- [...] `-X +dwarfris`:
  - [...] `R_BPF_64_64` is the correct relocation type for relocating addresses in `.debug_info`
  - [...] `R_BPF_64_64` relocation type instead of something else
  - [...] `.debug_info` should be doing, if anything
- [...] `-X +dwarfris`:
  - [...] `.debug_info` and `.debug_abbrev` sections of both the shared object and the pre-link relocatable object
  - [...] `R_BPF_32_32` relocations should accomplish, if anything
- [...] `.debug_info` in `rbpf`'s ELF loader instead of the linker
Information about hexdumps
- `dbi` means "dump of `.debug_info`", `dba` means "dump of `.debug_abbrev`".
- `_so` means it's from the shared object, while `_o` means it's from the pre-link relocatable object.
- `ris` is appended to `dbi` or `dba` for dumps that omitted the `-X +dwarfris` flag, though I haven't really spent much time on those.
- The `.debug_info` section is basically a series of Debugging Information Entries (DIEs), which specify 1) a corresponding entry of `.debug_abbrev` (via `abbrev_index`), which basically lists all of the values that DIE is supposed to have, and 2) the values themselves. DIEs can have "children", and a null byte following a DIE indicates the end of a sequence of DIEs at a particular level, so at the top level this indicates the end of the section. You can read more about DIEs in the DWARF spec on page 21.
- [...] `dbi_so` and `dbi_o` side-by-side, as I've manually annotated most of each with respect to `dba_o`, which is identical to `dba_so`.
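As a companion to the DIE description above: in DWARF, each DIE begins with its abbreviation code encoded as an unsigned LEB128 number, and a code of 0 is the "null entry" that ends a run of sibling DIEs. A minimal sketch of reading that first field from a raw `.debug_info` byte slice might look like the following. The names `read_uleb128`, `DieStart`, and `read_die_start` are hypothetical, and a real parser (for example one that follows `.debug_abbrev` to decode the attribute values) needs far more than this.

```rust
/// Decode one unsigned LEB128 value from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or too long to fit in a u64.
fn read_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        // Low 7 bits carry data; the high bit says "more bytes follow".
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// How a DIE starts: either a null entry or a reference into `.debug_abbrev`.
#[derive(Debug, PartialEq)]
enum DieStart {
    Null,
    Abbrev(u64),
}

/// Read the abbreviation code at the start of a DIE.
fn read_die_start(bytes: &[u8]) -> Option<(DieStart, usize)> {
    let (code, len) = read_uleb128(bytes)?;
    let start = if code == 0 {
        DieStart::Null
    } else {
        DieStart::Abbrev(code)
    };
    Some((start, len))
}
```

This is why stray zero bytes written by a bad relocation matter so much: a reader that lands on a `0x00` where it expects an abbreviation code treats it as the end of the current DIE sequence.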