Early Detection of Configuration Errors to Reduce Failure Damage
Paper: Early Detection of Configuration Errors to Reduce Failure Damage
- Defines Latent Configuration (LC) errors: errors that, due to insufficient validation, go undetected until the configuration is actually used. There can be a long gap between loading the configuration (generally in the initialization phase) and actually using it, hence "latent".
- When such configurations are related to reachability, availability or serviceability, LC errors can lead to downtime.
- Two main issues with such configs
- The values are not checked at all, e.g. nothing checks that a file exists
- The values are not checked according to their usage, e.g. the value is used in open(config_value, WRITE) but write access is never checked
- Paper implements a checker based on the static analysis and instrumentation
- Static analysis:
- Taint analysis follows the data flow path from the configuration to its actual usage. Control flow is ignored in most cases to avoid over-tainting
- Along with these instructions, the dependent values are also extracted. E.g. open(config_value, permission) <- here permission is a dependent value
- Any value that cannot be determined is skipped. E.g. a dependent value read from the network
- Instrumentation:
- Code is generated to perform the same checks as the actual usage, but in a “sandboxed” manner
- Any side effect on the program is avoided. E.g. a local copy of a global value is used instead of the actual global value.
- Utilities are written to check the actions performed by some library and system calls.
- This generated code is run right after the initialization phase of the program
- Developers need to annotate two things
- The interface of how configuration values are fetched
- The place where program moves from initialization state to execution
- TOCTOU issues are avoided by adding support to run these checkers regularly in a thread
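The "check according to usage, without side effects" idea can be sketched as below. This is a minimal illustration, not the paper's generated code: the function names `check_writable_path` and `check_config` and the config keys in the usage note are hypothetical. It assumes a config value that is later used in open(config_value, WRITE), and validates write access right after initialization without creating or truncating the file.

```python
import os


def check_writable_path(path):
    """Return None if `path` looks openable for writing, else an error string.

    Side-effect free: the target file is never created or truncated.
    """
    if os.path.isdir(path):
        return f"{path!r} is a directory, not a file"
    if os.path.exists(path):
        if not os.access(path, os.W_OK):
            return f"{path!r} exists but is not writable"
        return None
    # The file does not exist yet: opening it for writing would need a
    # writable parent directory.
    parent = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(parent):
        return f"parent directory {parent!r} does not exist"
    if not os.access(parent, os.W_OK | os.X_OK):
        return f"parent directory {parent!r} is not writable"
    return None


def check_config(config, checks):
    """Run every check once, e.g. right after the initialization phase.

    `checks` maps a config key to a checker that mirrors its later usage.
    Returns a dict of key -> error message for every failing value.
    """
    errors = {}
    for key, check in checks.items():
        if key in config:
            error = check(config[key])
            if error is not None:
                errors[key] = error
    return errors
```

Calling `check_config(config, {"log_file": check_writable_path})` at the annotated end of initialization would surface a bad path immediately rather than at first use. Running the same call periodically in a background thread would correspond to the TOCTOU mitigation above.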
Efficient Scalable Thread-Safety-Violation Detection
- Existing solutions
- Static or dynamic analysis to identify the potential buggy locations to inject delays.
- Injects small number of delays but large analysis time
- Inject probabilistic random delays
- Inject large number of delays but small analysis time
- TSVD tries to find the middle ground
- TSVD employs two techniques to select the points to inject delays
- Near miss tracking
- Happens before relationship identification
- Near miss tracking
- Identify two operations on the same thread-unsafe object, at least one of which is a write, that happen close to each other in time on different threads
- If the time difference falls within the threshold, mark it as dangerous pair
- Happens Before relationship identification
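- If adding a delay at location 1 also delays the execution of location 2, location 1 likely happens before location 2, so the pair cannot race and is not treated as dangerous
- Delay is injected at all remaining dangerous pairs
- The delay probability is decayed each time a pair does not trigger an error
- Once the delay probability drops to 0, the pair is removed from the dangerous pair list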
- Built to support .NET projects
- Instrumentation and Runtime library
- Evaluation
- Why some bugs were missed?
- Two operations are close to each other only on some rare executions
- False positives in the happens-before inference
- Delay injection was not sufficient to capture the bugs
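Near miss tracking and delay decay, as described in the notes above, can be sketched roughly as follows. This is an illustrative Python model, not TSVD's .NET runtime library: the class name, the window and decay constants, and the small probability floor used in place of "drops to 0" are all assumptions.

```python
NEAR_MISS_WINDOW = 0.1   # seconds; hypothetical near-miss threshold
DECAY_FACTOR = 0.5       # hypothetical decay applied after a fruitless delay
MIN_PROBABILITY = 0.01   # below this the pair is treated as "dropped to 0"


class NearMissTracker:
    def __init__(self):
        self.recent = {}     # object -> [(thread, location, is_write, time)]
        self.dangerous = {}  # (location_a, location_b) -> delay probability

    def on_access(self, obj, thread, location, is_write, now):
        """Record an access and mark near-miss pairs as dangerous."""
        # Keep only accesses that are still inside the near-miss window.
        history = [a for a in self.recent.get(obj, [])
                   if now - a[3] <= NEAR_MISS_WINDOW]
        for other_thread, other_location, other_is_write, _ in history:
            # Conflicting pair: different threads, at least one write.
            if other_thread != thread and (is_write or other_is_write):
                pair = tuple(sorted((other_location, location)))
                self.dangerous.setdefault(pair, 1.0)
        history.append((thread, location, is_write, now))
        self.recent[obj] = history

    def on_delay_without_violation(self, pair):
        """Decay the delay probability; drop the pair once it is negligible."""
        if pair not in self.dangerous:
            return
        probability = self.dangerous[pair] * DECAY_FACTOR
        if probability < MIN_PROBABILITY:
            del self.dangerous[pair]
        else:
            self.dangerous[pair] = probability
```

An instrumented program would call `on_access` at every call into a thread-unsafe API and consult `dangerous` to decide where to inject a delay. Happens-before inference, which would prune pairs from `dangerous`, is left out of this sketch.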
kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels
Paper: kAFL
- Feedback fuzzing of closed source kernel mode components
- Feedback using hardware capabilities
- Intel PT
- Q: What does it give?
- Challenges with kernel fuzzing
- Lots of states
- Interrupts and threads
- No straightforward way to “invoke” the kernel
- Technical details
- x86-64: Kernel and user space are split into halves
- Total virtual address space: 2^48
- Why?
- Each half gets 2^47
- Switching from user to kernel mode on a syscall does not switch page tables
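The 2^48 / 2^47 split above can be checked with a little arithmetic. The sketch below assumes 4-level paging with 48-bit virtual addresses, where bits 47 through 63 of an address must all be equal (the "canonical" form), and the common OS convention of user space in the lower half and kernel in the upper half. The helper name `classify` is made up for illustration.

```python
VA_BITS = 48  # implemented virtual address bits (assumption: 4-level paging)

# Lower canonical half: bit 47 and everything above it are 0.
USER_MAX = (1 << (VA_BITS - 1)) - 1
# Upper canonical half: bit 47 and everything above it are 1.
KERNEL_MIN = (1 << 64) - (1 << (VA_BITS - 1))


def classify(addr):
    """Classify a 64-bit virtual address as user, kernel, or non-canonical."""
    if 0 <= addr <= USER_MAX:
        return "user"
    if KERNEL_MIN <= addr < (1 << 64):
        return "kernel"
    return "non-canonical"
```

Each half spans 2^47 addresses, which together give the 2^48 total. A range check like this is also the kind of test an IP-range filter restricted to kernel code relies on.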
- Intel Processor Trace
- Three packet types
- Taken-Not-Taken: For conditional jumps, tell if a branch is taken or not
- Target-IP: Indirect jumps, target IP
- Flow Update Packets: Interrupts and async events.
- Filters can be added to these
- IP range
- Privilege level / ring
- CR3 filter: Only when the cr3 value matches. Helps in filtering per process
- System Design
- Components
- Host user space process: kAFL
- QEMU-PT + KVM-PT for getting the processor trace from guest
- Usermode agent in the target OS
- Setup:
- Agent performs a hypercall to provide the address of the kernel panic handler
- Host patches this handler to get feedback on a crash
- Instead of waiting for the timeout
- Then CR3 is exchanged from agent to host
- This is used to set the filter
- Then a shared memory address is exchanged where the agent expects the input for fuzzing
- Fuzzing loop starts
- While fuzzing is being performed, the QEMU-PT decodes the trace
- When the agent is done, it sends a hypercall (hc_finished).
- On this VM-Exit, it stops tracing
- Fuzzing logic
- This is the core component and works similarly to AFL
- Also runs fuzzing in parallel
- Most fuzzing is not CPU bound, so this helps
- User mode agent
- Broken into loader and agent
- The loader can run an arbitrary program as the agent, which makes it easier to reuse the setup
- The loader also checks whether the agent crashed so that it can restart it
- KVM-PT
- Traces a virtual CPU (vCPU) instead of the logical CPU
- By enabling on vm-entry and disabling on vm-exit
- QEMU-PT
- QEMU-PT also filters the stream of executed addresses—based on previous knowledge of non-deterministic basic blocks—to prevent false-positive fuzzing results, and makes those available to the fuzzing logic as AFL-compatible bitmaps
- ???
- Also cache the disassembly results to speed up populating the bitmap
- Stateful and non deterministic
- Interrupts generate non-deterministic executions
- So the fuzzer runs the program multiple times and identifies such basic blocks
- Adds it to blacklist
- This is ignored when updating the coverage map
- Hypercalls
- Accessible from ring3
- So add custom hypercalls that can help in fuzzing
- Eg: crash, ask for input
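The handling of non-deterministic basic blocks described above (run the same input several times, blacklist blocks that do not show up consistently, ignore them in the coverage map) can be sketched as follows. This is only an illustration: `run_target` is a hypothetical callback that returns the basic-block addresses seen in one execution, and coverage is modelled as a plain set rather than an AFL bitmap.

```python
def find_nondeterministic_blocks(run_target, payload, repetitions=4):
    """Execute the same payload several times and return the basic blocks
    that appear in some runs but not in all of them."""
    traces = [frozenset(run_target(payload)) for _ in range(repetitions)]
    seen_in_any = set().union(*traces)
    seen_in_all = set(traces[0]).intersection(*traces[1:])
    return seen_in_any - seen_in_all


def update_coverage(coverage, trace, blacklist):
    """Add newly seen, non-blacklisted blocks to `coverage`.

    Returns True if the trace contributed new coverage, i.e. the input
    would be considered interesting.
    """
    new_blocks = {bb for bb in trace
                  if bb not in blacklist and bb not in coverage}
    coverage |= new_blocks
    return bool(new_blocks)
```

With the blacklist in place, a block that only appears because an interrupt fired mid-run no longer makes an input look like it found new coverage.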
- KVM-PT
- vCPU specific traces
- MSR autoload feature lets you load MSRs on exit or entry
- Continuous tracing
- Uses ToPA
- Table of Physical Addresses
- Each address is associated with behavior on overflow
- First -> interrupt
- Second -> Stop tracing
- But keep it large enough for this to never happen
- On overflow, the interrupt fires and forces a switch out of the VM
- The buffer is cleared and execution switches back to the VM
- QEMU-PT
- Userspace application to interact with KVM-PT
- Controls when tracing starts and stops
- Also decodes the trace to generate an AFL map
- Our Intel PT software decoder acts like a just-in-time decoder, which means that code sections are only considered if they are executed according to the decoded trace data
- ???
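For the "AFL map" mentioned above, a rough model of how AFL-style edge coverage is computed from a sequence of executed blocks is shown below. It is a simplified sketch and not QEMU-PT's decoder: block identifiers are assumed to be small integers already derived from the decoded trace, and hit counts saturate at 255 here, whereas AFL's byte counters wrap around.

```python
MAP_SIZE = 1 << 16  # AFL's default bitmap size (64 KiB)


def edges_to_bitmap(block_ids, map_size=MAP_SIZE):
    """Fold a sequence of executed block ids into an AFL-style edge bitmap."""
    bitmap = bytearray(map_size)
    prev = 0
    for cur in block_ids:
        # An edge is identified by XOR-ing the current block with the
        # shifted previous block, so A->B and B->A land in different slots.
        index = (cur ^ prev) % map_size
        bitmap[index] = min(bitmap[index] + 1, 255)
        prev = cur >> 1
    return bitmap
```

The "just-in-time" aspect in the quote plausibly refers to only disassembling code that the trace shows was reached, which would also explain why caching disassembly results speeds up filling this map.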
- Discussion
- OS specific code
- Not a necessity but improves fuzzing (cr3 value, custom process to test kernel)
- Kernel JIT
- Out of scope
- But very interesting
- Intel PT does not give all the instruction pointers, so the decoder needs the executable to decode the trace
- Becomes tricky
Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures
Paper: Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures
- Software failure recovery to make software more available
- Makes use of Checkpointing and Rollback to revert to an older state
- Then makes some environmental changes and continues the execution of the application.
- If none of the changes work, it goes back one more checkpoint and retries
- Components
- Proxy: Separates client and server interactions and helps in the saving and replay of requests upon re-execution.
- Sensors: Identifies when there is an error in the application using exceptions, interrupts etc.
- Checkpointing and Rollback
- Based on: Flashback
- Deletes the oldest checkpoints based on strategies.
- Environmental wrappers: For modifying environment during re-execution
- Memory allocation wrappers: eg: zero fill, add padding
- Scheduling wrapper to change the unit of time for scheduling
- User request dropping
- Control unit: Coordinates with all the components
- Also provides useful information for the programmer to diagnose and fix errors.
- Tested on Squid, Apache, CVS, MySQL
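The rollback and retry loop described above can be sketched as below. This is a toy model under stated assumptions, not Rx's implementation: `rx_execute`, the `(snapshot, log_index)` checkpoint format, and the dict-based environment descriptions are invented for illustration. Real Rx checkpoints whole processes via Flashback, and the proxy replays network requests.

```python
import copy


def rx_execute(checkpoints, log, handler, environments):
    """Try to survive a failure by rollback plus environmental changes.

    checkpoints:  list of (state_snapshot, log_index), oldest first.
    log:          requests recorded by the proxy; the last one failed.
    handler:      handler(state, request, env) mutates state, raises on failure.
    environments: environmental changes to try, e.g. {"padding": 8}.

    Returns (state, env) for the first successful re-execution, else None.
    """
    for snapshot, start in reversed(checkpoints):   # newest checkpoint first
        for env in environments:
            state = copy.deepcopy(snapshot)         # roll back to the checkpoint
            try:
                for request in log[start:]:         # proxy replays the requests
                    handler(state, request, env)
            except Exception:
                continue                            # try the next change
            return state, env
        # No change worked from this checkpoint: go back one more.
    return None
```

The environment that made the re-execution succeed is returned alongside the state, which mirrors the notes' point that Rx gives the programmer useful diagnostic information, for example "padding allocations avoided the failure" hints at a buffer overflow.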
My development setup
Windows
- Lenovo Legion Y540
- i7 processor with 8 cores
- 24 GB RAM
- Windows 11 Pro
- WSL with Ubuntu 20
- Ubuntu 22 VM running on Hyper-V
- Mostly accessed using X-Forwarding to WSL
MacBook
- M3 Air with 16 GB + 512 GB
- This is a new machine
Tools
- tmux
- VS Code
- With VIM extension
- zsh with Oh My Zsh
- Terminal
- Windows Terminal on Windows
- Default terminal app on Mac
Other applications
- Notability for reading papers and taking notes on iPad
- Obsidian for note taking on laptop
- Vault on iCloud to sync across devices
- Slack and Discord for messaging
- Microsoft Edge as a browser
- Google Scholar PDF reader extension for reading papers. Provides good navigation support for links.
- 1Password for password management