Early Detection of Configuration Errors to Reduce Failure Damage


Paper: Early Detection of Configuration Errors to Reduce Failure Damage

  • Defines Latent Configuration (LC) errors: errors caused by insufficient validation of configuration values, which stay hidden until the configuration is actually used
    • There can be a long gap between loading the configuration, generally in the initialization phase, and actually using it (hence latent).
  • When such configurations are related to reachability, availability or serviceability, LC errors can lead to downtime.
  • Two main issues with such configs
    • The values are not checked at all. Eg: no check that a configured file exists
    • The values are not checked according to their usage. Eg: the value is used in open(config_value, WRITE), so it should be checked for writability
  • The paper implements a checker based on static analysis and instrumentation
  • Static analysis:
    • Taint analysis to go from the configuration to the actual usage along the data flow path. Control flow is ignored in most cases to avoid over tainting
    • Along with these instructions, the dependent values are also extracted. Eg: open(config_value, permission) <- here permission is a dependent value
    • Any value that cannot be determined is skipped. Eg: a dependent value read from the network
  • Instrumentation:
    • Code is generated to perform the same checks as the actual usage, but in a “sandboxed” manner
    • Here any side effect on the program is avoided. Eg: a local copy of global value is used instead of the actual global value.
    • Utilities are written to check the actions performed by some library and system calls.
  • This generated code is run right after the initialization phase of the program
    • The developer needs to annotate two things
      • The interface of how configuration values are fetched
      • The place where the program moves from initialization to execution
  • TOCTOU issues are avoided by adding support to run these checkers regularly in a thread
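
To make the idea concrete, here is a minimal sketch of what usage-aware checkers could look like when run right after initialization. It is not the paper's generated code: the function names, the `CHECKERS` table, and the `log_file` / `data_file` keys are all hypothetical, and the checks are hand-written rather than derived by static analysis. The point is only that each check mirrors the later usage (write vs. read) without causing side effects.

```python
import os


def check_writable_path(path):
    """Side-effect-free stand-in for a later open(path, 'w').

    Returns an error message, or None if the later open is expected to work.
    Nothing is created or truncated here.
    """
    if os.path.isdir(path):
        return f"{path!r} is a directory, not a file"
    if os.path.exists(path):
        if not os.access(path, os.W_OK):
            return f"{path!r} exists but is not writable"
        return None
    # File does not exist yet: the parent directory must allow creating it.
    parent = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(parent):
        return f"parent directory {parent!r} does not exist"
    if not os.access(parent, os.W_OK | os.X_OK):
        return f"parent directory {parent!r} is not writable"
    return None


def check_readable_file(path):
    """Side-effect-free stand-in for a later open(path, 'r')."""
    if not os.path.isfile(path):
        return f"{path!r} is not an existing file"
    if not os.access(path, os.R_OK):
        return f"{path!r} is not readable"
    return None


# Hypothetical mapping from config key to the check matching its usage.
CHECKERS = {
    "log_file": check_writable_path,
    "data_file": check_readable_file,
}


def run_checkers(config, checkers):
    """Run every applicable checker and collect error messages by key."""
    errors = {}
    for key, checker in checkers.items():
        if key in config:
            message = checker(config[key])
            if message is not None:
                errors[key] = message
    return errors
```

Calling `run_checkers(config, CHECKERS)` at the annotated end of initialization, and again periodically from a background thread, would correspond to the two places the notes above say the checks run.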

Efficient Scalable Thread-Safety-Violation Detection


Paper: Efficient Scalable Thread-Safety-Violation Detection

  • Existing solutions
    • Static or dynamic analysis to identify the potential buggy locations to inject delays.
      • Injects a small number of delays but needs a large analysis time
    • Inject probabilistic random delays
      • Injects a large number of delays but needs little analysis time
    • TSVD tries to find the middle ground
  • TSVD employs two techniques to select the points to inject delays
    • Near miss tracking
    • Happens before relationship identification
  • Near miss tracking
    • Identify two operations on a thread-unsafe object, at least one of which is a write, that happen close to each other in time on different threads
    • If the time difference falls within the threshold, mark them as a dangerous pair
  • Happens Before relationship identification
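    • If adding a delay at location 1 also delays the execution of location 2, the two are inferred to be ordered by a happens-before relationship, so the pair is not treated as dangerous
  • Delays are injected at all dangerous pairs
    • The delay probability is decayed if a pair does not trigger an error
    • Once the probability of delay drops to 0, the pair is removed from the dangerous pair list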
  • Built to support .NET projects
    • Instrumentation and Runtime library
  • Evaluation
    • Why were some bugs missed?
      • Two operations are close to each other only on some rare executions
      • False positives in happens-before inference
      • Delay injection was not sufficient to capture the bugs
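
The near-miss tracking and delay decay described above can be sketched roughly as follows. This is an illustrative toy, not TSVD's .NET runtime: `Access`, `NearMissTracker`, the explicit timestamps, and the subtractive `decay_step` are all assumptions made for the example, and happens-before inference is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread_id: int
    location: str      # source location of the call, e.g. "Cache.cs:42"
    is_write: bool
    timestamp: float   # seconds; passed in explicitly to keep the sketch deterministic


class NearMissTracker:
    def __init__(self, threshold=0.1, decay_step=0.5):
        self.threshold = threshold
        self.decay_step = decay_step
        self.recent = {}             # object id -> accesses within the threshold window
        self.delay_probability = {}  # (loc_a, loc_b) -> probability of injecting a delay

    def record(self, obj_id, access):
        """Record an access and flag near misses against recent accesses."""
        window = self.recent.get(obj_id, [])
        for prev in window:
            near_miss = (
                prev.thread_id != access.thread_id
                and (prev.is_write or access.is_write)
                and abs(access.timestamp - prev.timestamp) <= self.threshold
            )
            if near_miss:
                pair = tuple(sorted((prev.location, access.location)))
                self.delay_probability.setdefault(pair, 1.0)
        window.append(access)
        # Keep only accesses that can still form a near miss with later ones.
        self.recent[obj_id] = [
            a for a in window if access.timestamp - a.timestamp <= self.threshold
        ]

    def report_no_violation(self, pair):
        """Decay a pair after a delay that exposed no bug; drop it at zero."""
        remaining = self.delay_probability[pair] - self.decay_step
        if remaining <= 0:
            del self.delay_probability[pair]
        else:
            self.delay_probability[pair] = remaining
```

A pair that keeps failing to produce a violation loses probability on every attempt and eventually leaves the dangerous pair list, which is what keeps the number of injected delays bounded.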

kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels


Paper: kAFL

  • Feedback fuzzing of closed source kernel mode components
  • Feedback using hardware capabilities
    • Intel PT
      • Q: What does it give?
  • Challenges with kernel fuzzing
    • Lots of states
    • Interrupts and threads
    • No straightforward way to “invoke” the kernel
  • Technical details
    • x86-64: the virtual address space is split into kernel and userspace halves
      • Total virtual address space: 2^48
        • Why?
      • Each gets 2^47
      • Switching from user to kernel on a syscall does not switch the page table
  • Intel Processor Trace
    • Three packet types
      • Taken-Not-Taken: For conditional jumps, tell if a branch is taken or not
      • Target-IP: Indirect jumps, target IP
      • Flow Update Packets: Interrupts and async events.
    • Filters can be added to these
      • IP range
      • Privilege level / ring
      • CR3 filter: Only when the cr3 value matches. Helps in filtering per process
  • System Design
    • Components
      • Host user space process: kAFL
      • QEMU-PT + KVM-PT for getting the processor trace from guest
      • Usermode agent in the target OS
    • Setup:
      • Agent performs a hypercall to provide kernel panic handler
      • Host patches this to get the feedback on crash
        • Instead of waiting for the timeout
        • Then CR3 is exchanged from agent to host
          • This is used to set the filter
        • Then a shared memory address is exchanged where the agent expects the input for fuzzing
        • Fuzzing loop starts
        • While fuzzing is being performed, the QEMU-PT decodes the trace
        • When the agent is done, it sends a hypercall (hc_finished).
          • On this VM-Exit, it stops tracing
    • Fuzzing logic
      • This is the core component and works similarly to AFL
      • Also runs fuzzing in parallel
        • Most fuzzing is not CPU bound, so this helps
    • User mode agent
      • Broken into loader and agent
      • The loader accepts an arbitrary agent program, which makes it easier to swap in different agents
      • The loader also checks whether the agent crashed so that it can restart it
    • KVM-PT
      • This helps in tracing a virtual CPU instead of a logical CPU
      • By enabling tracing on VM-entry and disabling it on VM-exit
    • QEMU-PT
      • QEMU-PT also filters the stream of executed addresses—based on previous knowledge of non-deterministic basic blocks—to prevent false-positive fuzzing results, and makes those available to the fuzzing logic as AFL-compatible bitmaps
      • ???
    • Also cache the disassembly results to speed up populating the bitmap
    • Stateful and non deterministic
      • Interrupts generate non-deterministic executions
      • So the fuzzer runs the program multiple times and identifies such basic blocks
      • Adds them to a blacklist
      • Blacklisted blocks are ignored when updating the coverage map
    • Hypercalls
      • Accessible from ring3
      • So add custom hypercalls that can help in fuzzing
        • Eg: crash, ask for input
  • KVM-PT
    • vCPU specific traces
      • MSR autoload feature lets you load MSRs on exit or entry
    • Continuous tracing
      • Uses ToPA
        • Table of Physical Addresses
        • Each address is associated with behavior on overflow
          • First -> interrupt
          • Second -> Stop tracing
            • But keep it large enough for this to never happen
      • On overflow, the interrupt is triggered, which results in a VM exit
      • The buffer is cleared and execution switches back to the VM
  • QEMU-PT
    • Userspace application to interact with KVM-PT
    • Controls when to start and stop tracing
    • Also decodes the trace to generate an AFL bitmap
    • Our Intel PT software decoder acts like a just-in-time decoder, which means that code sections are only considered if they are executed according to the decoded trace data
      • ???
  • Discussion
    • OS specific code
      • Not a necessity but improves fuzzing (cr3 value, custom process to test kernel)
    • Kernel JIT
      • Out of scope
      • But very interesting
      • Intel PT does not give all the instruction pointers, so the decoder needs the executable code to reconstruct the control flow
        • Becomes tricky
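
As a rough illustration of the coverage path in these notes (decoded addresses, blacklist of non-deterministic blocks, AFL-style bitmap), here is a small sketch. It is not kAFL's decoder: the traces are plain lists of basic-block addresses, and `hash_addr`, `update_bitmap`, and `find_nondeterministic_blocks` are hypothetical names. The address mixing function is an assumption chosen for the example; only the edge update (`cur ^ prev`, then `prev = cur >> 1`) follows AFL's well-known scheme.

```python
MAP_SIZE = 1 << 16  # AFL's default bitmap size (64 KiB)


def hash_addr(addr):
    """Fold a basic-block address into a bitmap-sized id (illustrative mix)."""
    return ((addr >> 4) ^ (addr << 8)) & (MAP_SIZE - 1)


def update_bitmap(bitmap, trace, blacklist=frozenset()):
    """Record AFL-style edge hits for one run, skipping blacklisted blocks."""
    prev = 0
    for addr in trace:
        if addr in blacklist:
            continue  # non-deterministic block: pretend it never executed
        cur = hash_addr(addr)
        index = cur ^ prev
        bitmap[index] = min(bitmap[index] + 1, 255)  # saturating 8-bit counter
        prev = cur >> 1  # shift so that A->B and B->A map to different edges
    return bitmap


def find_nondeterministic_blocks(traces):
    """Blocks that show up in some, but not all, runs of the same input."""
    seen = [set(trace) for trace in traces]
    return set.union(*seen) - set.intersection(*seen)


def has_new_coverage(global_bitmap, run_bitmap):
    """True if the run touched any edge the global bitmap has not seen."""
    return any(r and not g for g, r in zip(global_bitmap, run_bitmap))
```

Running the same input several times, passing the traces to `find_nondeterministic_blocks`, and feeding the result as `blacklist` makes repeated runs of that input produce identical bitmaps even when an interrupt handler fires in only some of them.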

Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures


Paper: Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures

  • Software failure recovery to make software more available
  • Makes use of Checkpointing and Rollback to revert to an older state
  • Then makes some environmental changes and continues the execution of the application.
    • If none of the changes work, it goes back one more checkpoint and retries
  • Components
    • Proxy: Separates client and server interactions and helps in the saving and replay of requests upon re-execution.
    • Sensors: Identify when there is an error in the application using exceptions, interrupts etc.
    • Checkpointing and Rollback
      • Based on: Flashback
      • Deletes the oldest checkpoint based on strategies.
    • Environmental wrappers: For modifying environment during re-execution
      • Memory allocation wrappers: eg: zero fill, add padding
      • Scheduling wrapper to change the unit of time for scheduling
      • User request dropping
    • Control unit: Coordinates with all the components
      • Also provides useful information for the programmer to diagnose and fix errors.
  • Tested on Squid, Apache, CVS, MySQL
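
The rollback-and-retry loop at the heart of the notes above can be sketched in a few lines. This is a toy model only: Rx checkpoints whole processes, whereas here a checkpoint is just a copied Python dict, and `survive`, `handle_request`, and the `padding` environment change are hypothetical names invented for the example.

```python
import copy


def survive(run, checkpoints, env_changes):
    """Re-execute from the newest checkpoint backwards under changed environments.

    Returns (result, env_change) for the first combination that succeeds.
    """
    for checkpoint in reversed(checkpoints):   # newest checkpoint first
        for change in env_changes:
            state = copy.deepcopy(checkpoint)  # roll back to the saved state
            try:
                return run(state, change), change
            except Exception:
                continue                       # this change did not help
        # No change worked from this checkpoint: fall back one checkpoint.
    raise RuntimeError("no environmental change survived the failure")


def handle_request(state, env):
    """Toy request handler with an overflow bug, used to exercise survive()."""
    buffer = [0] * (state["buf_size"] + env.get("padding", 0))
    for i in range(state["request_len"]):
        buffer[i] = 1  # raises IndexError when the request overruns the buffer
    return sum(buffer)
```

With a checkpoint such as `{"buf_size": 4, "request_len": 6}` and environment changes `[{}, {"padding": 8}]`, the unchanged environment fails again and the padded allocation lets the request complete, which mirrors the memory allocation wrapper listed above.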

My development setup


Windows

  • Lenovo Legion Y540
    • i7 processor with 8 cores
    • 24 GB RAM
  • Windows 11 Pro
  • WSL with Ubuntu 20
  • Ubuntu 22 VM running on Hyper-V
    • Mostly accessed using X-Forwarding to WSL

MacBook

  • M3 Air with 16 GB RAM + 512 GB storage
  • This is a new machine

Tools

  • tmux
  • VS Code
    • With VIM extension
  • zsh with Oh My Zsh
  • Terminal
    • Windows Terminal on Windows
    • Default terminal app on Mac

Other applications

  • Notability for reading papers and taking notes on iPad
  • Obsidian for note taking on laptop
    • Vault on iCloud to sync across devices
  • Slack and Discord for messaging
  • Microsoft Edge as a browser
    • Google Scholar PDF reader extension for reading papers. Provides good navigation support for links.
  • 1Password for password management