Performance Tools for AI Engineers: strace

Source: Deep Learning on Medium

Example #1: Logging the sequence of events

strace can help illuminate the read pattern within a file for a specific file format.

strace -f -T -o strace.out python benchmark.py
  • -f also traces child processes
  • -T records the timing of each syscall
  • -o saves the output to a text file strace.out
  • python benchmark.py is our test script

>> “What system calls are being issued and in what order?”

Here’s a section of our log file corresponding to a single data read:

‘lseek’ is used to move the file pointer to specific locations within the file, then ‘read’ is used for actually reading data. When the file was opened, file descriptor “3” was returned, so subsequent commands can just refer to the open file with “3”.

The final line is a read of 4992 bytes that took 0.000010 seconds to retrieve one value in the key-value pairs that comprise our HDF5 file.

The prior 5 reads are of metadata (pointers) inside the file. Depending on the file format’s structure, the system may have to consult several internal tree nodes before getting to the data.

>> “What flags are being passed?”

Here’s the top of our log file from when the file was first opened:

In the above example, the ‘/nfs/data.hdf5’ path is opened with the ‘O_RDONLY’ flag, indicating “read only” mode.

A bevy of other flags can impact read performance. For example,

  • ‘O_DIRECT’ in an ‘open’ call means that data will be transferred directly without caching.
  • file locking–‘flock’–flags like ‘LOCK_SH’ (shared lock) and ‘LOCK_EX’ (exclusive lock).

Example #2: Summarizing where time is spent

strace -c dd if=/dev/zero of=/dev/null count=1000
  • -c summarizes the occurrence count and total duration of each system call
  • dd copies data
  • ___ if=/dev/zero is our read-from location that will “read” all zeros
  • ___ of=/dev/null is our write-to location that discards data after the write completes
  • ___ count=1000 is how many blocks will be copied

Output:

We can see that our dd test is balanced: we issued 1000 read and 1000 writes, and we spent roughly 50% of the time in reads and 50% in writes. During deep learning training we’ve seen upwards of 95% of time spent in the read syscall if data loading >> training.