Source of data

The data currently shown comes from supercomputers at TACC (the Texas Advanced Computing Center). It was collected by a project called Lariat between April 2012 and January 2013, and is displayed under the following assumptions:

  • We assume executables with the same name are the same code. We ignore the path to the executable. We have checksum and version information, but are currently ignoring it for simplicity.
  • We ignore any executable used by fewer than five distinct users.
  • We ignore executables named "a.out" or named after any of a short list of standard POSIX commands.
  • We associate executable names with well-known scientific software packages by simple string matching: for example, any executable whose name contains "vasp" as a substring is assumed to be part of VASP. For some packages we match against a list of strings, taken from the lists of executables mentioned on the packages' web sites. For example, executables containing the strings "gipaw" or "tddfpt" are known to be part of Quantum Espresso. Please let us know about any we missed.
  • Two executables are considered to be "used together" if they are run in the same job on the same day.
  • The Lariat data also has information about which libraries (things like FFTW and PGI) are used by which executables. We are currently omitting this information because almost all applications use them.
  • Publications are associated with jobs by scraping data from the XSEDE portal (https://portal.xsede.org/publications#/show), where researchers associate publications with their projects. The portal uses the same allocation identifiers (of the form TG-XXXNNNNNNN) that the Lariat files use to classify jobs.
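
To make the filtering and matching rules above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code behind this site: the record layout (user, job, date, executable path), the function names, the contents of IGNORED_NAMES, and the choice to match case-insensitively are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical stand-in for "a.out" plus the short list of POSIX commands.
IGNORED_NAMES = {"a.out", "ls", "cp", "mv", "cat", "rm"}

# Substrings taken from the examples in the text; a real table would be longer.
PACKAGE_PATTERNS = {
    "VASP": ["vasp"],
    "Quantum Espresso": ["gipaw", "tddfpt"],
}

MIN_USERS = 5  # executables with fewer distinct users are ignored


def exe_name(path):
    """Drop the directory part so that only the executable name is compared."""
    return path.rsplit("/", 1)[-1]


def package_for(name):
    """Return the first package whose substring occurs in the name, else None.

    Lowercasing the name first is an assumption of this sketch.
    """
    lowered = name.lower()
    for package, needles in PACKAGE_PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return package
    return None


def kept_executables(records):
    """Names that are not on the ignore list and have at least MIN_USERS users."""
    users = defaultdict(set)
    for user, _job, _date, path in records:
        name = exe_name(path)
        if name not in IGNORED_NAMES:
            users[name].add(user)
    return {name for name, who in users.items() if len(who) >= MIN_USERS}


def used_together(records, kept):
    """Count pairs of kept executables run in the same job on the same day."""
    groups = defaultdict(set)
    for _user, job, date, path in records:
        name = exe_name(path)
        if name in kept:
            groups[(job, date)].add(name)
    pairs = defaultdict(int)
    for names in groups.values():
        for a, b in combinations(sorted(names), 2):
            pairs[(a, b)] += 1
    return dict(pairs)
```

Changing MIN_USERS, or the key used to group runs in used_together, is one way to see how the results would look under different assumptions.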

Some of these choices are arbitrary. If you are curious about how the graphs and statistics would look under different assumptions, let me know (Chris Bogart: cbogart at cs dot cmu dot edu) and I'd be glad to find useful ways of interpreting and displaying this data.

Other data sources

If you have a scientific software package or suite that you would like to track using this tool, let us know. Beyond tracking codes on a single supercomputer, we are interested in helping you to track your software's usage, even if it is widely deployed at individual researchers' sites.