
Some Questions of Interest

These are questions discussed at the Sep 27 2012 SSG faculty meeting in response to the recent Lustre filesystem crash on the Cheaha cluster.

What is the nature of the Lustre storage issue and what is being done about it?
  • Excerpt from John-Paul's Sep 25 2012 e-mail:

    The investigation over the weekend and yesterday into the Lustre failure has identified the source of the problem as an errant system install procedure on the backup metadata server. This process wrote incorrect data to the metadata database and caused a loss of some portion of the metadata database. When Lustre noticed this corruption it went off-line. We are now investigating the extent of the damage and determining effective recovery procedures.

    An analogy may help clarify what this means and provide insights into the recovery process.

    Lustre is a high performance, distributed file system. File systems, or rather, "filing systems", organize information in a variety of ways in order to address the needs of the situations to which they are applied. In other words, to address the requirements of their applications. In order to provide efficient access to many files, Lustre stores files in two parts: the information contained in a file (the data) is stored in a dedicated set of databases known as object stores; information about a file is stored in a separate database known as the metadata service. This descriptive information includes the name of the file, access controls, and other organizational information, such as which object stores contain the data.

    This filing system is much like a library that stores many books on many shelves; however, in order to increase search speed and organizational flexibility, the covers of the books have been removed and put in a sorted, easy-to-browse bin. The book covers represent the metadata in the Lustre filing system, and the content of the books on the shelves represents our data in the Lustre filing system. Along with each book cover, the library stores a special number that describes on which shelf (or shelves) to find the content of a book. This all works very efficiently so long as all the information (covers and content) remains intact.

    The predicament we are in is that the easy-to-browse bin (the metadata) has been overturned, thrown into a pile on the floor, and some percentage of the book covers have likely been destroyed. The shelves which actually contain the content of the books (our data), however, appear to be undisturbed and remain intact. We are now faced with the chore of identifying what percentage of book covers were lost and re-constructing the easy-to-browse bin (the metadata) for the book covers that remain.

    Today we are assessing the damage to the metadata and determining how we can re-represent the metadata to you (the book covers) so you can easily identify your data (the book content). We are also investigating how we might represent the data for which the metadata has been lost in a way that you can inspect the content and, hopefully, re-label the data.

  • Excerpt from John-Paul's Sep 26 2012 e-mail:

    We are moving forward with a recovery plan developed with the Lustre support professionals. This process is estimated to take between one and three weeks, depending on how quickly the metadata can be reassembled.

    Preliminary analysis indicates only a small percentage of the metadata may have been lost; however, a firm understanding won't come until a later stage of the recovery process.
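
  To make the library analogy above concrete, here is a toy sketch (plain Python, not Lustre code; every name in it is made up) of a filing system that keeps file contents in "object stores" and file names in a separate metadata table. Losing a metadata entry leaves the bytes intact but anonymous, which is exactly the book-covers-in-a-pile situation described above.

    # Toy model of a data/metadata split -- illustrative only, not Lustre internals.
    object_store = {
        "obj-001": b"ATCGATCG...",            # contents of some sequence file
        "obj-002": b"sample,pheno\n1,2.3\n",  # contents of some phenotype table
    }
    metadata = {
        "project/reads.fa":  {"owner": "alice", "objects": ["obj-001"]},
        "project/pheno.csv": {"owner": "bob",   "objects": ["obj-002"]},
    }

    def read_file(path):
        """Look up a file by name (the 'book cover'), then fetch its bytes (the 'book')."""
        entry = metadata[path]
        return b"".join(object_store[obj] for obj in entry["objects"])

    print(read_file("project/pheno.csv"))

    # Simulate the crash: one metadata entry is lost.
    del metadata["project/reads.fa"]
    # The bytes in object_store["obj-001"] still exist, but nothing now records
    # that they belong to project/reads.fa, so they can no longer be found by name.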

Is there a general impression that scratch data on the Cheaha cluster is backed up automatically? If yes, what can we do moving forward to clarify this misconception and minimize its potentially serious consequences?
  • Lack of documentation?
    • Existing docs: is this level of detail sufficient for the average end user?
      • http://www.ssg.uab.edu/wiki/x/mwEv, HPC space on SSG wiki
        • Especially see the highlighted note regarding scratch environment variables under the UAB Beowulf clusters heading, and the links to the ME wiki where $UABGRID_SCRATCH and $UABGRID_PROJECT are described in more detail. (A usage sketch follows this list.)
      • http://www.ssg.uab.edu/wiki/x/AoTn, Data Management FAQ page under HPC space
  • Insufficient education and training opportunities?
    • Yearly UAB HPC bootcamp, http://www.soph.uab.edu/ssg/courses/hpcbootcamp
      • Accompanied by SSG-specific hands-on lab
    • On average, one journal or grant writing club a year on reproducible research topics (including version control).
    • Ad hoc one-on-one tutorials and training provided by SSG programmers
    • Do we need more mandatory education/training? If yes, should it cover version control, big data management in general, or both? Something else?
  • End user misconception that "high-performance" means "high-availability"?
    • Think race car: built for speed, not for availability.
  • Non-optimal choice of name for environment variable $UABGRID_PROJECT?
    • Would it be helpful to rename the $UABGRID_PROJECT variable to $UABGRID_SHARED_SCRATCH?
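
  As a practical reminder of what these variables mean for day-to-day work, here is a minimal job-script sketch in Python (the directory and file names are placeholders): large intermediates go to the fast but unbacked-up scratch area, and anything worth keeping must be copied by the user to backed-up storage.

    import os
    import shutil

    # $UABGRID_SCRATCH and $UABGRID_PROJECT point at Lustre scratch areas
    # (see the wiki pages above); neither is backed up.
    scratch = os.environ["UABGRID_SCRATCH"]
    shared_scratch = os.environ.get("UABGRID_PROJECT")  # shared scratch, may be unset

    work_dir = os.path.join(scratch, "myanalysis")      # hypothetical job directory
    os.makedirs(work_dir, exist_ok=True)

    # ... the analysis writes its large intermediate files into work_dir ...

    # Anything you cannot afford to lose must be moved off scratch by *you*,
    # e.g. into your home directory or a version-controlled checkout.
    result = os.path.join(work_dir, "results_summary.csv")  # placeholder name
    safe_dir = os.path.expanduser("~/results/myanalysis")   # placeholder destination
    os.makedirs(safe_dir, exist_ok=True)
    if os.path.exists(result):
        shutil.copy2(result, safe_dir)
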
What kinds of bad things can happen to data?
  • Data corruption
    • Example 1 - during a big download, truncation is easy to detect, but what if the file size is correct? (A checksum comparison, sketched after this list, catches this.)
    • Example 2 - inadvertent save
    • Example 3 - bad sectors on a hard drive
    • Example 4 - operating system crash and incomplete filesystem recovery
    • Example 5 - software bug (say, in a database)
    • Example 6 - bitrot
  • User error
    • Example - accidentally or unintentionally deleting a file, a directory, or more
  • Catastrophic loss
    • Example 1 - A meteor hits your data center
    • Example 2 - Your otherwise reliable filesystem corrupts the metadata table.
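
  Several of the corruption scenarios above (a truncated download with a plausible size, an inadvertent save, bad sectors, bit rot) can only be caught by comparing against a checksum recorded while the data were known to be good. A minimal sketch, assuming the data provider publishes a SHA-256 digest (the file name and digest below are placeholders):

    import hashlib

    def sha256sum(path, chunk_size=1 << 20):
        """Stream a file through SHA-256 so that large downloads fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    expected = "9f2c..."               # digest published alongside the download
    actual = sha256sum("chr1.vcf.gz")  # placeholder file name
    if actual != expected:
        raise SystemExit("checksum mismatch: file is corrupt or incomplete")

  The same idea applies to our own derived data: recording digests while the data are known to be good is what makes a later integrity check possible.
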
What options are immediately available to SSG researchers for backing up data?
  • SSG version control server at svn.ssg.uab.edu
    • Version control primarily helps us manage changes to research data (including source code, intermediate working material, and results), achieve reproducible research, and easily synchronize our efforts with fellow collaborators (who may be located in geographically disparate locations).
    • However, because data committed to version control are more or less by definition necessary for reproducible research, a nice side effect is that all repositories are part of an automatic backup plan detailed at http://www.ssg.uab.edu/wiki/x/QgCN.
    • Training opportunities and materials: see http://www.ssg.uab.edu/wiki/x/gAEv for a summary. Most end users (who have given me feedback) are able to get started by investing 1-2 hours in following the "quick start".
    • Is everyone who wishes to use version control able to do so, and sufficiently enabled to do so?
  • specific case-by-case arrangements with UAB or SOPH IT to do backups of research data
    • can be one-time (for data not expected to change), recurring, or as customized as need be
    • typically to tape or other low-cost external media at going rates
  • specific servers with storage earmarked for mirroring public data
    • $UABGRID_PROJECT
    • ssg-srv3 genomics "databasing" server with 24 TB of storage
Are there situations where the available backup options are non-optimal or inadequate? If yes, what are those situations, which options should we add, and what are their pros, cons, and costs?
  • Some options
  • Be sure to avoid data security theater here! You still need to test your backup solution thoroughly and regularly; a minimal verification sketch follows this list.
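
  One way to avoid theater is to routinely prove that the backup can actually reproduce the original. A minimal verification sketch (the paths are placeholders, and a real plan should also include periodic trial restores) that compares a source tree against its backup by file list and SHA-256 digest:

    import hashlib
    import os

    def tree_digests(root):
        """Map each file path (relative to root) to its SHA-256 digest."""
        digests = {}
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(full, "rb") as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        h.update(chunk)
                digests[os.path.relpath(full, root)] = h.hexdigest()
        return digests

    source = tree_digests("/data/project")    # placeholder: the live data
    backup = tree_digests("/backup/project")  # placeholder: the backup copy

    missing = sorted(set(source) - set(backup))
    changed = sorted(p for p in source if p in backup and source[p] != backup[p])
    print("missing from backup:", missing)
    print("digest mismatch:", changed)
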
What SSG data are currently in limbo? If some (or all) data are recovered, what things can we do to check that the data still have integrity? In the worst-case scenario, where data cannot be recovered from the Lustre filesystem, what should we be doing to reconstruct the lost data and redo lost analyses?
  • non-version-controlled data (or data that has not otherwise been backed up)
    • Can these data be re-created? If yes, what resources will be needed? Reservations?
  • uncommitted VC data
  • "public" cached data (think WTCCC, TCGA, 1000G, etc)
  • prototype UAB Galaxy instance and associated files