Handling of Petabyte-Scale Datasets in Modern Physics Experiments
With advances in accelerator technologies, which can accelerate an ever-growing variety of particle species to higher and higher energies, the volume of data produced by physics experiments has grown dramatically. With the latest generations of detector systems at the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory and the Large Hadron Collider (LHC) at CERN in Geneva, annual dataset sizes are routinely measured in petabytes.
The PHENIX experiment at RHIC crossed the petabyte-per-year threshold in 2004 and has collected about 7 PB of raw data since then. The processed data, traditionally called “Data Summary Tapes” or DSTs, add another 50% to the overall dataset size.
Using the PHENIX experiment as an example, we will describe how these datasets are acquired. We will outline the approaches different groups have taken to cope with such large datasets, and survey the storage technologies in use as well as access strategies such as the Grid and Analysis Trains. We will describe the problem of disseminating large datasets to geographically dispersed groups of scientists (and, in some cases, granting public access to the data), and present different solutions.
Another problem in the digital age, and not only for scientists, is data retention: the lifetime and support of digital media and formats is measured in years, while the data must remain accessible and readable for many decades. The standard industrial solution — making enough backups — is generally impractical for petabyte-sized datasets.
Introduction to Programming with CUDA
In recent years, a new trend in high-performance computing has emerged that uses commodity graphics processing units (GPUs) for massively parallel computing tasks. The growth in GPU processing power, driven by high-end computer gaming, can readily be put to use for other compute-intensive tasks. Off-the-shelf systems providing multiple teraflops of processing power are available at commodity price levels today.
This presentation provides an introduction to CUDA, NVIDIA’s framework for GPU programming. I will briefly put CUDA into perspective alongside the equivalent toolkit for ATI cards and the still-emerging OpenCL standard, which is designed to provide a vendor-neutral framework for programming GPUs.
We will start with the prerequisites on the main platforms (Linux, Mac, Windows), and progress to simple examples to show the relative ease of use. We will demonstrate some common pitfalls and show how to avoid them, touch on some special considerations for multi-threaded programs on the CPU, and, given time and interest, progress to some advanced topics. At the end of the presentation, you should be able to write simple CUDA programs and be able to study advanced topics on your own.
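To give a flavor of the simple examples mentioned above, here is a minimal CUDA program of the kind typically used to introduce the framework: an element-wise vector addition. All names (vecAdd, n, etc.) are illustrative, not taken from the presentation; compiling it requires NVIDIA’s nvcc compiler and a CUDA-capable GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each GPU thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host-side buffers.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device-side buffers; copy the inputs to the GPU.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back (implicitly synchronizes with the kernel).
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The allocate–copy–launch–copy-back pattern shown here is the basic structure of nearly every CUDA program, and the explicit host/device memory split is one of the common pitfalls the presentation addresses.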
Dr. Martin Purschke is a staff physicist at Brookhaven National Laboratory. He is the Data Acquisition Coordinator of the PHENIX experiment at the Relativistic Heavy Ion Collider and has written substantial portions of the online and offline software in use in PHENIX. He is also a member of the RatCAP PET-imaging project at BNL. Most of his CUDA programming takes place in the context of medical image reconstruction for the PET detectors.