Reading XML files containing gzipped data in C++

21 minute read

Introduction

We saw how to create an XML file containing compressed particle data in the article “XML format for particle input files”. Let us now explore how to read in that data in our C++ particle simulation code.

Recap

Recall that the compressed base64 XML file contains data of the form shown below. We would like to convert these data back into numerical values that can be used by the simulation.

<?xml version="1.0"?>
<Ellip3D_input>
  <Particles number="3" type="ellipsoid" compression="gzip" encoding="base64">
    <id unit="none" numComponents="1">eJwzVDC0MFcwNjcGAAg2Aa8=</id>
    <radii unit="m" numComponents="3">eJwz1DM0VTDQA2EzSwVD3DwAm34HcA==</radii>
    <axle_a unit="rad" numComponents="3">eJxVy8ENwEAIA8FWqABx5oxx/40lz+S30miRYBVvnKRKnqgcGRQDyZ5Zfc3DdemtNXrB/7j3tB+22RBv</axle_a>
    <axle_b unit="rad" numComponents="3">eJxNyckNACAIAMFWaEACIldB9t+Cmojxt5llVCdPg4HGKTYbOXDhjSNBoiJ7P42KhI5hmhT/1gVJPxJZ</axle_b>
    <axle_c unit="rad" numComponents="3">eJxNyckNADAIxMBWUgGCheXov7HwS34jGxIJdx4TltbkgYCqjIXVtvgXPbO5iHSreUulB97oC4BQD7g=</axle_c>
    <position unit="m" numComponents="3">eJwNyMERAEAEA8BWVGBCONJ/Y+e3syUfxpTB4BLfZFsfAb3LdiY6ZYRvROUdfbUgP13AC20=</position>
    <velocity unit="m/s" numComponents="3">eJwlitsNACAIxFZhAUmRh7j/YpKY3M+1RdnQuAmCkjnzloU6Fl05fI5NcT+38Xef30cFduoB3dgNXA==</velocity>
    <omega unit="rad/s" numComponents="3">eJwzUDDQMzAytzC0NDRXMABCXSDf3MDMzNDSAsY3sjA1szQ2VDAAAL6CCE8=</omega>
    <force unit="N" numComponents="3">eJwdycERACAIA7BVWACuYK2w/2J6fhPnpJoG8yzNanMyVxzUNzR2sK1qpNjPpNcgL0K4Cu8=</force>
    <moment unit="Nm" numComponents="3">eJwdy8kNwDAMA8FW1IAE0rr7byyxf4MFFsaayS1RcppHFObYsxOXZEeNi3q1u+XciOzY9JKfwQBWNOGwqDeBi674ADeiEak=</moment>
  </Particles>
  <Sieves number="5" compression="none" encoding="ascii">
    <percent_passing>1 0.8 0.6 0.3 0.1</percent_passing>
    <size unit="mm">1.4 1.3 1.2 1.15 1</size>
    <sieve_ratio>
      <ratio_ba>0.8</ratio_ba>
      <ratio_ca>0.6</ratio_ca>
    </sieve_ratio>
  </Sieves>
</Ellip3D_input>

The header file

We use a ParticleFileReader object to read the file. The declaration of the object is listed below. Particle data are stored in an array of pointers to Particle objects, called ParticlePArray.

#include <zenxml/xml.h>
#include <string>
#include <vector>
class ParticleFileReader
{
public:
  ....
  void read(const std::string& fileName,
            ParticlePArray& particles) const;
private:
  template<typename T>
  bool readParticleValues(zen::XmlIn& ps,
                          const std::string& name,
                          const std::string& particleType,
                          std::vector<T>& output) const;
  template<typename T>
  bool decodeAndUncompress(const std::string &inputStr,
                           const int& numComponents,
                           std::vector<T>& output) const;
  template<typename T>
  T convert(const std::string& str) const;
  ....
};

The main workhorse methods in this class are the templated functions readParticleValues, decodeAndUncompress, and convert. Templates are used because similar logic is used for different variable types.

The implementation

Let us now look at the implementations of these functions. We will ignore any checks that are necessary to make sure that the XML file is readable and contains the right data.

The read function

The point of entry is the read function:

bool
ParticleFileReader::read(const std::string &inputFileName,
                         ParticlePArray &particles) const
{
  // Read the input file
  zen::XmlDoc docs = zen::load(inputFileName);
  // Load the docsument into input proxy for easier element access
  zen::XmlIn ps(docs);
  // Loop through the particle types in the input file
  for (auto particle_ps = ps["Particles"]; particle_ps; particle_ps.next()) {
    // Get the attributes of the particles
    std::size_t numParticles = 0;
    std::string particleType = "sphere", compression = "none", encoding = "none";
    particle_ps.attribute("number", numParticles);
    particle_ps.attribute("type", particleType);
    particle_ps.attribute("compression", compression);
    particle_ps.attribute("encoding", encoding);
    // Assume that the input file is encoded and compressed
    if (encoding == "base64" && compression == "gzip") {
      // Get the particle ids
      std::vector<size_t> particleIDs;
      bool success = readParticleValues<size_t>(particle_ps,
                                                "id", particleType,
                                                particleIDs);
      // Get the particle radii
      std::vector<Vec> particleRadii;
      success = readParticleValues<Vec>(particle_ps,
                                        "radii", particleType,
                                        particleRadii);
      ............
      ............
      // Create the Particle array
      for (std::size_t ii = 0; ii < numParticles; ++ii) {
        ParticleP pt = std::make_shared<Particle>(
            particleIDs[ii], particleType, particleRadii[ii], ....);
        particles.push_back(pt);
      }
    } // end if (encoding == "base64" && compression == "gzip") 
  } // end for
  .....
}

The particle data associated with each tag is an array containing either 1 or 3 components. We use the explicitly instantiated templated function readParticleValues<T> to read in the data into arrays.

The readParticleValues templated function

Let us now look at the readParticleValues function that does the extraction and conversion of the compressed and encoded data.

template <typename T>
bool
ParticleFileReader::readParticleValues(zen::XmlIn &ps,
                                       const std::string &name,
                                       const std::string &particleType,
                                       std::vector<T> &output) const
{
  // Get the particle values
  int numComp = 1;
  std::string particleDataStr;
  auto prop_ps = ps[name];
  prop_ps.attribute("numComponents", numComp);
  prop_ps(particleDataStr);
  // Do the decoding and inflation of the compressed data
  bool success = decodeAndUncompress<T>(particleDataStr, numComp, output);
  return success;
}

The function just extracts the encoded data from the XML file and the number of components in the data (1 or 3). It then passes these on to the actual decode and uncompress code.

The decodeAndUncompress templated function

This is where the main work is done. For decoding the data into binary form, we use the cppcodec library. For decompression we use ZLib. To make sure that the cppcodec library is available in the repository where our code is stored, we add it as a submodule using

git submodule add git://github.com/tplgy/cppcodec.git cppcodec

For the Zlib library to be available to our cmake build system, we add the following to our CMakeLists.txt file:

#-------------------------------------------------------
# Add requirements for Zlib compression library 
#-------------------------------------------------------
find_package(ZLIB REQUIRED)
if (ZLIB_FOUND)
  message(STATUS "Zlib compression library found")
  include_directories(${ZLIB_INCLUDE_DIRS})
else()
  message(STATUS "Zlib compression library not found")
  set(ZLIB_DIR "")
  set(ZLIB_LIBRARIES "")
  set(ZLIB_INCLUDE_DIRS "")
endif()

The code for the decodeAndUncompress function is listed below.

#include <cppcodec/cppcodec/base64_default_rfc4648.hpp>
#include "zlib.h"
template <typename T>
bool
ParticleFileReader::decodeAndUncompress(const std::string &inputStr,
                                        const int &numComponents,
                                        std::vector<T> &output) const
{
  // Decode from base64
  std::vector<std::uint8_t> decoded = base64::decode(inputStr);
  // Uncompress from gzip
  std::vector<std::uint8_t> uncompressed;
  z_stream stream;
  // Allocate inflate state
  stream.zalloc = Z_NULL;
  stream.zfree = Z_NULL;
  stream.opaque = Z_NULL;
  stream.avail_in = 0;
  stream.next_in = Z_NULL;
  int err = inflateInit(&stream);
  if (err != Z_OK) {
    std::cerr << "inflateInit" << " error: " << err << std::endl;
    return false;
  }
  // Uncompress until stream ends
  stream.avail_in = decoded.size();
  stream.next_in = &decoded[0];
  do {
    do {
      std::vector<std::uint8_t> out(decoded.size());
      stream.avail_out = out.size();
      stream.next_out = &out[0];
      err = inflate(&stream, Z_SYNC_FLUSH);
      uncompressed.insert(std::end(uncompressed), std::begin(out), std::end(out));
    } while (stream.avail_out == 0);
  } while (err != Z_STREAM_END);
  // Clean up and exit
  if (inflateEnd(&stream) != Z_OK) {
    std::cerr << "inflateEnd" << " error: " << err << std::endl;
    return false;
  }
  // Split the uncompressed string into a vector of tokens
  // (Assume that data are space separated)
  // (See: https://stackoverflow.com/questions/236129/split-a-string-in-c)
  std::istringstream iss(std::string(uncompressed.begin(), uncompressed.end()));
  std::vector<std::string> outputStr = {std::istream_iterator<std::string>{iss},
                                        std::istream_iterator<std::string>{}};
  // Convert the strings into the right type
  for (auto iter = outputStr.begin(); iter != outputStr.end(); iter += numComponents) {
    std::string str = *iter;
    for (int ii = 1; ii < numComponents; ii++) {
      // For more than one component, join into string with space separator
      str += " ";
      str += *(iter + ii);
    }
    output.push_back(convert<T>(str));
  }
  return true;
}

Here the main complication arises during the inflation of the compressed data. We don’t know the size of the output buffer beforehand and have to read the buffer repeatedly until the entire input buffer has been inflated. After each chunk has been read into the out vector, we insert the data into uncompressed and continue the process.

After the entire stream has be uncompressed, we convert the string into the correct size type using the convert<T> function. Notice that this function is implicitly instantiated using output.push_back(convert<T>(str)). Template specialization is needed at this stage to make sure the right work is work during the conversion of each type. To see why this is not always a good idea, see the article Why Not Specialize Function Templates?. Care is needed to make sure that we don’t try to explictly instantiate convert<T> elsewhere, and modern compilers will probably throw an error if that is attempted.

The convert<T> template specializations

We will define two specializations here; the first function deals with properties such as particle ID while the second deals with vector properties such as position and force.

template <>
size_t
ParticleFileReader::convert<size_t>(const std::string &str) const
{
  return std::stoul(str);
}
template <>
Vec
ParticleFileReader::convert<Vec>(const std::string &str) const
{
  std::istringstream iss(std::string(str.begin(), str.end()));
  std::vector<std::string> split = {std::istream_iterator<std::string>{iss},
                                    std::istream_iterator<std::string>{}};
  return Vec(std::stod(split[0]), std::stod(split[1]), std::stod(split[2]));
}

That completes the implementation. To see a version of this approach in action, look at ParticleFileReader.cpp.

Remarks

We can see that the process or decoding and unzipping the data in the XML file is quite straightforward. But it takes a bit more effort than reading a formatted text file. However, if our data include millions of particles, and these particles have to be broadcast to several nodes of a multiprocessor system, compression can not only save us a lot of communication time during simulations but also disk space.

In the next article, we will explore some more aspects of our particle simulation code.

Categories: , ,

Updated: