pygcam.query

This module contains classes and a sub-command that allow you to run batch queries against GCAM’s XML database, generate label rewrites on the fly, and perform various common operations on the resulting .csv files.

Running queries by name

The query module looks for queries by name in any of the directories or files listed in the Query Path, which is specified as an argument to the function pygcam.query.runBatchQuery(), on the command-line to the query sub-command, or by the value of the config variable GCAM.QueryPath.

Elements of the Query Path are separated by ‘;’ on Windows and by ‘:’ otherwise. The elements can be:

  1. An XML file structured like the standard GCAM queries file, Main_Queries.xml,
  2. An XML file structured like a batch query file, or
  3. A directory holding XML files defining queries.
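
As a rough illustration of the Query Path format described above, the following sketch splits a path string and labels each element as a directory or an XML file. The function classify_query_path is hypothetical and is not part of pygcam; it only mirrors the separator convention by way of Python's os.pathsep, which is ‘;’ on Windows and ‘:’ elsewhere.

```python
import os

def classify_query_path(query_path, sep=os.pathsep):
    """Split a Query Path string and label each element.

    Hypothetical helper for illustration; not part of pygcam.
    """
    result = []
    for element in query_path.split(sep):
        if not element:
            continue  # skip empty elements, e.g. from a trailing separator
        # Anything that is not an existing directory is treated as an XML file.
        kind = 'directory' if os.path.isdir(element) else 'xml-file'
        result.append((element, kind))
    return result
```

Elements are returned in their original order, which matters because the Query Path is searched in order.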

To locate the query code, each element of GCAM.QueryPath is examined, in order, for the named query and variations thereof (described below).

If a path element is a directory, we look for a file with the exact query name and an .xml extension. If the file exists, it is not altered in any way; it is simply referenced from the generated batch query file. If a path element is not a directory, it must be the name of an XML file in the format of Main_Queries.xml or a query file for use as a batch query. To find the query, the title attribute is compared directly and with the variants below, in this order:

  1. Original name, but with all ‘-’ changed to spaces.
  2. Original name, but with all ‘_’ changed to spaces.
  3. Original name, but with all ‘-’ and ‘_’ changed to spaces.
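
The matching order above can be sketched as a small function that yields the candidate titles to compare against each query's title attribute. The name title_variants is hypothetical, and dropping duplicate candidates is an assumption made here for tidiness rather than documented pygcam behavior.

```python
def title_variants(name):
    """Return candidate query titles in the order they would be tried.

    Hypothetical helper for illustration; not part of pygcam.
    """
    candidates = [
        name,                                      # the name exactly as given
        name.replace('-', ' '),                    # all '-' changed to spaces
        name.replace('_', ' '),                    # all '_' changed to spaces
        name.replace('-', ' ').replace('_', ' '),  # both changed to spaces
    ]
    unique = []
    for candidate in candidates:
        if candidate not in unique:  # assumption: skip repeated candidates
            unique.append(candidate)
    return unique
```

For a name with neither ‘-’ nor ‘_’, only the original name is returned.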

Running sets of queries

The query sub-command allows a set of queries to be identified in a text file, with one query name per line, or in an XML file that offers additional features. See queries.xml for information on the XML file format.
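
For the plain-text form, a minimal reader might look like the sketch below. The function read_query_names is hypothetical, and stripping whitespace and ignoring blank lines are assumptions; the text above specifies only one query name per line.

```python
def read_query_names(lines):
    """Collect query names from an iterable of text lines, one name per line.

    Hypothetical helper for illustration; not part of pygcam.
    """
    names = []
    for line in lines:
        name = line.strip()  # assumption: surrounding whitespace is not significant
        if name:             # assumption: blank lines are ignored
            names.append(name)
    return names
```

An open file object can be passed directly, since iterating over it yields lines.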

Controlling how queries are executed

Queries can be executed one-at-a-time or all together in a single XML batch file. This behavior is controlled by configuration parameter GCAM.BatchMultipleQueries, which, if set to True, runs all queries in one file.

A new version of GCAM (available only internally at JGCRI currently) allows the XML database to be stored in memory rather than being written to disk, saving both time and disk space. GCAM will run designated queries against this in-memory database to generate CSV files prior to exiting. In addition, GCAM can run queries automatically even if the database is written to disk. Two additional config parameters control this behavior in pygcam: GCAM.InMemoryDatabase and GCAM.RunQueriesInGCAM, whose meanings should be obvious. Setting GCAM.InMemoryDatabase implies both GCAM.RunQueriesInGCAM and GCAM.BatchMultipleQueries are True, since this is the only way to get data out of the model.
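
The implication described above can be written out as a small sketch. The function effective_query_settings is hypothetical and does not read the pygcam configuration file; it only shows how the three boolean settings relate to one another.

```python
def effective_query_settings(in_memory_db, run_queries_in_gcam, batch_multiple_queries):
    """Return the settings that take effect for the three config parameters.

    Hypothetical helper for illustration; not part of pygcam.
    """
    if in_memory_db:
        # With an in-memory database, queries must be run by GCAM itself,
        # all together, before the model exits.
        run_queries_in_gcam = True
        batch_multiple_queries = True
    return {
        'GCAM.InMemoryDatabase': in_memory_db,
        'GCAM.RunQueriesInGCAM': run_queries_in_gcam,
        'GCAM.BatchMultipleQueries': batch_multiple_queries,
    }
```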

Generating label rewrites to aggregate and filter results

The user can define a set of queries in an XML file. The query elements are processed in order, adding any designated label rewrites to the query.

The query sub-command runs the batch query in ModelInterface and saves the results in the designated .CSV file. See the GCAM tool (gt) page for command-line options.

Label rewrites are currently defined in a separate rewriteSets.xml, which can be named in the pygcam configuration file by the variable GCAM.RewriteSetsFile.

API

pygcam.query.createBatchFile(scenario, queries, xmldb='', queryPath=None, outputDir=None, regions=None, regionMap=None, rewriteParser=None, batchFileIn=None, batchFileOut=None, tmpFiles=True, noDelete=False)

Create an optionally-temporary XML file that will run multiple queries, by extracting queries into separate temp files and referencing them from the batch query file.

Parameters:
  • scenario – (str) the name of the scenario to perform the query on
  • queries – (list of str query names and/or Query instances)
  • xmldb – (str) path to XMLDB, or ‘’ to use in-memory DB
  • queryPath – (str) a list of directories or XML filenames, separated by a colon (on Unix) or a semi-colon (on Windows)
  • outputDir – (str) the directory in which to write the .CSV with query results, default is value of GCAM.OutputDir.
  • regions – (iterable of str) the regions you want to include in the query. If not specified here, the value appearing in the <Query states=”xxx”> statement is used to return the indicated region names.
  • regionMap – (dict-like) keys are the names of regions that should be rewritten. The value is the name of the aggregate region to map into.
  • rewriteParser – (RewriteSetParser instance) parsed representation of rewriteSets.xml
  • batchFileIn – (str) the name of a pre-formed batch file to run
  • batchFileOut – (str) where to write output from batchFileIn, if given
  • singleCSV – (bool) if True, write a batch file that puts all results in a single CSV output file.
  • tmpFiles – (bool) if True, temporary files are used and deleted when the program exits; otherwise normal files are created in outputDir.
  • noDelete – (bool) if True, temporary files created by this function are not deleted (use for debugging)
Returns:

(str) the pathname of the temporary batch query file

pygcam.query.csv2xlsx(inFiles, outFile, skiprows=0, interpolate=False, years=None, startYear=0)

Convert a set of CSV files representing GCAM query results into an XLSX file with an index page linked by the file names to the sheets with the results.

Parameters:
  • inFiles – (list of str) the names of CSV files to read.
  • outFile – (str) the name of the XLSX file to create
  • skiprows – (int) the number of rows to skip in the CSV files before processing.
  • interpolate – (bool) if True, interpolate annual values between time-steps.
  • years – (str) the years to extract from the CSV files; must be of the form XXXX-YYYY, e.g. 2010-2050.
  • startYear – (int) If interpolating, the year to begin interpolation
Returns:

none
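
The XXXX-YYYY form of the years argument can be turned into a pair of integers as in the following sketch. The function parse_year_range is hypothetical and is shown only to make the expected string format concrete.

```python
def parse_year_range(years):
    """Parse a string such as '2010-2050' into a (first, last) tuple of ints.

    Hypothetical helper for illustration; not part of pygcam.
    Raises ValueError if the string does not have exactly two integer parts.
    """
    first, last = (int(part) for part in years.split('-'))
    return first, last
```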

pygcam.query.dropExtraCols(df, inplace=True)

Drop some columns that GCAM queries sometimes return, but which we generally don’t need. The columns to drop are taken from the configuration file variable GCAM.ColumnsToDrop, which should be a comma-delimited string. The default value is scenario,Notes,Date.

Parameters:
  • df – a DataFrame holding the results of a GCAM query.
  • inplace – if True, modify df in-place; otherwise return a modified copy.
Returns:

the original df (if inplace=True) or the modified copy.
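
To make the comma-delimited format concrete, the sketch below applies a drop specification to a plain list of column names instead of a DataFrame. The function columns_to_keep is hypothetical; its default argument repeats the default value of GCAM.ColumnsToDrop given above.

```python
def columns_to_keep(columns, drop_spec='scenario,Notes,Date'):
    """Return the column names that remain after applying a drop specification.

    Hypothetical helper for illustration; not part of pygcam.
    """
    # Split the comma-delimited string, tolerating spaces around the names.
    to_drop = {name.strip() for name in drop_spec.split(',')}
    return [col for col in columns if col not in to_drop]
```

Names in the specification that do not occur in the column list are simply ignored in this sketch.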

pygcam.query.extractQueries(titles, xmlFiles, delete=True)

Find the named queries in the given XML files and extract them to a tmp batch file. Return the path to the temporary batch file. Note that both the titles and xmlFiles arguments can be either a list (or tuple) or a delimited string. On Windows, the delimiter is “;”; on macOS and Linux, it’s “:”.

Parameters:
  • titles – (list of str, or “path” string with one or more filenames) names of queries
  • xmlFiles – (list of str, or “path” string with one or more filenames) pathnames of XML files containing queries
  • delete – (bool) whether to delete the temporary file (set to False to debug)
Returns:

(str) the pathname to the temporary batch file

pygcam.query.interpolateYears(df, startYear=0, inplace=False)

Interpolate linearly between each pair of years in the GCAM output. The time-step is calculated from the numerical (string) column headings given in the DataFrame df, which are assumed to represent years in the time-series. The years to interpolate between are read from df, so there’s no dependency on any particular time-step, or even on the time-step being constant.

Parameters:
  • df – (DataFrame) Data of the format returned by batch queries on the GCAM XML database
  • startYear – (int) If non-zero, begin interpolation at this year.
  • inplace – (bool) If True, modify df in place; otherwise modify a copy.
Returns:

if inplace is True, df is returned; otherwise a copy of df with interpolated values is returned.
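
The interpolation described above can be sketched without pandas for a single result row held in a dict. The function interpolate_row is hypothetical, and treating startYear as "skip intervals that begin before this year" is an assumption about the intended semantics.

```python
def interpolate_row(row, start_year=0):
    """Return a copy of row with annual values filled in between year columns.

    Hypothetical helper for illustration; not part of pygcam. Year columns
    are the keys made up entirely of digits; other keys are copied unchanged.
    """
    years = sorted(int(col) for col in row if col.isdigit())
    result = dict(row)
    # Walk over consecutive pairs of years, so uneven time-steps work too.
    for first, last in zip(years, years[1:]):
        if first < start_year:
            continue  # assumption: intervals before start_year are left alone
        v0, v1 = float(row[str(first)]), float(row[str(last)])
        for year in range(first + 1, last):
            fraction = (year - first) / (last - first)
            result[str(year)] = v0 + fraction * (v1 - v0)
    return result
```

Because the years are read from the row itself, a mix of 5-year and 10-year steps is handled the same way as a constant step.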

pygcam.query.limitYears(df, years)

Modify df to drop all years outside the range given by years.

Parameters:
  • years – a sequence of two years (str or int); only values in this range (inclusive) are kept. Data for other years is dropped.
Returns:

(DataFrame) df, modified in place.

pygcam.query.readCsv(filename, skiprows=1, years=None, interpolate=False, startYear=0, cache=False)

Read a CSV file of the form generated by GCAM batch queries, i.e., skip one row and then read column headings and data. Optionally drop all years outside the years given. Optionally, linearly interpolate annual values between time-steps.

Parameters:
  • filename – (str) the path to a CSV file
  • skiprows – (int) the number of rows to skip before reading the data matrix
  • years – (iterable of two values coercible to int) the year columns to keep; others are dropped
  • interpolate – (bool) If True, interpolate annual values between time-steps
  • startYear – (int) If interpolating, the year to begin interpolation
  • cache – (bool) If True, file will be sought in, and saved to, a CSV cache. The “raw” file data is cached, so if called with different processing args, the same initial DataFrame is used, but it will be processed correctly.
Returns:

(DataFrame) the data read in, processed as per arguments
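
The file layout that readCsv expects (one title row, then column headings, then data) and the year filtering can be illustrated with the standard csv module. The function read_gcam_csv is a hypothetical stand-in that returns a list of dicts rather than a DataFrame and leaves all values as strings.

```python
import csv

def read_gcam_csv(text, skiprows=1, years=None):
    """Parse CSV text in GCAM batch-query layout into a list of dicts.

    Hypothetical helper for illustration; not part of pygcam.
    """
    lines = text.splitlines()[skiprows:]  # drop the leading title row(s)
    rows = [dict(row) for row in csv.DictReader(lines)]
    if years is not None:
        first, last = int(years[0]), int(years[1])

        def keep(col):
            # Non-year columns are always kept; year columns only if in range.
            return not col.isdigit() or first <= int(col) <= last

        rows = [{c: v for c, v in row.items() if keep(c)} for row in rows]
    return rows
```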

pygcam.query.readQueryResult(batchDir, baseline, queryName, years=None, interpolate=False, startYear=0, cache=False)

Compose the name of the ‘standard’ result file, read it into a DataFrame and return the DataFrame. Data is read from the computed filename “{batchDir}/{queryName}-{baseline}.csv”.

Parameters:
  • batchDir – (str) a directory in which the data file resides
  • baseline – (str) the name of a baseline scenario
  • queryName – (str) the name of a batch query.
  • years – (iterable of two values coercible to int) the year columns to keep; others are dropped
  • interpolate – (bool) If True, interpolate annual values between time-steps
  • startYear – (int) If interpolating, the year to begin interpolation
  • cache – (bool) If True, files will be sought in and saved to a CSV cache
Returns:

(DataFrame) the data in the computed filename.
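
The computed filename can be reproduced as follows. The function query_result_path is hypothetical; it only assembles the “{batchDir}/{queryName}-{baseline}.csv” pattern given above.

```python
import os

def query_result_path(batch_dir, baseline, query_name):
    """Build the 'standard' result file path for a query and baseline.

    Hypothetical helper for illustration; not part of pygcam.
    """
    return os.path.join(batch_dir, '{}-{}.csv'.format(query_name, baseline))
```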

pygcam.query.readRegionMap(filename)

Read a region map file containing one or more tab-delimited lines of the form key <tab> value, where key should be a standard GCAM region and value the name of the region to map the original to, which can be an existing GCAM region or a new name defined by the user.
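
A minimal parser for this file format might look like the sketch below. The function parse_region_map is hypothetical, and ignoring blank lines is an assumption; the description above covers only the key <tab> value lines.

```python
def parse_region_map(lines):
    """Parse tab-delimited 'key<tab>value' lines into a dict.

    Hypothetical helper for illustration; not part of pygcam.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Split on the first tab only, so values may contain spaces.
        key, value = line.split('\t', 1)
        mapping[key.strip()] = value.strip()
    return mapping
```

Several GCAM regions can map to the same aggregate name, which is how regions are combined.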

Parameters:
  • filename – the name of a file containing region mappings
Returns:

a dictionary holding the mappings read from filename

pygcam.query.runBatchQuery(scenario, queryName, queryPath, outputDir, xmldb='', csvFile=None, miLogFile=None, regions=None, regionMap=None, rewriters=None, rewriteParser=None, noRun=False, noDelete=False, saveAs=None)

Run a single query against GCAM’s XML database given by xmldb (or computed from other parameters), saving the results into a CSV file.

Parameters:
  • scenario – (str) the name of the scenario to perform the query on
  • queryName – (str) the name of a query to execute
  • queryPath – (str) a list of directories or XML filenames, separated by a colon (on Unix) or a semi-colon (on Windows)
  • outputDir – (str) the directory in which to write the .CSV with query results
  • xmldb – (str) the path to the XML database, or ‘’ to use in-memory DB
  • csvFile – if None, query results are written to a computed filename.
  • miLogFile – (str) optional name of a log file to write ModelInterface output to.
  • regions – (iterable of str) the regions you want to include in the query
  • regionMap – (dict-like) keys are the names of regions that should be rewritten. The value is the name of the aggregate region to map into.
  • rewriters – (list of tuples of (mapName, level)) list of mapping rewrites to apply to the query results, based on rewriteSets.xml. If level is specified, it overrides the level given in the mapping as defined in rewriteSets.xml.
  • rewriteParser – (RewriteSetParser instance) parsed representation of rewriteSets.xml
  • noRun – (bool) if True, the command is printed but not executed
  • noDelete – (bool) if True, temporary files created by this function are not deleted (use for debugging)
  • saveAs – (str) alternative name to use to save the query results as
Returns:

(str) the absolute path to the generated .CSV file, or None

pygcam.query.runModelInterface(scenario, outputDir, csvFile=None, batchFile=None, queryFile=None, queryText=None, xmldb='', miLogFile=None, noDelete=False, noRun=False, asDataFrame=False)

Run a query file on the XML database given by xmldb, saving results in a CSV file.

Parameters:
  • scenario – (str) the name of the scenario to perform the query on
  • outputDir – (str) the directory in which to write the CSV with query results
  • csvFile – (str) the file to create; ignored if batchFile is not None.
  • batchFile – (str) the path to an existing batchFile; if not given, a temporary one will be created based on the other arguments. If no batchFile is specified, a csvFile must be provided.
  • queryFile – (str) the path to the XML file holding the queries, or None if the batchFile is not None, in which case the queryFile should already be referenced in the batchFile.
  • queryText – (str) the text of an XML query. Used only if queryFile and batchFile are both None, in which case the queryText is written to a tmp file that is used as the queryFile.
  • xmldb – (str) the path to the XML database or ‘’ to use in-memory DB
  • miLogFile – (str) optional name of a log file to write ModelInterface output to.
  • noRun – (bool) if True, the command is printed but not executed
  • noDelete – (bool) if True, temporary files created by this function are not deleted (use for debugging)
  • asDataFrame – (bool) see return, below
Returns:

(str or DataFrame) if asDataFrame is True, return the result as a pandas DataFrame, if the query was successful, else None. If asDataFrame is False, return the absolute path to the generated CSV file if one was specified, else None.
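
The way batchFile, queryFile, queryText and csvFile interact, as described in the parameter list above, can be summarized in a short sketch. The function choose_query_source is hypothetical, and raising ValueError when no source is given at all is an assumption rather than documented behavior.

```python
def choose_query_source(batchFile=None, queryFile=None, queryText=None, csvFile=None):
    """Return a (kind, value) pair naming the argument that supplies the query.

    Hypothetical helper for illustration; not part of pygcam.
    """
    if batchFile is not None:
        # An existing batch file already references its query file,
        # and csvFile is ignored in this case.
        return ('batchFile', batchFile)
    if csvFile is None:
        raise ValueError('csvFile is required when no batchFile is given')
    if queryFile is not None:
        return ('queryFile', queryFile)
    if queryText is not None:
        # Used only when queryFile and batchFile are both None.
        return ('queryText', queryText)
    raise ValueError('no query source given')  # assumption: treated as an error
```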

pygcam.query.runMultiQueryBatch(scenario, queries, xmldb='', queryPath=None, outputDir=None, miLogFile=None, regions=None, regionMap=None, rewriteParser=None, batchFileIn=None, batchFileOut=None, noRun=False, noDelete=False)

Create a single GCAM XML batch file that runs multiple queries, placing each query’s results in a file with a name of the form {queryName}-{scenario}.csv.

Parameters:
  • scenario – (str) the name of the scenario to perform the query on
  • queries – (list of str query names and/or Query instances)
  • xmldb – (str) path to XMLDB, or ‘’ to use in-memory DB
  • queryPath – (str) a list of directories or XML filenames, separated by a colon (on Unix) or a semi-colon (on Windows)
  • outputDir – (str) the directory in which to write the .CSV with query results, default is value of GCAM.OutputDir.
  • regions – (iterable of str) the regions you want to include in the query
  • regionMap – (dict-like) keys are the names of regions that should be rewritten. The value is the name of the aggregate region to map into.
  • rewriteParser – (RewriteSetParser instance) parsed representation of rewriteSets.xml
  • batchFileIn – (str) the name of a pre-formed batch file to run
  • batchFileOut – (str) where to write output from batchFileIn, if given
  • noRun – (bool) if True, print the command that would be executed, but don’t run it.
  • noDelete – (bool) if True, temporary files created by this function are not deleted (use for debugging)
Returns:

none

pygcam.query.sumYears(files, skiprows=1, interpolate=False)

For each file given, sum all values in each year column and create a file holding the result. Each resulting filename has the same basename as the input file but ending with ‘-sum.csv’.

Parameters:
  • files – (list of str) Filenames to process
  • skiprows – (int) the number of rows to skip prior to column headers
  • interpolate – (bool) if True, interpolate annual values between time-steps
Returns:

none
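
The naming rule and the per-year summation can be sketched on rows held as dicts. Both functions are hypothetical, and writing the output next to the input file is an assumption; the text above specifies only the ‘-sum.csv’ ending.

```python
import os

def sum_filename(path):
    """Return the output name for an input file, ending in '-sum.csv'.

    Hypothetical helper for illustration; not part of pygcam.
    """
    root, _ext = os.path.splitext(path)
    return root + '-sum.csv'  # assumption: same directory as the input file

def sum_year_columns(rows):
    """Sum every year column (keys made up of digits) across all rows."""
    totals = {}
    for row in rows:
        for col, value in row.items():
            if col.isdigit():
                totals[col] = totals.get(col, 0.0) + float(value)
    return totals
```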

pygcam.query.sumYearsByGroup(groupCol, files, skiprows=1, interpolate=False)

Group data for each time-step (or interpolated annual values) by the given column (with categorical data like region or sector), and sum all members of each group to produce a time-series for each group. Equivalent to a SQL “group by” operation. For each input file, a new .CSV file is written with the name formed by the basename of the original file, followed by “-groupby-” and the groupCol. For example, given the file “foobar.csv” and groupCol “region”, the file “foobar-groupby-region.csv” would be generated. Tests that all rows have the same units; otherwise raises an error.

Parameters:
  • groupCol – (str) the column with categorical data to group by.
  • files – (list of str) Filenames to process
  • skiprows – (int) the number of rows to skip prior to column headers
  • interpolate – (bool) if True, interpolate annual values between time-steps
Returns:

none

Raises:

CommandLineError – if the rows in the input file don’t all have the same units
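
A group-by sum of this kind can be sketched on rows held as dicts. The functions below are hypothetical; the column name Units is an assumption about the query output, and ValueError stands in for pygcam's CommandLineError.

```python
import os

def groupby_filename(path, group_col):
    """Return the output name, e.g. 'foobar-groupby-region.csv'.

    Hypothetical helper for illustration; not part of pygcam.
    """
    root, _ext = os.path.splitext(path)
    return '{}-groupby-{}.csv'.format(root, group_col)

def sum_years_by_group(rows, group_col, units_col='Units'):
    """Sum the year columns of all rows sharing a value in group_col."""
    units = {row[units_col] for row in rows}
    if len(units) > 1:
        # Stand-in for CommandLineError: mixed units cannot be summed.
        raise ValueError('rows have differing units: %s' % sorted(units))
    groups = {}
    for row in rows:
        totals = groups.setdefault(row[group_col], {})
        for col, value in row.items():
            if col.isdigit():  # year columns are the all-digit keys
                totals[col] = totals.get(col, 0.0) + float(value)
    return groups
```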

pygcam.query.writeCsv(df, filename, header='', float_format='%.4f', index=None)

Save a DataFrame to a file in “standard” GCAM CSV format, which means without a numerical index, and with column headers on the second line.

Parameters:
  • df – (pandas.DataFrame) a DataFrame holding the data to write
  • filename – (str) the name of the file to create
  • header – (str) a string to write as the first line of the file (a single line preceding column headers is standard GCAM query result format).
  • float_format – (str) a format string indicating how to represent numeric values. Default shows 4 decimal places. To limit results to, for example, 2 decimal places, use float_format=”%.2f”.
Returns:

none
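
The “standard” layout (a header line, then column headings, then data with formatted floats) can be reproduced with the standard csv module. The function write_gcam_csv is a hypothetical stand-in that takes a list of dicts and an explicit column order in place of a DataFrame.

```python
import csv

def write_gcam_csv(rows, columns, stream, header='', float_format='%.4f'):
    """Write rows to stream in GCAM query-result layout.

    Hypothetical helper for illustration; not part of pygcam.
    """
    stream.write(header + '\n')  # first line: free-form header text
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(columns)     # second line: column headings
    for row in rows:
        cells = []
        for col in columns:
            value = row[col]
            # Only floats are formatted; other values are written as-is.
            cells.append(float_format % value if isinstance(value, float) else value)
        writer.writerow(cells)
```

Passing float_format='%.2f' limits the output to two decimal places, as with the real function's argument of the same name.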