Performance
There are some adjustments you can make to improve the performance of Production System, if necessary.
Configure the Number of Threads
By default, Production System processes one table at a time. To increase the output speed, you can enable multithreading. This will allow Production System to process more than one table at a time.
To change the number of threads, either:
- Set NumberOfThreads in the TableProcessing section of the configuration file.
- Use the -nt command line option.
Multiple threads are not supported when you are using recode caching (see below).
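The restriction above is easy to overlook when a run is scripted. The following is a minimal Python sketch, not part of Production System, that builds an sa2ps argument list and refuses to combine multiple threads with a recode list. The helper name build_sa2ps_args is hypothetical, and the sketch assumes that -nt takes the thread count as its next argument; check the command line reference for the exact syntax.

```python
def build_sa2ps_args(table_list, threads=1, recode_list=None):
    """Build an sa2ps argument list (hypothetical helper, for illustration only)."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if recode_list is not None and threads > 1:
        # Multiple threads are not supported together with recode caching.
        raise ValueError("multiple threads cannot be combined with recode caching")

    args = ["sa2ps", "-tl", table_list]
    if threads > 1:
        # Assumption: -nt is followed by the number of threads.
        args += ["-nt", str(threads)]
    if recode_list is not None:
        args += ["-rl", recode_list]
    return args


print(build_sa2ps_args("TableList.txt", threads=4))
print(build_sa2ps_args("TableList.txt", recode_list="RecodeList.txt"))
```

The resulting list could be passed to subprocess.run on a machine where sa2ps is on the path.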
Recode Caching
Some recodes, particularly geographic recodes, can be extremely large. If you have multiple TXD files that reference the same recode, you can configure a caching mechanism that may reduce processing overhead. It also reduces disk usage, because each recode only needs to be stored once.
To configure recode caching:
Create your tables in SuperCROSS without the recodes in the tables, and save the tables in TXD format.
It is important to save the TXD file without the recodes. Any recodes defined in the TXD file itself will take precedence over the recode cache, so if you include the recodes in the table then Production System will use the ones from the TXD file instead of using caching.
Use the Fields window in SuperCROSS to define the recodes, and then save them in Textual Recode (.TXT) format (note that you must save the recodes in text format and not the binary .rcd format).
Create a recode list file. This is a standard text file containing a list of recode files you want to load, in the following format:
DBID <dataset_id>
<recode_filename>
<recode_filename>
...
DBID <dataset_id>
<recode_filename>
<recode_filename>
...
For example:
RecodeList.txt
DBID bank
GenderRecode.txt
PostcodeRecode.txt
MaritalStatusRecode.txt
DBID people
EducationRecode.txt
- Configure Production System to use the recode list file. There are two ways you can do this:
  - Update the RecodeListFile setting in the Production System configuration file; or
  - Use the -rl command line option.
You will also want to set up your table list file and tell Production System to use it (because recode caching is clearly only going to be of any use when you are processing more than one table at a time). For example:
sa2ps -tl TableList.txt -rl RecodeList.txt
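Before starting a long run, it can help to check the recode list file. The following Python sketch is an illustration only, not a Production System tool. It assumes that the file is a sequence of whitespace-separated entries, that each DBID keyword is followed by a dataset ID and then that dataset's recode filenames, and that filenames contain no spaces. The function names are hypothetical.

```python
def parse_recode_list(text):
    """Parse recode list text into {dataset_id: [recode_filename, ...]}."""
    tokens = text.split()
    datasets = {}
    current = None
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "DBID":
            if i + 1 >= len(tokens):
                raise ValueError("DBID is missing a dataset ID")
            current = tokens[i + 1]
            datasets.setdefault(current, [])
            i += 2
            continue
        if current is None:
            raise ValueError(f"{token!r} appears before any DBID entry")
        datasets[current].append(token)
        i += 1
    return datasets


def binary_recodes(datasets):
    """Return filenames that look like binary .rcd recodes (text format is required)."""
    return [
        name
        for names in datasets.values()
        for name in names
        if name.lower().endswith(".rcd")
    ]


sample = """
DBID bank
GenderRecode.txt
PostcodeRecode.txt
MaritalStatusRecode.txt
DBID people
EducationRecode.txt
"""

parsed = parse_recode_list(sample)
print(parsed)
print(binary_recodes(parsed))  # an empty list means no .rcd files were listed
```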
Dimension Item Caching and Mapping
Mapping
SuperSTAR tables support three dimensions (rows, columns and wafers). When tables are output to a file, these three dimensions need to be mapped to two dimensions. There are several Production System settings you can change to control how this mapping takes place.
When Production System is run, the table items are rearranged in a field or axis, and a mapping is created between their current position and their original (natural) position. This mapping can be either dense or sparse:
- Dense mapping stores data for every item in a dimension, regardless of whether or not it has been moved. With this method, very large dimensions can use a lot of memory. This method is also inefficient for dimensions that have had few or no item rearrangements.
- Sparse mapping stores information only on the items that have been moved. It is significantly more efficient when only a small number of items are rearranged, and especially so with very large dimensions. However, the disadvantage of sparse mapping is that the computational expense of resolving the mappings increases exponentially with the number of items rearranged. When a large number of items are rearranged, the memory usage eventually overtakes that of the equivalent dense mapping. Large dimensions with a large number of rearrangements are therefore handled more efficiently using dense mapping.
You can control which method is used by changing the MaxSizeForDenseDataCube setting in the configuration file. You may need to experiment with different settings to find the optimal strategy for your tables.
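The storage difference between the two approaches can be illustrated with a short Python sketch. This is a conceptual model only and does not reflect how Production System implements its mappings: here a dense mapping is a list with one entry per item, and a sparse mapping is a dictionary holding only the items that have moved. It shows the storage trade-off, not the cost of resolving sparse mappings.

```python
def dense_mapping(order):
    """One entry per item: index = current position, value = natural position."""
    return list(order)


def sparse_mapping(order):
    """Entries only for items whose current position differs from the natural one."""
    return {pos: natural for pos, natural in enumerate(order) if pos != natural}


def resolve(mapping, position):
    """Return the natural position of the item now at the given position."""
    if isinstance(mapping, dict):
        # Items missing from a sparse mapping have not moved.
        return mapping.get(position, position)
    return mapping[position]


# A large dimension in which only two items have been swapped.
order = list(range(100_000))
order[0], order[1] = order[1], order[0]

dense = dense_mapping(order)
sparse = sparse_mapping(order)

print(len(dense), len(sparse))  # 100000 2
print(resolve(dense, 0), resolve(sparse, 0))  # 1 1
```

If most items are rearranged, the sparse dictionary ends up with roughly as many entries as the dense list, plus the overhead of storing keys, which is the point at which dense mapping becomes the better choice.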
Caching
When the item mappings are resolved, they can be cached so that they do not need to be resolved again.
With CSV format:
- The primary dimension (axis) is iterated across once.
- The secondary dimension is iterated across once for every item in the primary dimension.
- The tertiary dimension is iterated across once for every iteration across every item in the secondary dimension.
The benefit of caching items is greatest for the items that are accessed most often (for example the tertiary dimension). If items are accessed only once, then no caching is the best option. Most production tables have one very large dimension and one smaller, more complicated dimension. Therefore it is best to make the large dimension the secondary dimension, and the smaller more complicated dimension the tertiary dimension.
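The iteration pattern described above can be sketched as three nested loops. The following Python example is illustrative only; the function name and the dimension sizes are made up. It counts how many times each item in a dimension is visited when the primary dimension is the outermost loop and the tertiary dimension is the innermost.

```python
def accesses_per_item(primary_size, secondary_size, tertiary_size):
    """Count how often each item of a dimension is visited during output."""
    visits = {"primary": 0, "secondary": 0, "tertiary": 0}
    for _ in range(primary_size):
        visits["primary"] += 1
        for _ in range(secondary_size):
            visits["secondary"] += 1
            for _ in range(tertiary_size):
                visits["tertiary"] += 1
    # Divide the total visits by the dimension size to get visits per item.
    return {
        "primary": visits["primary"] // primary_size,
        "secondary": visits["secondary"] // secondary_size,
        "tertiary": visits["tertiary"] // tertiary_size,
    }


# Large dimension (1000 items) as secondary, small dimension (5 items) as tertiary.
print(accesses_per_item(2, 1000, 5))
# {'primary': 1, 'secondary': 2, 'tertiary': 2000}
```

With this arrangement each of the five tertiary items is visited 2000 times, so a small cache is reused heavily, while the 1000 secondary items are each visited only twice and gain little from caching.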
The MaxSizeForCacheAll setting controls caching of items in a dimension. If the number of items in the dimension exceeds the configured amount then no caching will be used.