%This defines the key/value documentation pairs.
%Then, key_print calls them, in alphabetical order.


Where the user specifies on which variables she would like to perform consistency checks. The parameters for the variables declared in checks are obtained from input/types.



The database to use for all of this. It must be the first line in your spec file, because all the remaining keys are written to the database you specify. If you don't specify a database, then the rest of the keys have nowhere to be written and your spec file will not be read correctly.
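For instance, a minimal spec file might begin like this (the database and file names here are illustrative, not defaults):
\begin{lstlisting}[language=]
database: demo.db    #must be the first line

input {
    input file: survey.csv
    output table: survey
}
\end{lstlisting}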


User-supplied explicit ratios for basic items.

group recodes

Much like recodes (qv), but for variables set within a group, like eldest in household. For example,
\begin{lstlisting}[language=]
group recodes {
    group id : hh_id
    eldest: max(age)
    youngest: min(age)
    household_size: count(*)
    total_income: sum(income)
    mean_income: avg(income)
}
\end{lstlisting}

group recodes/group id

The column with a unique ID for each group (e.g., household number).


Provides a column in the data set that provides a unique identifier for each observation. Some procedures need such a column; e.g., multiple imputation will store imputations in a table separate from the main dataset, and will require a means of putting imputations in their proper place. Other elements of Tea, like flagging for disclosure avoidance, use the same identifier. This identifier may be built by a recode.


The key where the user defines all of the subkeys related to the doMImpute() part of the imputation process. For details on these subkeys, see their descriptions elsewhere in the appendix.


Denotes the categorized set of variables by which to impute your output vars.

impute/draw count

How many multiple imputations should we do? Default: 1.

impute/earlier output table

If this imputation depends on a previous one, then give the fill-in table from the previous output here.

impute/input table

The table holding the base data, with missing values. Optional; if missing, then I rely on the system having an active table already recorded. So if you've already called {\tt doInput()} in R, for example, I can pick up that the output from that routine (which may be a view, not the table itself) is the input to this one.

impute/input vars

A comma-separated list of the independent, right-hand side variables for imputation methods such as OLS that require them. These variables are taken as given and will not be imputed in this step, so you probably need to have a previous imputation step to ensure that they are complete.

impute/margin table

Raking only: if you need to fit the model's margins to out-of-sample data, specify that data set here.


Specifies what model to use to impute output vars for a given impute key.

impute/min group size

Specifies the minimum number of known inputs that must be present in order to perform an imputation on a set of data points.

impute/near misses

If this is set to any value, then the EM algorithm (the only consumer of this option) will weight nearby cells when selecting cells to draw from for partial imputations. Otherwise, it will use only cells that match the nonmissing data.

impute/output table

Where the fill-ins will be written. You'll still need {\tt checkOutImpute} to produce a completed table. If you give me a value for {\tt impute/earlier output table}, that will be the default output table; if not, the default is named {\tt filled}.

impute/output vars

The variables that will be imputed. For OLS-type models, the left-hand, dependent variable (notice that we still use the plural "vars"). For models that have no distinction between inputs and outputs, this behaves identically to the "impute/vars" key (so only use one or the other).


The seed for the random number generator.


A comma-separated list of the variables to be put into the imputation model. For OLS-type models where there is a distinction between inputs and outputs, don't use this; use the "impute/input vars" and "impute/output vars" keys. Note that this is always the plural "vars", even if you are imputing only one field.
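Putting the imputation keys above together, a sketch of an impute group might look like this (the table and variable names are illustrative):
\begin{lstlisting}[language=]
impute {
    input table: survey
    output table: filled
    draw count: 3
    input vars: age, sex
    output vars: income
}
\end{lstlisting}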


The key where much of the database/input related subkeys are defined. Descriptions of these subkeys can be found elsewhere in the appendix.


Each row specifies another column of data that needs an index. Generally, if you expect to select a subset of the data via some column, or join to tables using a column, then give that column an index. The {\tt id} column you specified at the head of your spec file is always indexed, so listing it here has no effect. Note, however, that the function generate_indices(table_out) (at bridge.c:428) has been moved to run after the recodes.

input/input file

The text file from which to read the data set. This should be in the usual comma-separated format (CSV) with the first row of the file listing column names. We recommend separating|fields|with|pipes, because pipes rarely appear in addresses or other such data.

input/missing marker

How your text file indicates missing data. Popular choices include "NA", ".", "NaN", "N/A", et cetera.

input/output table

Name for the database table generated from the input file.


If {\tt n} or {\tt no}, I will skip the input step if the output table already exists. This makes it easy to re-run a script and only wait through the input step the first time. Otherwise, the default is to overwrite.

input/primary key

The name of the column to act as the primary key. Unlike other indices, the primary key has to be set on input.
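Collecting the input keys above, a sketch of an input group (the file and column names are illustrative):
\begin{lstlisting}[language=]
input {
    input file: survey.csv
    missing marker: NA
    output table: survey
    primary key: record_id
}
\end{lstlisting}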


Specifies the type and range of variables (which is used later in consistency checking).


The set to be merged in to join/host.


The name of the field appearing in both tables on which the join takes place. If you don't provide this, the {\tt id} key is used.


The main data set to be merged with.

join/output table

The name of the table (actually, a view) with the join of both tables. Use this as the basis for subsequent steps.

raking/all vars

The full list of variables that will be involved in the raking. All others are ignored.

raking/count col

If this key is not present, each row is taken to be a single observation, and the rows are counted up to produce the cell counts to which the system will be raking. If this key is present, then the given column in the data set will be used as the cell count.

raking/input table

The table to be raked.

raking/max iterations

If convergence to the desired tolerance isn't achieved by this many iterations, stop with a warning.

raking/run number

If running several raking processes simultaneously via threading on the R side, specify a separate run\_number for each. If single-threading (or if not sure), ignore this.

raking/structural zeros

A list of cells that must always be zero, in the form of SQL statements.

raking/thread count

You can thread either on the R side among several tables, or internally to one table's raking. To thread a single raking process, set this to the number of desired threads.


If the max(change in cell value) from one step to the next is smaller than this value, stop.
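As an illustration of the stopping rules above (a minimal sketch, not Tea's implementation), here is raking, i.e. iterative proportional fitting, for a two-way table; the function and parameter names are hypothetical:

```python
import numpy as np

def rake(table, row_margins, col_margins, tolerance=1e-6, max_iterations=1000):
    """Iterative proportional fitting: alternately scale rows and columns
    to match the target margins, stopping when the largest change in any
    cell is below tolerance, or after max_iterations steps."""
    t = table.astype(float).copy()
    for _ in range(max_iterations):
        prev = t.copy()
        # scale each row so row sums match the row margins
        t *= (row_margins / t.sum(axis=1))[:, None]
        # scale each column so column sums match the column margins
        t *= col_margins / t.sum(axis=0)
        if np.max(np.abs(t - prev)) < tolerance:
            break
    return t

counts = np.array([[10.0, 20.0], [30.0, 40.0]])
fitted = rake(counts, row_margins=np.array([40.0, 60.0]),
              col_margins=np.array([50.0, 50.0]))
```

At convergence the fitted cells sum to the requested margins in both directions, while preserving the interaction structure of the original counts.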

rankSwap/max change

The maximal absolute change in the value of $x$ allowed. That is, if the swap value for $x_i$ is $y$ and $|y - x_i| >$ max change, then the swap is rejected. Default: 1.


The random number generator seed for the rank swapping setup.

rankSwap/swap range

The proportion of ranks to use for the swapping interval. That is, if the current rank is $r$, a swap is possible from rank $r+1$ to $r+\lfloor \hbox{swap range} \times \hbox{length}(x)\rfloor$. Default: 0.5.
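A toy sketch (not Tea's rankSwap routine) of how the swap range and max change parameters interact; all names here are hypothetical:

```python
import random

def rank_swap(x, swap_range=0.5, max_change=1.0, seed=0):
    """Sketch of rank swapping: each value may be swapped with a value whose
    rank lies up to swap_range*len(x) positions above its own, but any swap
    whose absolute change would exceed max_change is rejected."""
    rng = random.Random(seed)
    order = sorted(range(len(x)), key=lambda i: x[i])  # indices sorted by rank
    out = list(x)
    window = int(swap_range * len(x))
    for r, i in enumerate(order):
        hi = min(r + window, len(x) - 1)
        if hi <= r:                      # no candidate ranks above this one
            continue
        j = order[rng.randint(r + 1, hi)]  # pick a partner in (r, r+window]
        if abs(out[j] - out[i]) <= max_change:
            out[i], out[j] = out[j], out[i]
    return out

data = [1.0, 1.5, 2.0, 2.5, 3.0]
swapped = rank_swap(data)
```

Swapping only permutes values, so the marginal distribution of the column is unchanged; only the pairing with other columns is perturbed.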


New variables that are deterministic functions of the existing data sets. There are two forms, one aimed at recodes that indicate a list of categories, and one aimed at recodes that are a direct calculation from the existing fields. For example (using a popular rule that you shouldn't date anybody who is younger than (your age)/2 +7),
\begin{lstlisting}[language=]
recodes {
    pants {
        yes | leg_count = 2
        no  |       #Always include one blank default category at the end.
    }

    youngest_date {
        age/2 + 7
    }
}
\end{lstlisting}
You may chain recode groups, meaning that recodes may be based on previous recodes. Tagged recode groups are done in the sequence in which they appear in the file. [Because the order of the file determines the order of execution, the tags you assign are irrelevant, but I still need distinct tags to keep the groups distinct in my bookkeeping.]
\begin{lstlisting}
recodes [first] {
    youngest_date: (age/7) + 7  #for one-line expressions, you can use a colon.
    oldest_date: (age - 7) * 2
}

recodes [second] {
    age_gap {
        yes | spouse_age > youngest_date && spouse_age < oldest_date
        no  |
    }
}
\end{lstlisting}


Basic item names: listed in imputation order.




BFLD = # of basic items



NEDFF = # of explicit ratios per category



TOTSIC = # of explicit ratios per category



Once it has been established that a record has failed a consistency check, the search for alternatives begins. Say that variables one, two, and three each have 100 options; then there are 1,000,000 combinations to check against possibly thousands of checks. If a timeout is present in the spec (outside of all groups), then the alternative search halts and returns what it has after the given number of seconds have passed.
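For example, to cap the alternative search at 30 seconds, the spec could include the following line outside of all groups (the value here is illustrative):
\begin{lstlisting}[language=]
timeout: 30
\end{lstlisting}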