%This defines the key/value documentation pairs
%Then, key_print calls them, in alphabetical order.
checks
Where the user specifies on which variables she would like to perform consistency checks. The parameters for the variables declared in checks are obtained from input/types.
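For example, a checks group might list the conditions under which a record fails. This is only a sketch: the precise condition syntax is determined by the checks system, and the variable name and bounds here are hypothetical.
\begin{lstlisting}[language=]
checks {
age > 115
age < 0
}
\end{lstlisting}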
database
The database in which all of this work takes place. It must be the first key in your spec file, because all of the later keys are written to the database you specify; if you don't specify a database, then the rest of the keys have nowhere to be written and your spec file will not be read correctly.
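For example, assuming the key: value syntax used throughout this appendix, the first line of a spec file might read as follows (the filename is purely illustrative):
\begin{lstlisting}[language=]
database: mydata.db
\end{lstlisting}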
ExpRatios
User-supplied Explicit ratios for basic items.
group recodes
Much like recodes (q.v.), but for variables set within a group, such as the
eldest member of a household.
For example,
\begin{lstlisting}[language=]
group recodes {
group id : hh_id
eldest: max(age)
youngest: min(age)
household_size: count(*)
total_income: sum(income)
mean_income: avg(income)
}
\end{lstlisting}
group recodes/group id
The column with a unique ID for each group (e.g., household number).
id
Names a column in the data set that provides a unique identifier for each observation.
Some procedures need such a column; e.g., multiple imputation will store imputations in a
table separate from the main dataset, and will require a means of putting imputations in
their proper place. Other elements of Tea, like flagging for disclosure avoidance, use the
same identifier. This identifier may be built by a recode.
impute
The key where the user defines all of the subkeys related to the doMImpute() part of the imputation process. For details on these subkeys, see their descriptions elsewhere in the appendix.
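For example, here is a sketch of an impute group, using the same brace-and-colon syntax as the recodes examples in this appendix. The table names, variable names, and the {\tt ols} method value are purely illustrative:
\begin{lstlisting}[language=]
impute {
    input table: dc
    output table: filled
    method: ols
    input vars: age, sex
    output vars: income
    draw count: 5
    seed: 2345
}
\end{lstlisting}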
impute/categories
Denotes the categorized set of variables by which to impute your output vars.
impute/draw count
How many multiple imputations should we do? Default: 1.
impute/earlier output table
If this imputation depends on a previous one, then give the fill-in table from the previous output here.
impute/input table
The table holding the base data, with missing values.
Optional; if missing, then I rely on the system having an active table already recorded. For example, if you've already called {\tt doInput()} in R, then I can pick up that the output from that routine (which may be a view, not the table itself) is the input to this one.
impute/input vars
A comma-separated list of the independent, right-hand side variables for imputation methods such as OLS that require them. These variables are taken as given and will not be imputed in this step, so you probably need to have a previous imputation step to ensure that they are complete.
impute/margin table
Raking only: if you need to fit the model's margins to out-of-sample data, specify that data set here.
impute/method
Specifies what model to use to impute output vars for a given impute key.
impute/min group size
Specifies the minimum number of known inputs that must be present in order to perform an imputation on a set of data points.
impute/near misses
If this is set to any value, then the EM algorithm (the
only consumer of this option) will weight nearby cells when selecting cells to draw
from for partial imputations. Otherwise, it will use only cells that match the nonmissing data.
impute/output table
Where the fill-ins will be written. You'll still need {\tt checkOutImpute} to produce a completed table. If you give me a value for {\tt impute/earlier output table}, that will be the default output table; if not, the default is named {\tt filled}.
impute/output vars
The variables that will be imputed. For OLS-type models, the left-hand, dependent variable (notice that we still use the plural "vars"). For models that have no distinction between inputs and outputs, this behaves identically to the "impute/vars" key (so only use one or the other).
impute/seed
The seed for the random number generator.
impute/vars
A comma-separated list of the variables to be put into the imputation model.
For OLS-type models where there is a distinction between inputs and outputs, don't use this; use the "impute/input vars" and "impute/output vars" keys. Note that this is always the plural "vars", even if you are imputing only one field.
input
The key where much of the database/input related subkeys are defined.
Descriptions of these subkeys can be found elsewhere in the appendix.
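To make the subkeys below concrete, here is a sketch of an input group; the file and table names are hypothetical, and the syntax follows the brace-and-colon form used in the recodes examples:
\begin{lstlisting}[language=]
input {
    input file: survey.csv
    output table: dc
    missing marker: NA
    overwrite: no
}
\end{lstlisting}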
input/indices
Each row specifies another column of data that needs an index. Generally, if you expect to select a subset of the data via some column, or to join tables using a column, then give that column an index. The {\tt id} column you specified at the head of your spec file is always indexed, so listing it here has no effect. Note, however, that the function generate_indices(table_out) (bridge.c:428) now runs after the recodes.
input/input file
The text file from which to read the data set. This should be in
the usual comma-separated format (CSV), with the first row of the file listing column names. We recommend separating|fields|with|pipes, because pipes rarely appear in addresses or other such data.
input/missing marker
How your text file indicates missing data. Popular choices include "NA", ".", "NaN", "N/A", et cetera.
input/output table
Name for the database table generated from the input file.
input/overwrite
If {\tt n} or {\tt no}, I will skip the input step if
the output table already exists. This makes it easy to re-run a script and only wait
through the input step the first time. Otherwise, the default is to overwrite.
input/primary key
The name of the column to act as the primary key. Unlike other indices, the primary key has to be set on input.
input/types
Specifies the type and range of variables (which is used later in consistency checking).
join/add
The set to be merged in to join/host.
join/field
The name of the field appearing in both tables on which the join takes place. If you don't provide this, the column named by the {\tt id} key is used.
join/host
The main data set to be merged with.
join/output table
The name of the table (actually, a view) with the join of both tables. Use this as the basis for subsequent steps.
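Putting the join subkeys together, a hypothetical example might read as follows (all table and field names are invented for illustration):
\begin{lstlisting}[language=]
join {
    host: people
    add: households
    field: hh_id
    output table: joined
}
\end{lstlisting}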
raking/all vars
The full list of variables that will be involved in the
raking. All others are ignored.
raking/count col
If this key is not present, each row is taken to be a
single observation, and the rows are counted up to produce the cell counts to which the
system will be raking. If this key is present, then this column in the data set
will be used as the cell count.
raking/input table
The table to be raked.
raking/max iterations
If convergence to the desired tolerance isn't
achieved by this many iterations, stop with a warning.
raking/run number
If running several raking processes simultaneously via
threading on the R side, specify a separate run\_number for each. If
single-threading (or if not sure), ignore this.
raking/structural zeros
A list of cells that must always be zero,
in the form of SQL statements.
raking/thread count
You can thread either on the R side among several tables,
or internally to one table raking. To thread a single raking process, set this to the
number of desired threads.
raking/tolerance
If the max(change in cell value) from one step to the next
is smaller than this value, stop.
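As a sketch, a raking group combining the subkeys above might read as follows (the table and variable names are hypothetical):
\begin{lstlisting}[language=]
raking {
    input table: dc
    all vars: age_cat, sex, race
    max iterations: 1000
    tolerance: 1e-5
}
\end{lstlisting}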
rankSwap/max change
The maximal absolute change in the value of $x$ allowed.
That is, if the swap value for $x_i$ is $y$ and $|y - x_i| > {}$max change,
then the swap is rejected.
Default: 1.
rankSwap/seed
The random number generator seed for the rank swapping setup.
rankSwap/swap range
The proportion of ranks to use for the swapping interval. That is,
if the current rank is $r$, then a swap is possible from rank $r+1$ to rank $r + \lfloor \mbox{swap range} \times \mbox{length}(x) \rfloor$.
Default: 0.5.
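For example, a rankSwap group that restates the defaults above, plus a seed (a sketch; the brace-and-colon syntax mirrors the other examples in this appendix):
\begin{lstlisting}[language=]
rankSwap {
    max change: 1
    swap range: 0.5
    seed: 2345
}
\end{lstlisting}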
recodes
New variables that are deterministic functions of the existing data sets.
There are two forms, one aimed at recodes that indicate a list of categories, and one
aimed at recodes that are a direct calculation from the existing fields.
For example (using a popular rule that you shouldn't date anybody who is younger than
(your age)/2 +7),
\begin{lstlisting}[language=]
recodes {
pants {
yes | leg_count = 2
no | #Always include one blank default category at the end.
}
youngest_date {
age/2 + 7
}
}
\end{lstlisting}
You may chain recode groups, meaning that recodes may be based on previous recodes. Tagged
recode groups are done in the sequence in which they appear in the file. [Because the
order of the file determines order of execution, the tags you assign are irrelevant, but
I still need distinct tags to keep the groups distinct in my bookkeeping.]
\begin{lstlisting}
recodes [first] {
youngest_date: (age/7) +7 #for one-line expressions, you can use a colon.
oldest_date: (age -7) *2
}
recodes [second] {
age_gap {
yes | spouse_age > youngest_date && spouse_age < oldest_date
no |
}
}
\end{lstlisting}
SPEERfields
Basic item names, listed in imputation order.
SPEERparams/BFLD
The number of basic items.
SPEERparams/NEDFF
The number of explicit ratios per category.
SPEERparams/TOTSIC
The number of explicit ratios per category.
timeout
Once it has been established that a record has failed a consistency
check, the search for alternatives begins. Say that variables one, two, and three each have 100
options; then there are 1,000,000 combinations to check against possibly thousands
of checks. If a timeout is present in the spec (outside of all groups), then the
alternative search halts and returns what it has after the given number of seconds
have passed.
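For example, a single top-level line such as the following (placed outside all groups) would cap each alternative search at thirty seconds:
\begin{lstlisting}[language=]
timeout: 30
\end{lstlisting}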