6 Data Export

In the SPARK data export, the data of a project can be downloaded as one or more comma-separated values (.csv) files. These files follow the conventions described below.

The filename has the format organizationId_projectId{*}.csv. The {*} part is empty when the full dataset is exported as one file; when data is downloaded in separate files, it denotes the selection of data in that file (see below). Note that organization identifiers (organizationId) are set by the organization administrator and are not persistent unique identifiers such as ROR identifiers (Research Organization Registry identifiers). Project identifiers are set by the project owner and have to be unique within the organization, but can collide with project identifiers in other organizations.

The data will have the following column names and contents:

userId: the unique identifier of the user in the project.
timestamp: the date, time, and timezone in ISO 8601 format (e.g. 2024-05-23T21:47:27+07:00).
datasourceType: the type of data source for which this row contains information. This can have the following values:
- answer: a participant’s answer to a question
- storage_node: manually stored data in a decision tree Data Storage Node
- storage_question: manually stored data in a Data Storage Question
- storage_state: manually stored data in a state transition
- spark_metadata: SPARK behavior metadata
datasourceProvenance: the data source provenance; contextual information specifying the data origins. This can have the following values:
- for data of the answer data source type, the identifier of the question set
- for data stored through a node, question, or state storage specification, the identifier of the relevant data source (e.g. of the third-party API)
- for SPARK behavior metadata, in the current version of SPARK, the only valid value is treePath, signifying that the value specified in this row is a Decision Tree Path string.
datasourceId: the identifier of the relevant data source. This can have the following values:
- for data of the answer data source type, the identifier of the question that was answered.
- for data of one of the three storage types (storage_node, storage_question, or storage_state), the identifier of that data storage directive.
- for data of the spark_metadata data source type, the relevant identifier; for a treePath, this is the identifier of the relevant tree.
value: the actual data point. For Tree Paths, this lists the identifiers of the nodes that were traversed from the tree’s root (the initial node, which has the tree identifier as its identifier) to the ultimate leaf (the terminal node), separated by greater-than signs, for example treeId > firstNodeId > secondNodeId > terminalNodeId.
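
As an illustration of the column conventions above, the following Python sketch reads an export, parses the ISO 8601 timestamps, and splits Tree Path values into node identifiers. It rests on assumptions that are not confirmed by this document: that the file starts with a header row holding the column names in the order listed above, and that node identifiers in a Tree Path are separated by a greater-than sign with optional surrounding spaces. The sample rows and the identifiers in them (user01, moodQuestions, mood1) are invented for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample export; all identifiers are invented for illustration.
# Assumption: the first row is a header with the column names.
SAMPLE = """userId,timestamp,datasourceType,datasourceProvenance,datasourceId,value
user01,2024-05-23T21:47:27+07:00,answer,moodQuestions,mood1,4
user01,2024-05-23T21:48:02+07:00,spark_metadata,treePath,treeId,treeId > firstNodeId > secondNodeId > terminalNodeId
"""


def parse_tree_path(value):
    """Split a Decision Tree Path string into a list of node identifiers."""
    return [node.strip() for node in value.split(">")]


def read_export(handle):
    """Yield each row as a dict with a parsed timestamp and split tree paths."""
    for row in csv.DictReader(handle):
        # fromisoformat keeps the UTC offset, giving a timezone-aware datetime.
        row["timestamp"] = datetime.fromisoformat(row["timestamp"])
        is_tree_path = (
            row["datasourceType"] == "spark_metadata"
            and row["datasourceProvenance"] == "treePath"
        )
        if is_tree_path:
            row["value"] = parse_tree_path(row["value"])
        yield row


rows = list(read_export(io.StringIO(SAMPLE)))
```

For a real export, the io.StringIO(SAMPLE) argument would be replaced by a file opened with open(path, newline=""). All other values are left as strings, since this document does not specify their types.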

When downloading the data, users can choose whether they want to download:

one .csv file;
a .zip archive containing separate .csv files for each value of datasourceType (in which case the filenames contain in place of the {*} placeholder (see above) either _answer, _storage_node, _storage_question, _storage_state, or _spark_metadata);
a .zip archive containing separate .csv files for each value of datasourceProvenance (in which case the filenames contain, in place of the {*} placeholder (see above), either an underscore immediately followed by the question set identifier, e.g. _moodQuestions; or an underscore immediately followed by the data source API, e.g. _heartRate; or _treePath);
a .zip archive containing separate .csv files for each value of userId (in which case the filenames contain, in place of the {*} placeholder (see above), an underscore immediately followed by the relevant user identifier).
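
The naming scheme for the separate files can be sketched as follows. This is not SPARK's own implementation, only a minimal Python illustration of how rows of a full export would map to the filenames described above when split on one column. The function name split_export and the identifiers exampleOrg and exampleProject are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a full export, reduced to the columns used for splitting.
SAMPLE_ROWS = [
    {"userId": "user01", "datasourceType": "answer", "datasourceProvenance": "moodQuestions"},
    {"userId": "user02", "datasourceType": "answer", "datasourceProvenance": "moodQuestions"},
    {"userId": "user01", "datasourceType": "spark_metadata", "datasourceProvenance": "treePath"},
]


def split_export(rows, organization_id, project_id, column):
    """Group rows by the value of `column` and key each group by its filename.

    The {*} placeholder is filled with an underscore followed by the column
    value, as described for datasourceType, datasourceProvenance, and userId.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    prefix = f"{organization_id}_{project_id}"
    return {f"{prefix}_{value}.csv": group for value, group in groups.items()}


by_type = split_export(SAMPLE_ROWS, "exampleOrg", "exampleProject", "datasourceType")
by_user = split_export(SAMPLE_ROWS, "exampleOrg", "exampleProject", "userId")
```

Because identifiers may themselves contain underscores (storage_node, for example), the sketch only builds filenames and does not try to parse them back into their parts.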