Creating a New Project & Importing Data

1. In Evince, a new project can be created from the file menu or from the project wizard.

 

2. The user can then browse for a data file to import. Evince supports, for example, ASCII text files (.txt, .csv and .dat extensions), xls, mat, sdf and image files (.gif, .tif, .bmp, .png and .jpg extensions). See Appendix I for a complete list of the import formats supported by Evince. The first step of the Data Import Wizard has three tabs. The first tab is for the basic import of a single file. The second tab lets the user import two or more files at the same time. The third tab imports a file from a database.

 

3. For ASCII (text) files, the next step in the import is choosing the appropriate data delimiter. This step applies only to text files. Most of the time, Evince detects the delimiter automatically. If an incorrect delimiter is chosen, the user can change it manually. When the correct delimiter is chosen, the imported data is displayed as a spreadsheet. For other file types (such as MATLAB and Excel), this step is replaced by a window showing information about the size of the imported data. For image files, this step does not appear at all.
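To illustrate what automatic delimiter detection involves, the following is a minimal sketch using the `csv.Sniffer` class from the Python standard library. It is not Evince's own detection routine; the function name `detect_delimiter`, the candidate delimiters and the sample text are assumptions made for this example.

```python
import csv

def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a text sample (hypothetical helper).

    Falls back to a comma when no candidate is used consistently.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","

# Hypothetical file content: three lines separated by semicolons.
sample = "id;height;weight\nA;1.80;75\nB;1.65;60\n"

delimiter = detect_delimiter(sample)
# With the correct delimiter, the text splits into spreadsheet-like rows.
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
```

If the guess is wrong, the rows come out as a single column, which corresponds to the case where the delimiter has to be changed manually.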

 

4. The main import step lets the user view the entire data set and modify it. The size of the imported data is shown in the field "Data summary" at the top left of this window. Selections are made by left- or right-clicking on the headers of the rows or columns (represented by grey cells with numbers). With a right-click, only one row or column can be selected at a time. With a left-click, the Ctrl key on the keyboard can be held down to select several rows or columns at once.

The buttons below the "Identifiers" tab in the "Identifier" field set the selected rows or columns to represent identifiers or categories. An observation or a variable can have several identifiers and categories. An identifier is a description of an observation or a variable, for example a name or a property. A category is used for dividing observations or variables into different classes. The button "Primary var" sets an entire row to represent primary variable identifiers. The button "Variables" sets an entire row to represent secondary variable identifiers. The button "Primary Obs" sets an entire column to represent primary observation identifiers, while the button "Observations" sets an entire column to represent secondary observation identifiers. Below the "Identifier" field, there is a button for setting the "Datatype" to X, Y or SMILES (Simplified Molecular Input Line Entry Specification). Above the "Identifiers" tab, there is a check-box for transposing the imported data. Transposing a data set means that its observations become variables and vice versa. At the bottom of the "Identifiers" tab, there are two buttons, "Include" and "Exclude", which include the selected rows or columns in the imported data or exclude them from it.
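The roles of identifiers and transposing can be illustrated with a small sketch. It assumes a table in which the first row holds the primary variable identifiers and the first column holds the primary observation identifiers; the function names and the data are hypothetical and do not reflect how Evince stores a data set internally.

```python
# Hypothetical imported table: row 0 = variable identifiers,
# column 0 = observation identifiers, the rest = numerical data.
table = [
    ["",  "height", "weight"],
    ["A", 1.80,     75.0],
    ["B", 1.65,     60.0],
    ["C", 1.72,     68.0],
]

def split_identifiers(table):
    """Separate identifiers from the numerical block."""
    variable_ids = table[0][1:]
    observation_ids = [row[0] for row in table[1:]]
    values = [row[1:] for row in table[1:]]
    return observation_ids, variable_ids, values

def transpose(values):
    """Swap rows and columns, so observations become variables."""
    return [list(column) for column in zip(*values)]

observation_ids, variable_ids, values = split_identifiers(table)
transposed = transpose(values)
```

After transposing, the former variable identifiers ("height", "weight") describe the rows, and the former observation identifiers ("A", "B", "C") describe the columns.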

 

Right-clicking on the header of either a row or a column opens the import menu. This menu offers most of the functionality that is available through the buttons to the left of the window. The first segment of the menu is used for setting the selected variable(s) or observation(s) to primary or secondary identifiers, to category or to numerical data. For a column selection, numerical data can be "X Data", "Y Data" or "SMILES". For a row selection, only "Numerical Data" is available as "Datatype". The second segment is used for including and excluding rows or columns. The third segment of the menu is for shifting cells up/down or left/right. Columns can be shifted up or down, while rows can be shifted left or right. For example, shifting a certain row to the right means that all cells in that row appear one step to the right of their original positions. The last segment is used for pasting new data into the imported data set. Evince pastes the last entry of the clipboard at a position before or after the selected row or column. Before the new data is pasted, the user must specify the correct delimiter.
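As an illustration of shifting a row to the right, the sketch below moves every cell of one row one step to the right. It assumes that the vacated first cell is left empty and that shorter rows are padded so the table stays rectangular; how Evince treats these edge cases is not stated here, and the function name `shift_row_right` is hypothetical.

```python
def shift_row_right(table, row_index, fill=""):
    """Return a copy of the table with one row shifted one step right.

    Assumption: the vacated cell is filled with `fill`, and shorter
    rows are padded with `fill` so that all rows have equal length.
    """
    shifted = [list(row) for row in table]
    shifted[row_index].insert(0, fill)
    width = max(len(row) for row in shifted)
    for row in shifted:
        row.extend([fill] * (width - len(row)))
    return shifted

# Hypothetical case: the header row lacks a cell above the name column,
# so the variable names sit one step too far to the left.
table = [
    ["height", "weight"],
    ["A", "1.80", "75"],
    ["B", "1.65", "60"],
]

aligned = shift_row_right(table, 0)
```

A typical use is the one shown: a header row that is one cell short can be shifted so that the variable names line up with their columns.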

 

Below the "Tools" tab, there is a button labeled "Auto Identify", which tells Evince to automatically identify all data cells, i.e. set them to observation identifiers, variable identifiers etc. By default, the data cells are not identified if the number of data cells exceeds 100 000. All identifications are removed if the button "Clear" is used. Further below, there is a panel labeled "Missing values", which is used for specifying how to handle missing values in the imported data. Next to "Denoted by:" is a text field for entering the representation of missing values in the imported data. (To preview the missing values in the table, press the return key.) The user can also choose to exclude observations and variables that have a certain percentage of missing values by clicking on the "Observations" and "Variables" buttons. The default missing value cut-off is 50% for both observations and variables.
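The effect of a missing value cut-off can be sketched as follows. The example assumes that missing values are denoted by the text "NA" and that a row or column is excluded when its fraction of missing values is strictly greater than the cut-off; both are assumptions for illustration, as are the function names.

```python
MISSING = "NA"  # hypothetical "Denoted by:" entry

def missing_fraction(cells, marker=MISSING):
    """Fraction of cells that equal the missing-value marker."""
    return sum(cell == marker for cell in cells) / len(cells)

def rows_over_cutoff(table, cutoff=0.5, marker=MISSING):
    """Indices of observations (rows) with too many missing values."""
    return [i for i, row in enumerate(table)
            if missing_fraction(row, marker) > cutoff]

def columns_over_cutoff(table, cutoff=0.5, marker=MISSING):
    """Indices of variables (columns) with too many missing values."""
    return [j for j, column in enumerate(zip(*table))
            if missing_fraction(column, marker) > cutoff]

table = [
    ["1.0", "NA", "3.0", "NA"],   # 2 of 4 missing: exactly at the cut-off
    ["NA",  "NA", "NA",  "4.0"],  # 3 of 4 missing: over the cut-off
    ["2.0", "NA", "1.5", "0.5"],  # 1 of 4 missing
]

excluded_rows = rows_over_cutoff(table)
excluded_columns = columns_over_cutoff(table)
```

With the default cut-off of 50%, the second observation and the second variable would be flagged in this example.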

Further down, there is a panel labeled "Variance Auto-exclude", which is used for removing observations and variables with low variation. During the import, Evince automatically checks for variables with no variation, i.e. variables whose values are all the same. In such a case, a dialog appears that notifies the user of the removal. With the "Observations" and "Variables" buttons, the user can also specify exactly which observations and variables should be removed according to their individual variances.
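A variance-based exclusion can be sketched in a few lines. The example uses the population variance from the Python standard library and a hypothetical threshold; which variance definition and threshold Evince applies is not specified here, so both are assumptions.

```python
from statistics import pvariance

def constant_columns(values):
    """Indices of variables whose values are all the same."""
    return [j for j, column in enumerate(zip(*values))
            if pvariance(column) == 0]

def low_variance_columns(values, threshold):
    """Indices of variables whose variance is at or below a threshold."""
    return [j for j, column in enumerate(zip(*values))
            if pvariance(column) <= threshold]

# Hypothetical data: the second variable is constant,
# the third varies only slightly.
values = [
    [1.0, 5.0, 0.10],
    [2.0, 5.0, 0.11],
    [3.0, 5.0, 0.10],
]

no_variation = constant_columns(values)
low_variation = low_variance_columns(values, threshold=0.001)
```

The same functions can be applied to observations instead of variables by transposing the data first.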

 

5. The last step of the data import is used for naming the project and for choosing which items to create when the import is finished. The only item created by default is the "DataSet table". The user can choose to create a data table of the imported data and also to automatically create a multivariate PCA or PLS model along with model collection plots. If a PLS model is to be created, at least one variable must be set to Y in the previous step.

The user can also change the template that is used. The template controls how data is processed and whether any plots appear when the data import is completed. The type of the imported file determines which templates are available to the user. The standard template is chosen by default for most file types. This template applies variable mean centering and unit variance scaling. The following templates are available in Evince:
- None; Only variable mean centering is applied to the DataSet.
- Standard; Variable mean centering and unit variance scaling are applied to the DataSet.
- Spectroscopy; Variable mean centering is applied to the DataSet and a spectral plot is created.
- Standard Image (only available for image data); Variable mean centering is applied to the DataSet, and a spectral plot as well as an RGB image are created for the data.
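The two preprocessing operations named in the templates can be sketched as follows. The example assumes that unit variance scaling divides each centered variable by its sample standard deviation (with n - 1 in the denominator), which is a common convention but is not confirmed here for Evince; the function names are hypothetical.

```python
from statistics import mean, stdev

def mean_center(values):
    """Subtract each variable's (column's) mean from its values."""
    columns = [list(column) for column in zip(*values)]
    centered = [[x - mean(column) for x in column] for column in columns]
    return [list(row) for row in zip(*centered)]

def autoscale(values):
    """Mean center, then scale each variable to unit variance.

    Assumption: the sample standard deviation is used. Constant
    variables are set to 0.0 to avoid division by zero.
    """
    columns = [list(column) for column in zip(*values)]
    scaled = []
    for column in columns:
        m, s = mean(column), stdev(column)
        scaled.append([(x - m) / s if s else 0.0 for x in column])
    return [list(row) for row in zip(*scaled)]

# Hypothetical data: two variables on very different scales.
values = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]

centered = mean_center(values)
scaled = autoscale(values)
```

After unit variance scaling, both variables span the same range even though their original scales differ by a factor of ten, which is why this scaling is often used when variables have different units.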

At the bottom of the dialog for this step, there is a check-box for enabling the feature "Keep data on disk". When enabled, the imported file is linked to the new project and its case. Changes in the DataSet can be applied to the file and vice versa. This feature conserves RAM (random access memory), since the DataSet is kept on the computer's hard drive instead of being loaded into memory.
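The general idea of working on data that stays on disk can be illustrated with a memory-mapped file. This is only a sketch of the concept using the Python standard library; it is not a description of how Evince implements "Keep data on disk", and the file name and binary layout are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Write a small hypothetical data file: four 8-byte little-endian floats.
path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4d", 1.0, 2.0, 3.0, 4.0))

# Map the file instead of reading it all into memory. Only the parts
# that are accessed need to be loaded, and writes go back to the file.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:
        third = struct.unpack_from("<d", view, 2 * 8)[0]  # read one value
        struct.pack_into("<d", view, 2 * 8, 30.0)         # change it in place
        view.flush()

# The change made through the mapping is now stored in the file.
with open(path, "rb") as f:
    stored = struct.unpack("<4d", f.read())
```

The trade-off is the usual one for disk-backed data: less memory is needed, but access is typically slower than for data held entirely in RAM.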