Data Processing
    CHIMGEN    2019-02-14


    1. Neuroimaging data preprocessing

    We developed a CHIMGEN pipeline to preprocess the multi-modal neuroimaging data. The pipeline integrates the following popular neuroimaging tools:

    (1) dicm2nii (https://github.com/xiangruili/dicm2nii)

    (2) FreeSurfer (http://www.freesurfer.net/fswiki)

    (3) SPM (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/)

    (4) FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/)

    (5) CAT12 (http://www.neuro.uni-jena.de/cat/)

    (6) Diffusion toolkit/TrackVis (http://www.trackvis.org)

    (7) DKI (https://github.com/NYU-DiffusionMRI/Diffusion-Kurtosis-Imaging)

    The integrated pipeline supports parallel computation across multiple clusters, and the total computation time can be greatly reduced by running it on the Tianhe-1 supercomputer (https://www.nscc-tj.cn). Finally, a standardized automatic workflow was developed for each type of neuroimaging data to increase preprocessing consistency across participants. The specific preprocessing procedures for each data type are as follows:

    1.1 Structural MRI data

    1.1.1 Voxel-based morphometry (VBM)

    The VBM analyses are performed using the CAT12 software package (version r1364, http://dbm.neuro.uni-jena.de/cat) with a standard configuration.

    (1) Bias correction

    Image intensity inhomogeneity caused by the B1 bias field is corrected to allow accurate segmentation of the brain tissues. The bias-corrected images have more uniform intensities within each tissue type.

    (2) Segmentation

    The bias-corrected structural images are segmented into gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) using a reliable segmentation model. The segmentation model is based on an adaptive Maximum A Posteriori (MAP) technique that does not require a priori information about tissue probabilities.

    (3) Creating population-specific tissue templates

    To improve the quality of registration, population-specific tissue probability templates in Montreal Neurological Institute (MNI) space are derived from 6000 CHIMGEN subjects by the DARTEL toolbox implemented in SPM12.

    (4) Spatial normalization

    The segmented images are spatially normalized to the population-specific templates using a two-step DARTEL algorithm and resampled to an isotropic voxel size of 1.5 mm. Modulation is performed on the normalized GM images to preserve the absolute volume of GM tissue. For detailed information, please refer to the “Features” part of the website: http://dbm.neuro.uni-jena.de/cat/index.html#VBM.

    (5) Smoothing

    The resulting images are smoothed with a full width at half maximum (FWHM) kernel of 8 mm to compensate for residual anatomical differences between subjects.
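
    As an illustration of this smoothing step, the following Python sketch applies an 8-mm FWHM Gaussian kernel to a normalized GM image using nibabel and SciPy; the file names are hypothetical placeholders, and the actual pipeline performs this step within CAT12/SPM.

    import numpy as np
    import nibabel as nib
    from scipy.ndimage import gaussian_filter

    FWHM_MM = 8.0

    img = nib.load("mwp1sub0001.nii")              # hypothetical modulated, warped GM image
    data = img.get_fdata()
    voxel_sizes = img.header.get_zooms()[:3]       # e.g. (1.5, 1.5, 1.5) after DARTEL resampling

    # Convert the FWHM (in mm) to a Gaussian sigma in voxel units for each axis.
    sigma_vox = [FWHM_MM / (2.0 * np.sqrt(2.0 * np.log(2.0))) / vs for vs in voxel_sizes]
    smoothed = gaussian_filter(data, sigma=sigma_vox)

    nib.save(nib.Nifti1Image(smoothed, img.affine, img.header), "smwp1sub0001.nii")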

    1.1.2 Surface-based morphometry (SBM)

    The SBM analyses are performed using the FreeSurfer v6.0.0 package (http://surfer.nmr.mgh.harvard.edu/) with the following steps:

    (1) Skull stripping

    The automated skull-stripping was performed to separate the brain from non-brain tissues in structural MR images.

    (2) Intensity normalization

    Intensity non-uniformity due to variations in the sensitivity of the reception coil and gradient-driven eddy currents was corrected, and intensity-normalized images were generated.

    (3) Tissue segmentation

    The skull-stripped and intensity-normalized images were processed by a series of tissue segmentation procedures based on intensity and neighbor constraints. This step generates the boundary between GM and WM. The segmentation of subcortical white and gray matter structures was then performed.

    (4) Surface reconstruction

    A two-dimensional tessellated mesh was constructed based on the boundary between WM and GM to generate the WM surface in each hemisphere, and the WM surface was extended outwards by tracking the gray matter intensity gradient to generate the pial surface. Topology correction was performed to repair topological defects.

    (5) Metric reconstruction

    Surface-based metrics such as the cortical thickness, surface area, cortical volume and cortical curvature were calculated based on the pial and white matter surfaces.

    (6) Spherical normalization

    Individual surfaces were then inflated into a spherical space and registered to a spherical atlas in MNI152 space (the fsaverage atlas). Surface-based metrics for each cortical area were extracted based on the predefined surface atlas after registering it to individual space using the spherical registration parameters.

    (7) Smoothing

    The surface-based metrics were smoothed with a Gaussian kernel of 20 mm width.
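
    For illustration, the SBM stream above can be driven from Python by calling FreeSurfer's recon-all via subprocess, as in the minimal sketch below; the subject label and input path are hypothetical, and cluster submission details are omitted.

    import subprocess

    subject_id = "sub0001"                          # hypothetical subject label
    t1_image = "/data/raw/sub0001_T1w.nii.gz"       # hypothetical input path

    # Full cortical reconstruction, covering steps (1)-(6): skull stripping,
    # intensity normalization, tissue segmentation, surface reconstruction,
    # metric computation and spherical registration to fsaverage.
    subprocess.run(["recon-all", "-i", t1_image, "-s", subject_id, "-all"], check=True)

    # Resample the surface metrics to fsaverage and smooth them (step 7);
    # -qcache precomputes maps at several FWHM levels, including 20 mm.
    subprocess.run(["recon-all", "-s", subject_id, "-qcache"], check=True)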

    1.2 DTI

    (1) Brain extraction (BET)

    The non-brain tissues of the b = 0 images were removed by applying the brain extraction tool (BET) in FSL.

    (2) Motion and distortion correction (EDDY)

    An “EDDY_OPENMP” program implemented in FSL 5.0.10 was used to evaluate and repair the image displacement and signal dropout caused by head motion, as well as the image distortion caused by eddy currents.

    (3) Tensor metric calculation (DTIFIT)

    The linear least squares algorithm was used to estimate the diffusion tensor and its derived metrics using the DTIFIT program in FSL. In this step, diffusion metrics such as the three eigenvalues, fractional anisotropy (FA) and mean diffusivity (MD) were generated.
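
    A minimal Python sketch of steps (1)-(3), calling the FSL command-line tools via subprocess, is shown below; the file names, acquisition parameter files and BET threshold are hypothetical placeholders rather than the exact CHIMGEN configuration.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # (1) Brain extraction on the b = 0 image; -m also writes the binary mask
    # b0_brain_mask.nii.gz used in the following steps.
    run(["bet", "b0.nii.gz", "b0_brain", "-m", "-f", "0.3"])

    # (2) Motion and eddy-current correction with EDDY_OPENMP.
    run(["eddy_openmp",
         "--imain=dwi.nii.gz", "--mask=b0_brain_mask.nii.gz",
         "--acqp=acqparams.txt", "--index=index.txt",
         "--bvecs=bvecs", "--bvals=bvals", "--out=dwi_eddy"])

    # (3) Tensor fitting with DTIFIT (linear least squares), producing the
    # FA, MD and eigenvalue maps.
    run(["dtifit", "-k", "dwi_eddy.nii.gz", "-o", "dti",
         "-m", "b0_brain_mask.nii.gz",
         "-r", "dwi_eddy.eddy_rotated_bvecs", "-b", "bvals"])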

    (4) Spatial normalization estimation (BBR+DARTEL)

    A two-step procedure was used to estimate the co-registration parameters between individual diffusion space and MNI standard space.

    (a) Individual b = 0 images were aligned to the corresponding structural images using the Boundary-Based Registration (BBR) algorithm implemented in FSL.

    (b) The BBR parameters were concatenated with the DARTEL deformation field (from individual space to MNI space) generated in VBM analyses.

    (c) The merged deformation field was used to register the individual diffusion data into the MNI space, or vice versa.
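
    As an illustration of step (a), the boundary-based registration can be run with FSL's epi_reg wrapper, as in the sketch below; the file names are hypothetical, and the concatenation with the DARTEL deformation field (steps b and c) is performed on the SPM/CAT12 side and is not shown.

    import subprocess

    subprocess.run(["epi_reg",
                    "--epi=b0_brain.nii.gz",        # hypothetical b = 0 brain image
                    "--t1=T1w.nii.gz",              # whole-head structural image
                    "--t1brain=T1w_brain.nii.gz",   # skull-stripped structural image
                    "--out=b0_to_T1"],              # writes the BBR affine b0_to_T1.mat
                   check=True)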

    (5) Deterministic fiber tracking

    Deterministic fiber tracking was performed to reconstruct the whole-brain fibers using the HARDI/Q-Ball reconstruction method in Diffusion Toolkit 0.6.4.1 (http://www.trackvis.org) with an angle threshold of 80 degrees and an FA threshold of 0.1.

    (6) Probabilistic fiber tracking (BEDPOSTX+PROBTRACKX)

    A partial volume model (also known as the ball-and-stick model) was applied to estimate the diffusion orientation distribution with the following parameters: maximum number of fibers per voxel = 3, burn-in period = 1000, number of iterations = 1250, and deconvolution model = sticks with a range of diffusivities. This was realized using the bedpostx program in FSL 5.0.10. Then, probabilistic fiber tracking was performed using the probtrackx program with predefined seeds and the following parameters: number of samples = 5000, and angle threshold = 80 degrees.
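
    The sketch below illustrates this step by calling bedpostx and probtrackx2 via subprocess; the directory layout, seed mask and curvature threshold (0.17, roughly cos 80°) are illustrative assumptions rather than the exact CHIMGEN settings.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Fit the partial volume (ball-and-stick) model: up to 3 fibers per voxel,
    # burn-in 1000, 1250 jumps, model 2 = sticks with a range of diffusivities.
    run(["bedpostx", "subject_dwi_dir",
         "-n", "3", "-b", "1000", "-j", "1250", "-model", "2"])

    # Probabilistic tractography from a predefined seed: 5000 samples per seed
    # voxel; the curvature threshold approximates an 80-degree angle threshold.
    run(["probtrackx2",
         "-s", "subject_dwi_dir.bedpostX/merged",
         "-m", "subject_dwi_dir.bedpostX/nodif_brain_mask",
         "-x", "seed_mask.nii.gz",                  # hypothetical seed ROI
         "-P", "5000", "-c", "0.17",
         "--dir=probtrackx_out", "--forcedir", "--opd"])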

    (7) Metric normalization

    The diffusion metrics were normalized into MNI space using the merged deformation field (BBR+DARTEL) generated in step (4) and resampled to an isotropic voxel size of 2 mm.

    (8) Generation of white matter skeleton

    In this step, we used a revised TBSS pipeline to create the white matter skeleton. Instead of nonlinearly aligning the individual FA images directly to the averaged FA template (FMRIB-58) in MNI space with the FNIRT program in FSL, as in the standard TBSS pipeline, we co-registered the individual FA images using the merged deformation field (BBR+DARTEL) generated in step (4). A mean FA image was then created, and a mean FA skeleton of the white matter was generated using the center-of-gravity method. Each subject's aligned FA image was then projected onto the mean FA skeleton by filling the skeleton with FA values from the nearest relevant tract center, which was achieved by searching perpendicular to the local skeleton structure for the maximum value.
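
    The skeleton creation and projection can be reproduced with FSL's tbss_skeleton, as in the sketch below; it assumes the warped individual FA images have already been merged into all_FA.nii.gz and averaged into mean_FA.nii.gz, that the skeleton distance map has been prepared, and that 0.2 is an illustrative skeleton threshold.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Create the mean FA skeleton from the group mean FA image.
    run(["tbss_skeleton", "-i", "mean_FA.nii.gz", "-o", "mean_FA_skeleton.nii.gz"])

    # Project each subject's aligned FA image onto the skeleton by searching
    # perpendicular to the local skeleton for the maximum FA value.
    run(["tbss_skeleton", "-i", "mean_FA.nii.gz",
         "-p", "0.2", "mean_FA_skeleton_mask_dst.nii.gz",
         "LowerCingulum_1mm.nii.gz",                # search-rule mask shipped with FSL
         "all_FA.nii.gz", "all_FA_skeletonised.nii.gz"])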

    1.3 DKI

    (1) Brain extraction (BET)

    The non-brain tissues of the b = 0 images were removed by applying the brain extraction tool (BET) in FSL.

    (2) Motion and distortion correction (EDDY)

    An “EDDY_OPENMP” program implemented in FSL 5.0.10 was used to evaluate and repair the image displacement and signal dropout caused by head motion, as well as the image distortion caused by eddy currents.

    (3) Kurtosis metric calculation

    Marchenko-Pastur principal component analysis (MP-PCA) was used to reduce the noise associated with high b-values in the preprocessed DKI data. A weighted linear least squares estimation algorithm was used to estimate the DKI parameters, from which the DKI metrics (both tensor and kurtosis metrics) were calculated. The tensor metrics include the three eigenvalues, FA and MD. The kurtosis metrics include the radial (RK), axial (AK) and mean kurtosis (MK), as well as white matter microstructural metrics such as the axonal water fraction (AWF), extra-axonal space (EAS) and intra-axonal space (IAS). This step is carried out with the Diffusion Kurtosis Imaging toolbox (https://github.com/NYU-DiffusionMRI/Diffusion-Kurtosis-Imaging).
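
    The CHIMGEN pipeline performs this step with the NYU MATLAB Diffusion-Kurtosis-Imaging toolbox; as an assumed, purely illustrative alternative, the sketch below shows MP-PCA denoising and a weighted-least-squares DKI fit with the DIPY Python library (file names are hypothetical, and the WMTI metrics such as AWF are omitted).

    import nibabel as nib
    import dipy.reconst.dki as dki
    from dipy.core.gradients import gradient_table
    from dipy.denoise.localpca import mppca
    from dipy.io.gradients import read_bvals_bvecs

    img = nib.load("dki_eddy.nii.gz")                       # eddy-corrected DKI data
    data = img.get_fdata()
    bvals, bvecs = read_bvals_bvecs("bvals", "bvecs")
    gtab = gradient_table(bvals, bvecs)

    # MP-PCA denoising to suppress the noise amplified at high b-values.
    denoised = mppca(data, patch_radius=2)

    # Weighted linear least squares DKI fit; yields both tensor and kurtosis maps.
    dki_model = dki.DiffusionKurtosisModel(gtab, fit_method="WLS")
    dki_fit = dki_model.fit(denoised)

    fa, md = dki_fit.fa, dki_fit.md                         # tensor metrics
    mk, ak, rk = dki_fit.mk(), dki_fit.ak(), dki_fit.rk()   # kurtosis metrics

    nib.save(nib.Nifti1Image(mk, img.affine), "dki_MK.nii.gz")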

    (4) Spatial normalization estimation (BBR+DARTEL)

    A two-step procedure was used to estimate the co-registration parameters between individual diffusion space and MNI standard space.

    (a) Individual b = 0 images were aligned to the corresponding structural images using the Boundary-Based Registration (BBR) algorithm implemented in FSL.

    (b) The BBR parameters were concatenated with the DARTEL deformation field (from individual space to MNI space) generated in VBM analyses.

    (c) The merged deformation field was used to register the individual diffusion data into the MNI space, or vice versa.

    (5) Metric normalization

    The diffusion kurtosis metrics were normalized into MNI space using the merged deformation field (BBR+DARTEL) generated in step (4) and resampled to an isotropic voxel size of 2 mm. The normalized kurtosis metrics were also projected onto the white matter skeleton generated in the DTI pipeline.

    1.4 rs-fMRI

    (1) Discarding unstable volumes

    The first five functional volumes were discarded to allow the signal to reach equilibrium and the participants to adapt to the scanning noise. After removal, each subject has 175 time points.
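
    For illustration, removing the first five volumes can be done with nibabel as in the sketch below; the file names and the assumed 180-volume input are hypothetical.

    import nibabel as nib

    img = nib.load("sub0001_rest.nii.gz")        # hypothetical 180-volume rs-fMRI series
    trimmed = img.slicer[:, :, :, 5:]            # drop the first 5 volumes -> 175 remain
    nib.save(trimmed, "sub0001_rest_trimmed.nii.gz")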

    (2) Slice timing correction

    The remaining volumes were corrected for intra-volume temporal differences using sinc-interpolation.

    (3) Head motion correction

    Inter-volume head motion correction is performed via a six-parameter rigid-body transformation. Specifically, each volume is first realigned to the first volume and then realigned to the mean of the realigned volumes.

    (4) Spatial normalization

    To improve coregistration, the non-brain tissue of the mean motion-corrected functional image and of the structural image was first removed. Then, the mean functional image was coregistered to the corresponding structural image using the BBR method. Finally, all motion-corrected functional volumes were spatially normalized to standard MNI space using the deformation fields derived from the aforementioned VBM analysis and resampled to 3-mm isotropic voxels.

    (5) Smoothing

    For independent component analysis, the normalized fMRI data are smoothed with an 8-mm FWHM kernel. For other analyses, such as regional homogeneity (ReHo) and amplitude of low-frequency fluctuation (ALFF), smoothing (with the same kernel) is performed after these measures are computed.

    (6) Regressing out sources of noise

    We regressed out several sources of variance, including the linear drift, Friston's 24 head motion parameters, the signal from a region centered in the white matter, and the signal from a ventricular region of interest. Of note, whether the global signal should be regressed out is still under debate; thus, both data with and without global signal regression are generated during preprocessing.
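
    A minimal NumPy sketch of the nuisance regression is given below: the confound matrix (linear drift, Friston's 24 motion parameters, WM and CSF signals, and optionally the global signal) is regressed out of every voxel time series by ordinary least squares. Variable names are hypothetical.

    import numpy as np

    def regress_out(ts, confounds):
        """ts: (T, V) voxel time series; confounds: (T, K) nuisance regressors."""
        n_vols = ts.shape[0]
        design = np.column_stack([np.ones(n_vols), confounds])   # add an intercept
        beta, *_ = np.linalg.lstsq(design, ts, rcond=None)
        return ts - design @ beta                                 # residuals = cleaned data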

    (7) Scrubbing

    We also calculated frame-wise displacement (FD), which indexes volume-to-volume changes in head position. These changes were obtained from the derivatives of the rigid-body realignment estimates used to realign the fMRI data. Movement-contaminated time points are defined by FD (Jenkinson) > 0.5 mm in the current study. The contaminated volumes, as well as 1 volume forward and 2 volumes backward from each of them, were removed and then imputed using cubic spline interpolation. If the number of imputed volumes exceeds one third of the total length, the participant is excluded from the following analyses.
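
    The study uses the Jenkinson formulation of FD; for illustration only, the sketch below computes the simpler Power FD from the six realignment parameters (three translations in mm, three rotations in radians, rotations converted to arc length on a 50-mm sphere) and flags contaminated volumes together with their neighbours. The realignment file name and the reading of the neighbour window as 1 volume before and 2 after are assumptions.

    import numpy as np

    def fd_power(rp, radius=50.0):
        """rp: (T, 6) realignment parameters [tx, ty, tz, rx, ry, rz]."""
        rp = np.asarray(rp, dtype=float)
        d = np.diff(rp, axis=0)
        d[:, 3:] *= radius                         # rotations (rad) -> arc length (mm)
        return np.concatenate([[0.0], np.abs(d).sum(axis=1)])

    rp = np.loadtxt("rp_sub0001.txt")              # hypothetical realignment parameters
    fd = fd_power(rp)
    bad = np.flatnonzero(fd > 0.5)                 # contaminated time points
    # Flag neighbouring volumes as well (interpreted here as 1 before, 2 after).
    flagged = np.unique(np.clip(
        np.concatenate([bad - 1, bad, bad + 1, bad + 2]), 0, len(fd) - 1))
    exclude_subject = flagged.size > len(fd) / 3.0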

    (8) Filtering

    The temporal band-pass filtering (0.01-0.08 Hz) was performed on time series of each voxel to reduce the effects of low-frequency drift and high-frequency noise.
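
    A minimal sketch of the band-pass step with a zero-phase Butterworth filter from SciPy is shown below; the repetition time (2 s) and filter order are assumptions.

    from scipy.signal import butter, filtfilt

    def bandpass(ts, tr=2.0, low=0.01, high=0.08, order=2):
        """ts: (T, V) voxel time series sampled every `tr` seconds."""
        nyquist = 0.5 / tr
        b, a = butter(order, [low / nyquist, high / nyquist], btype="band")
        return filtfilt(b, a, ts, axis=0)          # forward-backward (zero-phase) filtering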

    1.5 ASL

    The ASL images were obtained using a spiral 3D FSE sequence, and the CBF maps can be calculated immediately after acquisition. Thus, the steps for CBF calculation are not included here.

    (1) Merge raw data (FSLMATHS)

    The raw ASL images acquired with the spiral 3D-FSE sequence have high SNR but poor GM/WM contrast, whereas the ASL subtraction map has good GM/WM contrast but poor SNR. To combine the advantages of these two types of images, we merged them using the following equation: merged map = (subtraction map × ASL map) / mean (all voxels in the subtraction map). The merged map was then used for the subsequent processing.
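
    The merge equation can be written directly with nibabel and NumPy, as in the sketch below; the file names are hypothetical, and the mean is taken over all voxels of the subtraction map as stated in the equation.

    import nibabel as nib

    asl_img = nib.load("asl_raw.nii.gz")           # spiral 3D-FSE ASL image (high SNR)
    sub_img = nib.load("asl_subtraction.nii.gz")   # subtraction map (good GM/WM contrast)

    asl = asl_img.get_fdata()
    sub = sub_img.get_fdata()

    # merged map = (subtraction map x ASL map) / mean(all voxels in the subtraction map)
    merged = (sub * asl) / sub.mean()

    nib.save(nib.Nifti1Image(merged, asl_img.affine, asl_img.header), "asl_merged.nii.gz")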

    (2) Skull stripping (BET)

    The non-brain tissues in the merged map were stripped using brain extraction tool (BET) in FSL.

    (3) Spatial normalization

    A two-step procedure was used to estimate the co-registration parameters between individual ASL space and MNI standard space.

    (a) Individual skull-stripped images were aligned to the corresponding T1WI structural images using the BBR algorithm implemented in FSL.

    (b) The BBR parameters were concatenated with the DARTEL deformation field (from individual space to MNI space) generated in VBM analyses.

    (c) The merged deformation field was used to register the individual CBF map into MNI space, and the registered map was resampled to an isotropic voxel size of 2 mm.

    (4) Scaling

    Earlier studies have shown that CBF values can be significantly influenced by labeling parameters and arterial circulation properties. To minimize the effect of labeling variance across subjects, the coregistered CBF map of each subject was further scaled using the demean scaling and z-score scaling methods.
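
    The sketch below illustrates the two scaling schemes on a coregistered CBF map within a brain mask; the file names are hypothetical, and “demean scaling” is interpreted here as subtracting the mean CBF over brain voxels.

    import numpy as np
    import nibabel as nib

    cbf_img = nib.load("cbf_mni.nii.gz")                       # coregistered CBF map
    mask = nib.load("brain_mask_mni.nii.gz").get_fdata() > 0   # hypothetical brain mask
    cbf = cbf_img.get_fdata()

    mean_cbf = cbf[mask].mean()
    std_cbf = cbf[mask].std()

    demeaned = np.where(mask, cbf - mean_cbf, 0.0)             # demean scaling
    zscored = np.where(mask, (cbf - mean_cbf) / std_cbf, 0.0)  # z-score scaling

    nib.save(nib.Nifti1Image(zscored, cbf_img.affine, cbf_img.header), "cbf_zscore.nii.gz")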

    (5) Smoothing

    Each scaled CBF map was spatially smoothed with a Gaussian kernel of 6 mm × 6 mm × 6 mm FWHM.

    2. Genetic data preprocessing

    Genotyping with the Illumina Infinium Asian Screening Array (ASA) generates a dataset for each qualified participant. Given the high ethnic homogeneity and adequate sample size of the CHIMGEN cohort, quality control for the genome-wide association studies (GWAS) follows procedures that are typical in the literature. Briefly, genetic quality control is performed at both the variant level and the sample level. In variant-level quality control, we first exclude duplicated markers and then remove markers with a high missing rate or departure from Hardy-Weinberg equilibrium (HWE). In sample-level quality control, we identify samples with sex mismatching, exclude poor-quality samples based on a combination of heterozygosity and missing rate, deal with related samples, and account for the effects of potential population structure using principal component analysis (PCA).

    2.1 Genotyping

    The Illumina Infinium ASA is used to identify ~750,000 (~650,000 ASA sites plus ~100,000 customized sites) genetic variants for each participant. In addition to the routine quality control schemes, we also include the following procedures:

    (1) Array selection

    The ASA is selected for genotyping because it is specially designed for East Asian populations, which is well matched with the CHIMGEN population.

    (2) Batches

    In the first stage of the CHIMGEN study, 7,000 participants are collected and genotyped in two batches (5,000 and 2,000). This strategy reduces variation arising from batch effects.

    (3) Blind test

    The blind test is designed to ensure the quality of genotyping. The blind test repetition rate should be greater than 99%. If the detection rate of a sample is less than 97%, the sample needs to be genotyped again. Additionally, all samples on any ASA chip that contains an unqualified sample are re-genotyped to ensure the call rate.

    2.2 Variant-level quality control

    The removal of poor-quality markers is critical to the success of a GWAS because such markers can produce false positives and reduce the power to identify true associations with the phenotypes of interest. The specific quality-control procedures are as follows:

    2.2.1 Deleting duplicated markers

    It is routine to include some duplicated markers on an array to ensure correct genotyping of important markers and of markers located in regions enriched with AT or GC. There are 16,601 duplicated markers in the ASA. We retain one marker for each locus by excluding the duplicate with the higher genotype missing rate across samples.

    2.2.2 Deleting markers with high missing rate

    Conventionally, markers with call rates less than 95% are removed from further analysis. We chose the same threshold as most of the literature.

    2.2.3 Deleting markers departing from HWE

    In GWAS, markers with significant deviation from Hardy-Weinberg equilibrium (HWE) should be deleted because such deviation indicates a genotype-calling error or population stratification. The statistical threshold for declaring that a SNP departs from HWE in this study is set to 10⁻⁵.
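
    For illustration, the call-rate and HWE filters of 2.2.2 and 2.2.3 can be applied in one PLINK run driven from Python, as in the sketch below; the fileset name is hypothetical, and the removal of duplicated markers (2.2.1) is assumed to have been done beforehand.

    import subprocess

    # Remove markers with call rate < 95% (--geno 0.05) or HWE p < 1e-5 (--hwe).
    subprocess.run(["plink", "--bfile", "chimgen_dedup",
                    "--geno", "0.05", "--hwe", "1e-5",
                    "--make-bed", "--out", "chimgen_variant_qc"], check=True)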

    2.3 Sample-level quality control

    The sample-level quality control of the CHIMGEN genetic data consists of four steps to identify samples with sex mismatching and samples with poor genotyping quality, to deal with related samples, and to provide the ancestral background for samples based on genetic information.

    2.3.1 Identification of samples with sex mismatching

    We used genotype data from the X chromosome to check for discordance in sex information. A mismatch between inferred and reported sex may be caused by sample mix-up, transgender individuals, or DNA contamination. Theoretically, males have one copy of the X chromosome and should be homozygous for any marker outside the pseudo-autosomal region of the X chromosome. In this study, male samples are expected to have an X-chromosome homozygosity rate of more than 0.8, and female samples are expected to have a homozygosity rate of less than 0.2. Samples with sex mismatching are excluded from the following analyses.
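
    A minimal sketch of the sex check with PLINK is given below; the fileset name is hypothetical, and the 0.2/0.8 F-statistic bounds mirror the homozygosity thresholds described above.

    import subprocess

    # Compare reported sex with the sex inferred from X-chromosome homozygosity;
    # results (including the F statistic) are written to chimgen_sexcheck.sexcheck.
    subprocess.run(["plink", "--bfile", "chimgen_variant_qc",
                    "--check-sex", "0.2", "0.8",
                    "--out", "chimgen_sexcheck"], check=True)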

    2.3.2 Identification of duplicated and related samples

    A basic assumption of standard population-based association studies is that all samples are unrelated. If duplicates and relatives are present, a bias may be introduced because the genotypes within one family will be overrepresented. Although we tried our best to exclude duplicated and related participants during recruitment, it is still possible that a few have been recruited into this study. To identify duplicated and related individuals, we calculated the identity by descent (IBD) for each pair of individuals. The extended linkage disequilibrium (LD) regions are removed entirely to obtain independent SNPs, and the remaining regions are pruned so that no pair of SNPs within a given window (100 markers) is correlated (r² < 0.2). Any pair of individuals with an IBD value greater than 0.1875 is considered duplicated or related, and the individual with the higher call rate is retained.
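
    The pruning and IBD estimation can be expressed as two PLINK runs, as in the sketch below; the fileset names, the high-LD exclusion list and the pruning step size (5) are assumptions, while the window size (100 markers), r² threshold (0.2) and PI_HAT cut-off (0.1875) follow the text.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # LD pruning: exclude extended high-LD regions, then prune so that no SNP
    # pair within a 100-marker window has r^2 >= 0.2.
    run(["plink", "--bfile", "chimgen_sample_qc",
         "--exclude", "range", "high_ld_regions.txt",      # hypothetical region list
         "--indep-pairwise", "100", "5", "0.2",
         "--out", "chimgen_pruned"])

    # Pairwise IBD on the pruned SNPs; report only pairs with PI_HAT > 0.1875.
    run(["plink", "--bfile", "chimgen_sample_qc",
         "--extract", "chimgen_pruned.prune.in",
         "--genome", "--min", "0.1875",
         "--out", "chimgen_ibd"])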

    2.3.3 Identification of outlying genotype missing and heterozygosity rates

    Extreme heterozygosity and a high missing rate are indicators of poor sample quality due to DNA contamination or other causes. The heterozygosity rate is defined as (N − O)/N, where N is the number of non-missing genotypes and O is the observed number of homozygous genotypes for a given sample. Individuals are excluded if the genotype failure rate is greater than 0.03 or the heterozygosity rate is outside ± 3 standard deviations of the mean.
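
    The heterozygosity rate (N − O)/N can be computed from PLINK's --het output and screened for ± 3 SD outliers, as in the sketch below; the fileset and output names are hypothetical, and the separate missing-rate check (genotype failure rate > 0.03) is not shown.

    import subprocess
    import pandas as pd

    subprocess.run(["plink", "--bfile", "chimgen_sample_qc",
                    "--het", "--out", "chimgen_het"], check=True)

    het = pd.read_csv("chimgen_het.het", sep=r"\s+")
    # Heterozygosity rate = (N - O) / N, with N = non-missing genotypes and
    # O = observed homozygous genotypes.
    het["HET_RATE"] = (het["N(NM)"] - het["O(HOM)"]) / het["N(NM)"]

    mean, sd = het["HET_RATE"].mean(), het["HET_RATE"].std()
    outliers = het[(het["HET_RATE"] < mean - 3 * sd) | (het["HET_RATE"] > mean + 3 * sd)]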

    2.3.4 Identification of samples with divergent ancestry

    Population stratification is a major source of bias in population-based GWAS, in which spurious associations arise from different population origins rather than from any genetic effect on the phenotypes of interest. Principal component analysis (PCA) is used to capture the population structure within the CHIMGEN cohort. Participants with significant deviation from the main population cluster are excluded, and the top 20 principal components are regressed out as confounding factors in the GWAS.
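
    A minimal sketch of the PCA step with PLINK is shown below; the fileset name is hypothetical, and the resulting top 20 components would be merged into the GWAS covariate file.

    import subprocess

    # Compute the top 20 principal components on the LD-pruned genotypes;
    # eigenvectors are written to chimgen_pca.eigenvec.
    subprocess.run(["plink", "--bfile", "chimgen_sample_qc",
                    "--extract", "chimgen_pruned.prune.in",
                    "--pca", "20",
                    "--out", "chimgen_pca"], check=True)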