Potential Biomarkers Selection for Bipolar Disorder Identification

A biomarker is a measurable indicator of the presence or severity of a disease or, more generally, of a particular physiological state of an organism. The space Decomposition-Gradient-Regression (DGR) method was developed (Li et al., 2012; Li et al., 2015) to select biomarkers for schizophrenia. This study applies the DGR approach to data from bipolar disorder patients, comprising 56 biomarkers and antibodies to 8 infectious agents. Serum specimens were collected from 132 United States military service members (118 males and 14 females) with a diagnosis of bipolar disorder from 1992 to 2005 and from their matched healthy controls. Trefoil Factor 3 (TFF3), Gliadin, Prolactin (PRL), Apolipoprotein A-II (Apo A-II) and Immunoglobulin A (IgA) were found to be significant predictors of Bipolar Disorder (BD) in males. Macrophage-Derived Chemokine (MDC), Alpha-1-Antitrypsin (AAT), Gliadin, Beta-2-Microglobulin (B2M) and Monocyte Chemotactic Protein 2 (MCP-2) might be used to identify bipolar disorder in females. A predictive biomarker panel for BD offers the potential to aid diagnosis, initiate treatment earlier and ideally alter the course of disease with reduced morbidity and functional impairment.

Research in Health Science, Vol. 2, No. 3, 2017. Published by SCHOLINK INC. www.scholink.org/ojs/index.php/rhs


Introduction
Bipolar Disorder (BD) is a mental disorder characterized by periods of elevated mood and periods of depression (Anderson et al., 2012; American Psychiatric Association, 2013). During a period of mania the patient feels or acts abnormally happy, energetic, or irritable (Anderson et al., 2012). Patients often make poor decisions without considering the consequences, and the need for sleep is usually reduced (American Psychiatric Association, 2013). The causes are not clearly understood, but both genetic and environmental factors play a role (Anderson et al., 2012). Changes in many biomarkers or genes, each with a small effect, may contribute to the risk of BD. About 3% of people in the US have bipolar disorder at some point in their life (Schmitt et al., 2014). Rates appear to be similar in males and females (Diflorio & Jones, 2010). The World Economic Forum report estimates the global cost of mental illness at nearly $2.5T (two-thirds in indirect costs) in 2010, with a projected increase to over $6T by 2030 (Bloom et al., 2011).
A biomarker is defined as "a characteristic that is objectively measured and evaluated as an indicator of normal biologic or pathogenic processes, or pharmacologic responses to a therapeutic intervention" (Atkinson et al., 2001). Such characteristics can be used to categorize disease risk in a population or as an adjunct to diagnosis. In oncology, for example, (a) prognostic biomarkers classify cancer patients into subgroups with distinct expected risks, but they do not inform the choice of therapy; while (b) predictive biomarkers can identify cancer patients whose tumors are likely to have therapeutic sensitivity or resistance based upon marker status (Simon & Altman, 1994; Sargent et al., 2005). The identification of biomarkers, which often have a weak relationship individually, may help to develop new diagnostic tests for early identification and treatment. Successful assay verification and biological validation of such biomarkers and selection of high risk populations can be of benefit to both patients and society.
The precise etiology of BD remains uncertain and is most likely multifactorial and complex. BD is often misdiagnosed or diagnosed late in the course of the disease, leading to adverse social and medical consequences (Houenou et al., 2011). Multiple studies examining neuroimaging, peripheral markers, and genetics have provided important insights into the underlying pathophysiologic processes. While there is a large body of research examining various factors associated with BD, some of these results are inconsistent. Due to the variety in clinical presentations and course of BD, the etiology of BD is highly unlikely to be limited to a single risk factor and likely includes interactions of genetic, epigenetic, and environmental factors. Occasionally, families may exist in which a single biomarker plays the major role in determining susceptibility, but the majority of BD involves the interaction of multiple biomarkers or more complex genetic mechanisms (Craddock & Jones, 1999).
In other areas of medicine, validated biomarkers now inform clinical decision-making (Frey et al., 2013). For BD, the complexity of the task is compounded by the heterogeneity of the disorder, which is reflected in the broad variety of its clinical presentations, some of which could be a result of different etiopathogenic pathways (Washington University, 2014). Many single biomarkers might have a very small effect size in statistical models, but together they may have a considerably larger effect. We therefore need an approach that can identify, measure, and analyze a combination of multiple potential predictors. The detection of multiple biomarkers with small individual statistical effects requires large sample sizes, a large number of measured biomarkers, and appropriate statistical approaches to ensure that valuable information is not lost.
In our earlier work on biomarkers for schizophrenia (Li et al., 2015) it was noted that regression of high dimensional data is difficult. When the sample size is small, traditional regression methods, such as the Ordinary Least Squares (OLS) approach, perform poorly (Tibshirani, 1996). When a large number of biomarkers is used, some predictors are often highly correlated (multicollinearity), and multiple regression may then show erratic changes in the estimated effects of individual biomarkers, with large standard errors of the coefficient estimates, in response to small changes in the model. A high degree of multicollinearity may also lead to either software failure in matrix inversion or inaccurate results. As a result, the selection of biomarkers is difficult and the estimated effects of the predictor variables are expected to be biased. If there is a group of variables among which the correlations are very high, then most regression approaches, including OLS and Least Absolute Shrinkage and Selection Operator (LASSO) regression, tend to select only one variable from the group. The ideal biomarker selection method should be able to do two things: (1) eliminate the trivial biomarkers, and (2) include whole groups of biomarkers in the model once one of them is selected (Zou & Hastie, 2005).
Principal Component Analysis (PCA) and ridge regression (Hastie et al., 2001) are commonly used to address collinearity and the associated bias in traditional regression methods. PCA is also a common method for reducing the number of predictive variables, but it does not use information from the dependent variable when constructing its linear combinations. The first principal component is often not the linear combination of the input variables that is most significantly associated with the dependent variable of disease state (Johnson & Wichern, 1982; Bair et al., 2006).
Neither does PCA guarantee that only a few principal components can fit the model well. For Hald's data (Chatterjee & Hadi, 2012) with four predictors, the fourth eigenvector, which has the smallest eigenvalue, was the only significant predictor. Ridge regression has the same difficulty as OLS when the sample size is small.
We applied a Decomposition-Gradient-Regression (DGR) method with the goal of finding a method that works as well as classical regression when its assumptions are valid, and that works better than the classical approach when multicollinearity exists (Li et al., 2015). DGR eliminates the trivial predictors and automatically includes whole groups of highly correlated predictors in the model once one predictor among them is selected. In this study, we use DGR on bipolar biomarker data to identify individual biomarkers and groups of biomarkers associated with the risk of BD diagnosis. Li et al. have used US military data and simulation data with a binary outcome Y and 100 predictors to examine the effects of the gradient and the orthogonal scores (Li et al., 2012; Li et al., 2015). The results from both kinds of data showed that no score other than the gradient score had a significant effect in distinguishing the binary outcome. The gradient score contained nearly all the information from all biomarkers in the gradient.
Similar analyses were also performed for the bipolar data, with similar findings. Therefore only gradient scores are used in the reduction process for DGR in this study. The purpose of this study was to select biomarkers that are predictive of a diagnosis of BD.

Data
Demographic and clinical data for US military service members who received medical discharges with a diagnosis of BD from 1992 to 2006 were obtained from the US Army Physical Disability Agency, the Secretary of the Navy Council of Review Boards, and the US Air Force Personnel Center, Physical Disability Division (Niebuhr et al., 2011). Those aged 18 and older who were on active duty at the time of their bipolar diagnosis, and who had at least one serum sample of 0.5 ml or greater stored in the Department of Defense Serum Repository (DoDSR) obtained prior to diagnosis, were selected as potential study cases. The time of BD onset was estimated as the earlier of the date of the first hospitalization with a psychiatric disorder under International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) codes 290-319, or the date the medical or physical evaluation board reviews were initiated. Control subjects were selected from the active duty US military service population who were 18 or older, had no inpatient or outpatient mental health diagnoses, and had at least one serum sample of 0.5 ml or greater stored in the DoDSR. All control subjects were matched to their cases on sex, race, branch of military service, date of birth (+/-12 months), and date of military enlistment (+/-12 months). [...] Baltimore, MD, USA for laboratory testing.

Statistical Analysis
To avoid collinearity bias, we first separated the correlated biomarkers into different groups, which we called subspaces, based on the linear correlations among biomarkers. Second, we found the gradient direction, which is the normal vector of the hyperplane in each subspace that best separates the cases and controls within that subspace (Li et al., 2012). The gradient score is the linear combination of the standardized values of the biomarkers used in each subspace. Scores were generated for the gradient and for its perpendicular vectors in each subspace. The gradient score and the other significant vector scores from each subspace were used as factors in the statistical modeling. Third, we eliminated the biomarkers with the weakest effect by backward elimination, examining the coefficients of the gradient and the effect on the gradient score model in each subspace. The regression model was then applied to the gradient scores to select biomarkers.
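The construction of a gradient score can be illustrated with a minimal Python sketch. It standardizes each biomarker, picks a unit direction, and projects each subject onto it. The direction used here (the normalized difference of the case and control means) is only a simple stand-in chosen for this sketch, not the DGR gradient itself, and the function names and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column (biomarker); assumes no column is constant."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def unit_direction(z_rows, labels):
    """ASSUMPTION: a proxy for the gradient, not the authors' estimator.

    Returns the case-minus-control mean difference scaled to unit length.
    """
    cases = [r for r, y in zip(z_rows, labels) if y == 1]
    ctrls = [r for r, y in zip(z_rows, labels) if y == 0]
    diff = [mean(c) - mean(d) for c, d in zip(zip(*cases), zip(*ctrls))]
    norm = sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def gradient_scores(z_rows, direction):
    """Linear combination of standardized biomarker values per subject."""
    return [sum(v * w for v, w in zip(row, direction)) for row in z_rows]


# Toy data (hypothetical): three biomarkers, three cases then three controls.
rows = [
    [5.0, 1.0, 3.0], [6.0, 2.0, 3.1], [7.0, 1.5, 2.9],
    [2.0, 1.2, 3.0], [1.0, 1.8, 3.2], [3.0, 1.5, 2.8],
]
labels = [1, 1, 1, 0, 0, 0]

z = standardize(rows)
g = unit_direction(z, labels)
scores = gradient_scores(z, g)
```

In this toy example only the first biomarker differs between the groups, so nearly all of the weight in the unit direction falls on it, and the resulting score is higher for cases than for controls.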
Given that multiple serum samples were collected for each subject at different times prior to diagnosis, the Generalized Estimating Equation (GEE) approach was used to estimate the unknown parameters (Liang & Zeger, 1986). Odds ratios with 95% confidence intervals or p-values were reported using the Bonferroni correction. The degrees of freedom of the Wald chi-square value of the gradient score were adjusted by the number of biomarkers used in the gradient vector. The coefficient of a biomarker in the gradient vector describes its contribution to distinguishing bipolar cases from controls. If the coefficient of a given biomarker is near zero, the biomarker has essentially no effect on the identification of bipolar diagnosis and can be eliminated from the gradient without loss of information by a backward elimination process.
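The two adjustments can be illustrated with a short Python sketch. It assumes that adjusting the degrees of freedom means evaluating the Wald statistic against a chi-square distribution with k degrees of freedom, where k is the number of biomarkers in the gradient, and that the Bonferroni threshold is 0.05/k. Both readings, the helper names, and the example values are assumptions of this sketch.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square with integer df >= 1.

    Uses the recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with t = x / 2.
    """
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-t)          # Q(1, t)
    else:
        a, q = 0.5, erfc(sqrt(t))    # Q(1/2, t)
    while a < df / 2.0:
        q += t ** a * exp(-t) / gamma(a + 1.0)
        a += 1.0
    return q


def bonferroni_threshold(alpha, k):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / k


# Hypothetical Wald statistic for a gradient score built from k = 9 biomarkers.
wald = 20.0
k = 9
p_unadjusted = chi2_sf(wald, 1)   # df = 1, as for a single predictor
p_adjusted = chi2_sf(wald, k)     # ASSUMPTION: df set to k
significant = p_adjusted < bonferroni_threshold(0.05, k)
```

Under these assumptions the adjusted p-value is always at least as large as the unadjusted one, so the adjustment makes the test on the gradient score more conservative as more biomarkers are included.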
Two approaches were used to choose the number of biomarkers to be selected. The first approach used simulation to construct surface figures from the sensitivity estimates generated with regression modeling and the numbers of biomarkers used in subspaces A and B. For each simulation, two-thirds of the subjects were randomly assigned to the training dataset and used to fit the model, and the remaining one-third were assigned to the testing set and used to verify the model. The second approach used the Akaike Information Criterion, AIC = 2p - 2 log L, which is a measure of the quality of fit of a statistical model for a given set of data, where p is the number of parameters in the regression and L is the likelihood. For longitudinal GEE regression, the Quasi-Likelihood Information Criterion (QIC), which replaces the likelihood in AIC with the quasi-likelihood, is commonly used (Pan, 2001). When the gradient score is used in the model, only partial information from each individual biomarker is used, so we modified the AIC to the Modified Akaike Information Criterion, MAIC = k - 2 log L, where k is the number of biomarkers used in the gradient score, rather than using AIC = 2 - 2 log L for the gradient alone. When logistic regression was performed with gradient scores, the individual biomarker effect was estimated by multiplying the gradient score effect by the percent contribution of that biomarker to the gradient, which is the square of its coefficient in the unit gradient vector.
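The two criteria and the percent contribution can be written out in a few lines of Python. This sketch follows the formulas as stated in the text. How the contribution is applied to the gradient effect (on the log-odds scale, carrying the sign of the coefficient) is an assumption made here for illustration, and all numbers are hypothetical.

```python
from math import exp, log


def aic(log_lik, p):
    """AIC = 2p - 2 log L, with p regression parameters."""
    return 2 * p - 2 * log_lik


def maic(log_lik, k):
    """MAIC = k - 2 log L, with k biomarkers in the gradient score."""
    return k - 2 * log_lik


def percent_contribution(unit_gradient):
    """Squared coefficients of a unit vector; they sum to 1."""
    return [c * c for c in unit_gradient]


def individual_odds_ratios(gradient_or, unit_gradient):
    """ASSUMPTION: share applied to the log odds ratio, signed by the coefficient."""
    beta = log(gradient_or)
    return [exp(beta * c * abs(c)) for c in unit_gradient]


# Hypothetical unit gradient for three biomarkers and a gradient OR of 2.0.
unit_gradient = [0.6, -0.8, 0.0]
shares = percent_contribution(unit_gradient)       # 0.36, 0.64, 0.0
ors = individual_odds_ratios(2.0, unit_gradient)

# Choosing k by minimizing MAIC over hypothetical log-likelihoods.
log_liks = {3: -60.0, 5: -55.0, 9: -50.0, 12: -49.5}
best_k = min(log_liks, key=lambda k: maic(log_liks[k], k))
```

A biomarker with a zero coefficient contributes nothing and has an odds ratio of exactly 1, which matches the rationale for eliminating near-zero coefficients from the gradient.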

Results
The 64 biomarkers and antibodies to infectious agents used in this study are listed in Table 1. Because cases and controls were matched, only the subject and sample distributions of the cases are shown (Table 2). Fourteen (about 11%) of the 132 bipolar patients were female. The literature has shown inconsistency of biomarker effects based on sex (Pruijm et al., 2013); therefore, we focused the analyses on males. A similar decomposition, gradient and reduction analysis for females is reported for simple comparison and discussion.

Biomarker Selection for Males
Among the 64 biomarkers and antibody agents, one pair (Myeloperoxidase (MPO) and Neutrophil Gelatinase-Associated Lipocalin (NGAL)) was highly correlated [...]. The effect of gradient C and the effects of its individual biomarkers were approximately zero. The gradient scores were highly significant in both subspaces A and B after Bonferroni correction (adjusted p<0.05/k_A and p<0.05/k_B respectively, where k_A and k_B were the numbers of biomarkers used in the gradients in subspaces A and B). Using the backward elimination approach for the data on males, the biomarkers with a coefficient nearest to zero in the gradient vector were eliminated one by one. The average sensitivity for BD status over 100 simulations was used to draw surface graphs against the number of biomarkers used in subspaces A and B (Figures 1 and 2 respectively).
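The one-by-one elimination can be sketched generically in Python. The function below refits after every removal through a caller-supplied fit function and drops the biomarker whose coefficient is nearest zero. The marker names m1 to m5, their weights, and the toy fit function are hypothetical placeholders, not values from this study.

```python
from math import sqrt


def backward_eliminate(names, fit, keep):
    """Drop the biomarker with the coefficient nearest zero until `keep` remain.

    `fit` is any callable that maps a list of biomarker names to a dict of
    unit-gradient coefficients; it is re-run after every removal.
    Returns the surviving names and the order in which names were removed.
    """
    current = list(names)
    removed = []
    while len(current) > keep:
        coefs = fit(current)
        weakest = min(current, key=lambda n: abs(coefs[n]))
        current.remove(weakest)
        removed.append(weakest)
    return current, removed


# Hypothetical raw weights standing in for a refitted gradient.
RAW = {"m1": 0.90, "m2": -0.05, "m3": 0.40, "m4": 0.01, "m5": -0.60}


def toy_fit(subset):
    """Placeholder fit: rescale the fixed weights of the subset to unit length."""
    norm = sqrt(sum(RAW[n] ** 2 for n in subset))
    return {n: RAW[n] / norm for n in subset}


kept, dropped = backward_eliminate(list(RAW), toy_fit, keep=3)
```

In a real analysis the fit function would re-estimate the gradient on the remaining biomarkers at each step, so the elimination order can differ from a simple ranking of the initial coefficients.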
For subspace A, the sensitivity increased with the number of biomarkers up to a peak at 10 biomarkers for the training set and then decreased. A similar pattern was observed for the testing set. Considering the sensitivities for both the training and testing sets, between 8 and 10 biomarkers should be selected from subspace A. The effect of the number of biomarkers used in subspace B was limited, and three biomarkers were selected. Further increases in the number of biomarkers did not yield any improvement in sensitivity for the training set, but lowered the sensitivity for the testing set. The MAIC curve against the number of biomarkers in subspace A is shown in Figure 3; it was minimized at k=9. The MAIC for subspace B (Figure 4) was minimized at k=3. The two approaches provided nearly the same results, nine biomarkers from subspace A and three biomarkers from subspace B, and these biomarkers were selected for further analyses.
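The training and testing sensitivities averaged over repeated random splits can be sketched as follows. This is an illustration only: a simple cutoff on a single score (the midpoint of the training class means) stands in for the regression model, the split is stratified by case status so that every split contains cases, and the data are simulated. All of these are assumptions of the sketch.

```python
import random
from statistics import mean


def sensitivity(y_true, y_pred):
    """True positives divided by all true cases."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def stratified_split(labels, rng):
    """Two-thirds training, one-third testing, split within each class."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = (2 * len(idx)) // 3
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def average_sensitivity(scores, labels, n_sim=100, seed=0):
    """Mean training and testing sensitivity over n_sim random splits.

    ASSUMPTION: higher scores indicate cases, and the classifier is a
    cutoff at the midpoint of the training class means.
    """
    rng = random.Random(seed)
    train_sens, test_sens = [], []
    for _ in range(n_sim):
        train, test = stratified_split(labels, rng)
        case_mean = mean(scores[i] for i in train if labels[i] == 1)
        ctrl_mean = mean(scores[i] for i in train if labels[i] == 0)
        cutoff = (case_mean + ctrl_mean) / 2.0
        for idx, out in ((train, train_sens), (test, test_sens)):
            y = [labels[i] for i in idx]
            pred = [1 if scores[i] > cutoff else 0 for i in idx]
            out.append(sensitivity(y, pred))
    return mean(train_sens), mean(test_sens)


# Simulated scores (hypothetical): 30 cases and 30 controls.
data_rng = random.Random(1)
labels = [1] * 30 + [0] * 30
scores = ([data_rng.gauss(1.0, 1.0) for _ in range(30)]
          + [data_rng.gauss(-1.0, 1.0) for _ in range(30)])
train_avg, test_avg = average_sensitivity(scores, labels)
```

Repeating this calculation for each candidate number of biomarkers in each subspace would give the grid of average sensitivities from which a surface graph can be drawn.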

The Effects of Selected Biomarkers among Males
To test the effect of the selected biomarkers, gradient models with (5, 3), (9, 3), and (9, 5) biomarkers in subspaces A and B, respectively, were used. The first number is the number of biomarkers used to generate the gradient score in subspace A, and the second is the number used to generate the gradient score in subspace B. Both three and five biomarkers in subspace B were used to check the robustness of the model. The gradient scores are the predictors in the logistic model. The Wald chi-square test, a conservative approach, was used to test significance (Agresti, 2012). The adjusted chi-square p-value for each gradient score was based on the number of biomarkers used. The results (Table 3) were as follows:

1. The effects of both gradient scores are significant.

2. The OR of the gradient in either subspace increased only slightly as the number of biomarkers used increased. For example, the OR of gradient A was 2.16 in Model (5, 3) and 2.60 in Model (9, 3), suggesting only a minor contribution from the four additional biomarkers.

3. The effect of the gradient score from subspace A changed little when the number of biomarkers used in subspace B changed. For example, the OR of gradient A was 2.60 in Model (9, 3) with three biomarkers used in B and 2.64 in Model (9, 5) with five biomarkers used in B. Similar behavior was observed for the gradient of subspace B: the OR of gradient B was 1.70 in Model (5, 3) and 1.75 in Model (9, 3), suggesting that the gradient scores from different subspaces were almost independent. No multicollinearity was found when performing the regression with gradient scores.

The Sensitivity of Biomarker Selection among Males
To check the sensitivity of the biomarker selection, 100 males were randomly selected from the 118 males with bipolar disorder, and 11 females were randomly selected from the 14 females with bipolar disorder. The selection was repeated 100 times to derive 100 data sets with their matched controls. The backward gradient elimination process was then performed for each sample. The biomarkers with the highest selection frequency and their average coefficients in the gradient vector are listed in Table 5. Trefoil Factor 3 (TFF3), gliadin and vaccinia were selected over 98 times. Comparing the results in Tables 4 and 5 for males, the selected biomarkers are the same and the coefficients are consistent. The top three significant biomarkers in subspace A were selected in over 99% of the samples, and the two significant biomarkers in subspace B were selected in over 91%. For males, the standard deviations (std) of the coefficients in the gradient were between one tenth and one fifth of the coefficients or less.
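The resampling check can be sketched in Python as a count of how often each biomarker survives selection across repeated random subsamples. The selection step is passed in as a function. The toy selector and the marker names m1 and m2 below are hypothetical placeholders for the backward gradient elimination and are not results from this study.

```python
import random
from collections import Counter


def selection_frequency(subject_ids, select, sample_size, n_repeats=100, seed=0):
    """Fraction of random subsamples in which each biomarker is selected.

    `select` is any callable that takes a list of subject ids and returns
    the names of the biomarkers selected for that subsample.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        sample = rng.sample(subject_ids, sample_size)
        counts.update(set(select(sample)))
    return {name: c / n_repeats for name, c in counts.items()}


def toy_select(sample):
    """Placeholder selector: m1 is always chosen, m2 only if subject 0 is sampled."""
    chosen = ["m1"]
    if 0 in sample:
        chosen.append("m2")
    return chosen


subjects = list(range(118))   # 118 male cases, as in the text
freq = selection_frequency(subjects, toy_select, sample_size=100)
```

A biomarker whose selection depends on a few subjects shows a frequency below 1 under this scheme, which is the kind of instability the resampling is meant to reveal.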

Biomarker Selection among Females
The number of female cases was limited to 14, which is too few to support reliable regression analysis; hence only Decomposition-Gradient-Reduction was used. No solid conclusion about biomarker effects on BD can be drawn from such a small sample. Among the 100 simulations, Table 5 shows that the top three biomarkers in subspace A and the top two biomarkers in subspace B were selected in over 60% of the samples. The standard deviation was less than one third of the coefficient.

Macrophage-Derived Chemokine (MDC), Alpha-1-Antitrypsin (AAT) and gliadin in subspace A, and Beta-2-Microglobulin (B2M) and Monocyte Chemotactic Protein 2 (MCP-2) in subspace B, were selected. Using DGR, as we did for males, all of these biomarkers were significantly associated with identifying BD cases.

Discussion
In this military sample, among the 64 biomarkers and antibodies to infectious agents examined, TFF3, Gliadin, PRL, Apo A-II and IgA showed a significant relationship to BD diagnostic status in males, and MDC, AAT, Gliadin, B2M and MCP-2 are suggested as candidates to identify BD in females. There are similarities between this study and other research. MDC was found previously to be associated with a diagnosis of schizophrenia (Pruijm et al., 2013). AAT was found to be increased in patients with anxiety disorders and bipolar I and bipolar II disorders (Sachar et al., 1973; Levin et al., 2010). Macrophage migration inhibitory factor (MIF) has been found to be increased with anxiety and depression-like behaviors and with impaired hippocampus-dependent memory (Conboy et al., 2011; Schmechel et al., 2012). An association between BD and B2M has been observed (Musil et al., 2011). Patients with anxiety, depression and schizophrenia have also been found to have high levels of B2M (Rybokowski et al., 2013).
Many studies have found a relationship between cancer and TFF3, but there are no prior studies directly assessing the relationship between BD and TFF3. Cerebral TFF3 has been reported to be involved in several processes such as fear, depression, learning and object recognition, and opiate addiction (Bernstein et al., 2015). This study found that an increase of one standard deviation in TFF3 was associated with a 14% increase in the risk of BD. Several studies have demonstrated that PRL was increased among females with BD (Sachar et al., 1973; Sher et al., 2003; Schmidt et al., 2013) [...] (Paing et al., 2011).
We found that prolactin is inversely related (OR=0.83) to BD in males, which is the opposite of the effect we found in a schizophrenia study (Ramsey et al., 2013). Further study of the sex-specific effect of PRL on BD is needed.
Few studies regarding the relationship between Apo A-II and BD were found in the literature. There is considerable similarity across the apolipoprotein family, and previous studies on Apo A-I have noted that the Apo A-I level is increased in BD patients, which is consistent with prior findings by Herberth (Herberth et al., 2011; Sussulini et al., 2011; Pruijm et al., 2013). Furthermore, we found that Apo A-II was a significant risk factor for BD (OR=1.2 per one standard deviation increase).
We report a novel three-step biomarker selection process to identify bipolar cases. The first step involved decomposition of the sample space by examining the dependency among the 64 biomarkers, which were separated into three subspaces to avoid collinearity. The second step involved biomarker selection using the gradient in each subspace as a parameter in the regression, so that important biomarkers were selected without losing information. The third step identified the effects of the biomarkers in combination and individually.
Compared with other approaches, such as Classification and Regression Trees and LASSO, the advantage of DGR is that the magnitude and direction of the individual biomarker effects can be estimated (Li et al., 2015) with the association expressed as an odds ratio which is more easily understood as an estimate of risk.
We used a novel approach to identify a potential panel of biomarkers that are strongly associated (OR over 2) with the diagnosis of BD. The reliable identification of biomarkers from high dimensional data is a key discipline in modern pharmaceutical and biotechnical research. Once these biomarkers have been found and validated, they can be used to identify patients at either high or low risk of BD diagnosis, potentially earlier than is possible by relying on clinical diagnostic criteria. Selection of predictive biomarkers by DGR has a number of epidemiological and statistical analytic advantages: 1) the risk of over-fitting is reduced, which improves the predictive accuracy; 2) the number of parameters is reduced, which decreases the sample collection costs; 3) models based on fewer factors are often easier to interpret; and 4) both joint and individual biomarker effects can be estimated.
This hypothesis-generating study must be validated in other populations, as the methods of specimen collection, storage and testing of serum specimens may vary, the underlying populations of cases and controls may differ, and some significant biomarkers in our study might have been selected by chance.
In addition, the biological mechanisms of the biomarkers in the pathophysiology of BD should be studied to avoid selection bias in the final predictive model. A predictive biomarker panel for BD offers the potential for earlier diagnosis and initiation of treatment for BD with the long term goal of reduced morbidity and functional impairment in BD patients.