
When using this data, please kindly reference: Hidalgo CA, Blumm N, Barabasi A-L, Christakis NA. PLoS Computational Biology, 5(4):e1000353 doi:10.1371/journal.pcbi.1000353
This data is free for academic use only. If you are interested in using this data for non-academic purposes please contact Cesar A. Hidalgo at cesar_hidalgo at ksg.harvard.edu


DATASET DESCRIPTION

All files are tab-delimited text files.

File Structure

Column  Description
1       ICD-9 code disease 1
2       ICD-9 code disease 2
3       Prevalence disease 1
4       Prevalence disease 2
5       Co-occurrence between diseases 1 and 2
6       Relative Risk
7       Relative Risk 99% Conf. Interval (left)
8       Relative Risk 99% Conf. Interval (right)
9       Phi-correlation
10      t-test value
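
The ten-column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names ComorbidityRow and read_comorbidity_rows are hypothetical, and the handling of header or malformed lines (they are skipped) is an assumption, since this page does not say whether the files contain a header row.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class ComorbidityRow(NamedTuple):
    """One disease pair, following the 10-column file structure (hypothetical names)."""
    icd9_1: str
    icd9_2: str
    prevalence_1: int
    prevalence_2: int
    co_occurrence: int
    relative_risk: float
    rr_ci_low: float
    rr_ci_high: float
    phi: float
    t_value: float


def read_comorbidity_rows(lines: Iterable[str]) -> Iterator[ComorbidityRow]:
    """Parse tab-delimited lines; skip lines that do not hold 10 parseable fields."""
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != 10:
            continue  # assumption: anything else is a header, blank, or damaged line
        try:
            yield ComorbidityRow(
                fields[0].strip(),
                fields[1].strip(),
                int(float(fields[2])),  # tolerate counts written as "1000.0"
                int(float(fields[3])),
                int(float(fields[4])),
                *(float(x) for x in fields[5:10]),
            )
        except ValueError:
            continue  # assumption: non-numeric fields mean a header row


# Made-up sample line, for illustration only (not taken from the data set).
sample = ["401\t250\t1000\t500\t60\t1.56\t1.2\t2.0\t0.01\t3.4\n"]
rows = list(read_comorbidity_rows(sample))
```

To read a downloaded file, pass an open file object instead of the sample list, for example `open(path, newline="")`, where `path` is wherever the file was saved.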

 

 

DATA DESCRIPTION

Source Data and Study Population:

Hospital claims offer reliable, systematic, and complete data for disease detection [1,2,3]. Each record of our original dataset consists of the date of visit, a primary diagnosis, and up to 9 secondary diagnoses, all specified by ICD-9 codes of up to 5 digits. The first three digits specify the main disease category, while the last two provide additional information about the disease. In total, the ICD-9-CM classification consists of 657 different categories at the 3-digit level and 16,459 categories at the 5-digit level. For a detailed list of currently used ICD-9 codes see www.icd9data.com. We compiled raw Medicare claims [4,5] based on MedPAR records for 1990-1993, which contain information on 96% of elderly Americans, whether they use health care or not [6].
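
As a small illustration of the code structure described above, the sketch below reduces a code of up to 5 digits to its 3-digit category. It assumes codes are stored as numeric strings that keep their leading zeros, optionally with a decimal point (for example "250.01" or "25001"). The function name icd9_category is hypothetical, and supplementary V and E codes are rejected rather than guessed at.

```python
def icd9_category(code: str) -> str:
    """Return the 3-digit category of a numeric ICD-9 code given as a string.

    Assumption: the code is 3 to 5 digits, with an optional decimal point
    after the third digit. Anything else (including V and E codes) raises.
    """
    digits = code.strip().replace(".", "")
    if not digits.isdigit() or not 3 <= len(digits) <= 5:
        raise ValueError(f"unsupported ICD-9 code format: {code!r}")
    return digits[:3]


print(icd9_category("250.01"))  # 250
print(icd9_category("40291"))   # 402
```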

For the 32 million elderly Americans aged 65 or older enrolled in Medicare and alive for the entire study period, there were a total of 32,341,347 inpatient claims, pertaining to 13,039,018 individuals (the remaining individuals were not hospitalized at any point during this period). Demographically, our data set consists of patients over 65 years old (see Fig 1 for the age distribution) and is composed mainly of white patients, with a higher percentage of females (Fig 2). Yet, the data set is large enough to estimate race and gender specific comorbidity patterns.

Data Limitations

The medical claims were made available to us in the ICD-9-CM format, a controlled nomenclature constructed mainly for insurance claim purposes. Therefore, in some cases more than one code corresponds to a particular disease, whereas in other cases codes are not specific enough for research purposes. For example, at the 5-digit level there are 33 diagnoses associated with hypertension, which reduce to five at the 3-digit level. The vast majority of diseases, however, can be unambiguously assigned to an ICD-9 code.

While hospital claims have been proposed as a reliable method for disease detection [7,8,9], our data does not capture a complete cross section of the population. The dataset consists of medical claims associated with hospitalizations of elderly citizens in the United States. It therefore contains limited information about diseases that are not common among elders from an industrialized country, such as many infectious diseases or pregnancy-related conditions, and no information on patients who were not hospitalized. It does, however, contain a wealth of information about different types of heart diseases and cancers, which are highly prevalent among elderly patients and are of major interest to the medical community.

We distinguish four main groups in the data set, defined by sex and race (M = males, F = females, W = white, B = black).

Number of Patients per Demographic Group

Group        M                   F                   M+F
W            4910362 (37.66%)    6835054 (52.42%)    11745416 (90.08%)
B            386663 (2.97%)      596432 (4.57%)      983095 (7.54%)
B+W          5297025 (40.62%)    7431486 (56.99%)    12728511 (97.62%)
B+W+Other    5440490 (41.72%)    7598529 (58.28%)
Other (Hispanic+Asian+Native American+Other)         310507 (2.38%)
Total                                                13039018 (100%)

REFERENCES:

1 Zhang J, Iwashyna TJ & Christakis NA. (1999) The Performance of Different Lookback Periods and Sources of Information for Charlson Comorbidity Adjustment in Medicare Claims. Medical Care 37: 1128-1139.

2 Cooper GS et al. (1999) The sensitivity of Medicare claims data for case ascertainment of six common cancers. Medical Care 37: 436-44.

3 Benesch C et al. (1997) Inaccuracy of the International Classification of Disease (ICD-9-CM) in identifying the diagnosis of ischemic cerebrovascular disease. Neurology 49: 660-664.

4 Lauderdale D, Furner SE, Miles TP & Goldberg J. (1993) Epidemiological uses of Medicare data. American Journal of Epidemiology 15: 319-327.

5 Mitchell JB et al. (1994) Using Medicare claims for outcomes research. Medical Care 32: S38-JS51.

6 Hatten J. (1980) Medicare's Common Denominator: The Covered Population. Health Care Financing Review 2, 53-64.

7 Zhang J, Iwashyna TJ & Christakis NA. (1999) The Performance of Different Lookback Periods and Sources of Information for Charlson Comorbidity Adjustment in Medicare Claims. Medical Care 37: 1128-1139.

8 Cooper GS et al. (1999) The sensitivity of Medicare claims data for case ascertainment of six common cancers. Medical Care 37: 436-44.

9 Benesch C et al. (1997) Inaccuracy of the International Classification of Disease (ICD-9-CM) in identifying the diagnosis of ischemic cerebrovascular disease. Neurology 49: 660-664.

 

 


FIGURES

Fig 1. Age distribution for the study population.

 

Fig 2. Demographic breakdown of the study population.

 

DISEASE COMORBIDITY:

To measure relatedness starting from disease co-occurrence, we need to quantify the strength of comorbidities by introducing a notion of “distance” between two diseases. A difficulty of this approach is that different statistical distance measures have biases that over- or under-estimate the relationships between rare or prevalent diseases. These biases are important given that the number of times a particular disease is diagnosed (prevalence) follows a heavy tailed distribution (Fig 3), meaning that while most diseases are rarely diagnosed, a few diseases have been diagnosed in a large fraction of the population. Hence, quantifying comorbidity often requires us to compare diseases affecting a few dozen patients with diseases affecting millions.

We will use two comorbidity measures to quantify the distance between two diseases: the φ-correlation (Pearson’s correlation for binary variables; Fig 4) and the relative risk (RR; Fig 4), both described in Box 1. Given a fixed prevalence for diseases i and j, as the overlap between them decreases, so do both φij and RRij. The two comorbidity measures are not completely independent of each other, and both have intrinsic biases. For example, RR overestimates relationships among rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas φ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases. Given the complementary biases of the two measures, we constructed Phenotypic Disease Networks separately for each measure and discuss their respective relevance to specific disease groups.

One crucial question is: how does the predictive power of comorbidity-based relationships compare with that of heredity and known genetic markers? Of the two measures discussed in Box 1, relative risk (RR) enjoys the most widespread use in the medical literature [1-10], making it the most suitable for such a comparison. We find that, in our data, the relative risk of being diagnosed with one disease given that the patient has another disease varies in the range RR ~0.25-16 (Fig 4). Sibling studies have found that the relative risk of having a disease given that a sibling has the same disease typically ranges from RR ~3 for type 2 diabetes [1] to RR ~2-7 for early myocardial infarction [2], ~7-10 for bipolar disorder [3,4] and rheumatoid arthritis [5], and ~17-35 for Crohn’s disease [6]. Most of these values fall in the range of relative risks associated with our observed comorbidities. Hence, statistically speaking, the magnitude of the disease risk predicted by comorbidity relationships is comparable to that of family history.

Furthermore, we can compare comorbidity statistics with typical relative risk values found in genetic susceptibility studies. For example, the relative risk of type 2 diabetes for carriers of the at-risk allele TCF7L2 ranges between RR ~1.45 and 2.41 [7], whereas the rs2476601 SNP in the PTPN22 gene confers a genetic relative risk for rheumatoid arthritis of RR ~1.8 [8,9]. In contrast, the relative risk of type 2 diabetes for a patient diagnosed with ischemic heart disease is RR ~1.61, whereas a patient diagnosed with osteoporosis is at RR ~3.64 for rheumatoid arthritis [10]. The statistical strength of the observed comorbidities is therefore comparable to that found in sibling and genetic susceptibility studies, a favorable comparison that provides further motivation to use comorbidity data to explore disease risk.

Box 1: Quantifying Disease Associations

We use two different ways to measure the strength of comorbidity associations. We denote by Cij the number of patients diagnosed with both diseases i and j, by N the total number of patients in the population, and by Pi the number of patients diagnosed with disease i.

φ-correlation

Pearson’s correlation for binary variables [11]

Pros: Good at discerning associations between diseases with similar prevalence.
Cons: Values are always low for diseases with extremely different prevalence.
Values:
φij > 0: comorbidity is larger than expected by chance
φij < 0: comorbidity is smaller than expected by chance
Range:
φ ~ [-1, 1] for diseases with similar prevalence.
φ ~ (P< / P>)^(1/2) [-1, 1] for diseases with different prevalence, where P< = min(Pi, Pj) and P> = max(Pi, Pj)

 

Relative Risk:

Ratio between the number of patients diagnosed with both diseases and the random expectation based on disease prevalence.

Pros: Intuitive and easy to calculate.
Cons: Underestimates associations between highly prevalent diseases and overestimates associations involving rare diseases.
Values:
RRij > 1: comorbidity is larger than expected by chance
RRij < 1: comorbidity is smaller than expected by chance
Range:
RRij ~ [N/(Pi Pj), N/P>], where P> = max(Pi, Pj)

REFERENCES

1 Kobberling J. & Tattersall R. The Genetics of Diabetes Mellitus (Academic Press, London, 1982)

2 Lusis AJ, Mar R & Pajukanta P. Genetics of atherosclerosis. Annu. Rev. Genomics Hum. Genet. 5, 189-218 (2004)

3 Craddock N, O’ Donovan MC & Owen MJ. The genetics of schizophrenia and bipolar disorder: dissecting psychosis J Med Genet 42, 193-204 (2005)

4 McGuffin et al. The heritability of bipolar affective disorder and the genetic relationships to unipolar depression. Arch Gen Psychiatry 60, 497-502 (2003)

5 Wordsworth P and Bell J. Polygenic susceptibility in rheumatoid arthritis. Ann. Rheum. Dis. 50,343-346 (1991)

6 The Wellcome Trust Case Control Consortium, Genome-wide association study of 14000 cases of seven common diseases and 3000 shared controls, Nature 447 661-677 (2007)

7 Grant SFA et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes, Nature Genetics, 38 320-323.

8 Begovich AB et al. A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am. J. Hum. Genet. 75 330-337 (2004).

9 Hinks, A. Eyre, S. Barton A., Thomson W & Worthington J. Investigation of genetic variation across PTPN22 in UK rheumatoid arthritis (RA) patients. Ann. Rheum. Dis. 66, 683-686 (2006)

10 Both RR were calculated using ICD9 codes at the 5-digit level for the entire study population.

11 Cohen J, Cohen P, West SG, Aiken LS, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. (2003) Third Edition. Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey.


FIGURES

Fig 3. Disease prevalence distribution

Fig 4. Comorbidity distribution calculated for all diseases using ICD9 codes at the 5-digit level.

QUANTIFYING COMORBIDITY STRENGTH:

φ-Correlation

We can quantify the strength of comorbidities by calculating the correlation coefficient associated with a pair of diseases as:

$$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}$$

where Cij is the number of patients affected by both diseases, N is the total number of patients in the studied population, and Pi is the prevalence of the ith disease. The φ-correlation is Pearson’s correlation for dichotomous variables, i.e. variables that only take the values 0 or 1 [1].
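
The calculation can be sketched in a few lines of Python. This assumes the standard form of Pearson's correlation for two binary variables expressed in the counts defined above (Cij, Pi, Pj, N); the function name phi_correlation and the example counts are hypothetical and not taken from the data set.

```python
import math


def phi_correlation(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Pearson's correlation for two binary variables, computed from counts.

    c_ij: patients diagnosed with both diseases
    p_i, p_j: patients diagnosed with disease i and with disease j
    n: total number of patients
    """
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    if denominator == 0:
        raise ValueError("phi is undefined when a disease affects nobody or everybody")
    return (c_ij * n - p_i * p_j) / denominator


# Co-occurrence equal to the random expectation (100 * 200 / 1000 = 20) gives 0.
print(phi_correlation(20, 100, 200, 1000))

# A rare disease fully contained in a common one still scores low (about 0.0995),
# close to sqrt(10 / 1000) = 0.1.
print(phi_correlation(10, 10, 1000, 100000))
```

The second call illustrates the bias noted in Box 1: when prevalences differ strongly, φ stays small even if every patient with the rare disease also has the common one.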

We can determine the significance of φ ≠ 0 by performing a t-test. This consists of calculating t according to the formula:

$$t = \frac{\phi \sqrt{n - 2}}{\sqrt{1 - \phi^2}}$$

where n is the number of observations used to calculate φ. In all of our tables we use n = max(Pi, Pj) << N, which represents the most stringent way in which t can be calculated given our data, as using n = N would produce a larger number of significant links, most of which would not necessarily be strong predictors. To determine the level of significance of t, it is necessary to look it up in a t-table, which is available online or in most statistics books [1]. As a rule of thumb, for n > 1000 any t ≥ 1.96 is significant at the 5% level, whereas for the same n any t ≥ 2.58 is significant at the 1% level.
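
As a hedged illustration, the sketch below computes t using the textbook t-statistic for a correlation coefficient, t = φ·sqrt(n − 2)/sqrt(1 − φ²), which is assumed here to be the formula intended above. The function name phi_t_statistic and the example values are hypothetical; n would be max(Pi, Pj) under the convention described in the text.

```python
import math


def phi_t_statistic(phi: float, n: int) -> float:
    """t-statistic for testing phi != 0, given n observations (assumed textbook form)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(phi) >= 1:
        raise ValueError("t is undefined for |phi| >= 1")
    return phi * math.sqrt(n - 2) / math.sqrt(1 - phi ** 2)


# Hypothetical pair: phi = 0.1, with the more prevalent disease seen in 1002 patients.
t = phi_t_statistic(0.1, 1002)
print(t)          # about 3.18
print(t >= 2.58)  # True: significant at the 1% level by the rule of thumb
```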

Relative Risk

An alternative way of quantifying the correlation between two variables is to calculate their relative risk. The relative risk is the ratio between the observed co-occurrence and that of a null model. If diseases occurred completely independently of each other, the number of patients affected by both diseases would be given by:

$$C^{*}_{ij} = \frac{P_i P_j}{N}$$

Hence the relative risk of a pair of diseases is given by:

$$RR_{ij} = \frac{C_{ij}}{C^{*}_{ij}} = \frac{C_{ij} N}{P_i P_j}$$

which can also be written explicitly in terms of probabilities as:

$$RR_{ij} = \frac{C_{ij}/N}{(P_i/N)\,(P_j/N)}$$
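
The ratio of observed to expected co-occurrence described above can be sketched as follows. The function names expected_cooccurrence and relative_risk and the example counts are hypothetical; the expectation Pi·Pj/N follows from assuming the two diseases occur independently.

```python
def expected_cooccurrence(p_i: int, p_j: int, n: int) -> float:
    """Co-occurrence expected if diseases i and j were independent."""
    return p_i * p_j / n


def relative_risk(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Observed co-occurrence divided by the independence expectation."""
    if p_i == 0 or p_j == 0:
        raise ValueError("relative risk is undefined for a disease with zero prevalence")
    return c_ij * n / (p_i * p_j)


# Hypothetical counts: N = 1000 patients, Pi = 100, Pj = 200.
print(expected_cooccurrence(100, 200, 1000))  # 20.0
print(relative_risk(40, 100, 200, 1000))      # 2.0, twice the chance level
print(relative_risk(1, 100, 200, 1000))       # 0.05, i.e. N / (Pi * Pj)
print(relative_risk(100, 100, 200, 1000))     # 5.0, i.e. N / max(Pi, Pj)
```

The last two calls reproduce the range quoted in Box 1: a single shared patient gives N/(Pi Pj), and full overlap of the less prevalent disease gives N/P>.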

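The significance of the relative risk can be assessed by using the Katz et al. method to estimate confidence intervals [2]. According to their calculations, the 99% confidence interval for the RR between diseases i and j is given by:

$$\left[\, RR_{ij}\, e^{-z\,\sigma_{ij}},\; RR_{ij}\, e^{+z\,\sigma_{ij}} \,\right]$$

where z ≈ 2.58 is the standard normal quantile for a 99% interval and σij is given by:

[equation for σij not preserved]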

REFERENCES

1 Cohen J, Cohen P, West SG & Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. LEA, Mahwah, NJ (2003)

2 Katz, D., Baptista J., Azen, S.P. and Pike M.C. (1978) Obtaining confidence interval for the risk ratio in cohort studies. Biometrics 34, 469-474