Sanjay M. Udoshi MD
Healthcare is one of the most data-intensive industries in the world. A single patient encounter can produce thousands of data points: vital signs, laboratory results, medication records, imaging studies, clinical notes, billing codes, and device telemetry. Multiply that by the billions of encounters that occur across the United States each year, and the scale becomes almost incomprehensible.
Yet for all this data, healthcare outcomes in the United States remain stubbornly uneven. Preventable medical errors remain a leading cause of death. Recommended standards of care are delivered barely half the time. Readmission rates persist at levels that suggest fundamental gaps in care coordination. The challenge is not a lack of data. It is a lack of the infrastructure, methodology, and organizational commitment required to transform that data into actionable knowledge.
Data science — the discipline of extracting insight from structured and unstructured data using statistical, computational, and analytical methods — offers healthcare organizations a powerful set of tools for closing this gap. But the path from data to improved outcomes is neither straightforward nor automatic. It requires deliberate investment in foundations that are often overlooked in the rush to adopt the latest technology.
The single most important investment a health system can make in data science is not a machine learning platform or a visualization tool. It is data standardization. The reason is simple: analytical outputs are only as reliable as their inputs. If the same diagnosis is coded differently across departments, if medication names are inconsistent between pharmacy systems, if laboratory results lack standard reference ranges, then any analysis built on that data will be unreliable.
This is why the OMOP (Observational Medical Outcomes Partnership) Common Data Model has become the gold standard for healthcare analytics. OMOP provides a standardized schema that maps disparate clinical data into a consistent structure using established vocabularies — SNOMED CT for conditions, RxNorm for medications, LOINC for laboratory tests. Once data is transformed into OMOP format, it becomes possible to run identical analyses across institutions, compare populations, and participate in global research networks like OHDSI.
The transformation process — known as ETL (Extract, Transform, Load) — is technically demanding and requires deep understanding of both the source data and the target model. It is not glamorous work. But it is the foundation upon which every subsequent analytical capability depends. Organizations that skip this step and proceed directly to "AI" are building on sand.
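The mapping step at the heart of such an ETL can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the source systems, local codes, and the lookup table `SOURCE_TO_STANDARD` are hypothetical, and the concept ID `9000001` is a placeholder rather than a real vocabulary identifier. The field names loosely follow the OMOP condition table, where an unmapped source value is conventionally recorded with concept ID 0 while the original code is preserved.

```python
from dataclasses import dataclass

# Hypothetical lookup from (source system, local code) to one standard concept.
# 9000001 is a placeholder, NOT a real OMOP concept_id.
SOURCE_TO_STANDARD = {
    ("cardiology", "DM2"): 9000001,
    ("endocrine", "T2DM"): 9000001,
    ("billing", "E11.9"): 9000001,
}


@dataclass
class ConditionRow:
    person_id: int
    condition_concept_id: int      # 0 means "no standard mapping found"
    condition_source_value: str    # original code, kept for auditing


def transform(person_id: int, source_system: str, local_code: str) -> ConditionRow:
    """Map one source record to a standardized row, normalizing the code first."""
    key = (source_system, local_code.strip().upper())
    concept_id = SOURCE_TO_STANDARD.get(key, 0)
    return ConditionRow(person_id, concept_id, local_code)


# Three departments code the same diagnosis three ways; a fourth is free text.
raw_records = [
    (1, "cardiology", "dm2"),
    (2, "endocrine", "T2DM"),
    (3, "billing", "E11.9"),
    (4, "ed", "diabetes?"),
]

rows = [transform(*record) for record in raw_records]
unmapped = [row for row in rows if row.condition_concept_id == 0]

for row in rows:
    print(row)
print(f"{len(unmapped)} of {len(rows)} records need manual mapping review")
```

Keeping the unmapped rows rather than discarding them is a deliberate choice in this sketch: the share of records that fail to map is itself a useful data quality measure.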
Once data is standardized, predictive modeling becomes a powerful tool for anticipating clinical events before they occur. Predictive models use historical patterns to estimate the probability of future outcomes — hospital readmission, sepsis onset, disease progression, medication non-adherence, or clinical deterioration.
The clinical value of prediction lies in the ability to intervene. A model that identifies patients at high risk of 30-day readmission, for example, can trigger proactive care coordination — follow-up calls, medication reconciliation, home health referrals — that reduces the likelihood of the predicted event. A sepsis prediction model that generates alerts hours before clinical deterioration gives care teams time to investigate and intervene.
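To make the link between prediction and intervention concrete, here is a small sketch of a risk score that feeds an outreach worklist. Everything in it is assumed for illustration: the feature names, the logistic coefficients in `COEFFICIENTS`, and the 0.3 threshold are invented, not taken from any validated readmission model.

```python
import math

# Hypothetical coefficients for illustration only; not a validated model.
COEFFICIENTS = {
    "intercept": -3.0,
    "prior_admissions_12m": 0.4,
    "length_of_stay_days": 0.05,
    "lives_alone": 0.6,
}


def readmission_risk(patient: dict) -> float:
    """Return a probability from a logistic model over the patient's features."""
    z = COEFFICIENTS["intercept"]
    for name, weight in COEFFICIENTS.items():
        if name != "intercept":
            z += weight * patient[name]
    return 1.0 / (1.0 + math.exp(-z))


def plan_outreach(patients: list, threshold: float = 0.3) -> list:
    """List (patient_id, risk) pairs at or above the threshold, highest risk first."""
    flagged = []
    for patient in patients:
        risk = readmission_risk(patient)
        if risk >= threshold:
            flagged.append((patient["patient_id"], risk))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


patients = [
    {"patient_id": "A", "prior_admissions_12m": 4, "length_of_stay_days": 10, "lives_alone": 1},
    {"patient_id": "B", "prior_admissions_12m": 0, "length_of_stay_days": 2, "lives_alone": 0},
]

worklist = plan_outreach(patients)
for patient_id, risk in worklist:
    print(f"Patient {patient_id}: risk {risk:.2f}, schedule follow-up call")
```

The threshold is an operational decision as much as a statistical one: it should reflect how many follow-up calls and referrals the care coordination team can actually deliver.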
But predictive modeling in healthcare carries important caveats. Models trained on historical data inherit the biases present in that data. If certain populations have historically received less aggressive treatment, the model may learn to predict worse outcomes for those populations without recognizing that the disparity is a product of care delivery, not biology. Careful attention to fairness metrics, subgroup performance analysis, and ongoing monitoring for model drift is essential.
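One simple form of subgroup performance analysis is to compute the same metric separately for each group and compare. The sketch below does this for sensitivity (the share of true events the model caught). The group labels and the records are fabricated toy data chosen only to show the calculation; a real audit would use held-out clinical data and more than one metric.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Compute sensitivity per group from (group, actual, predicted) triples.

    actual and predicted are 1 for the event and 0 otherwise. Groups with no
    actual events are omitted because sensitivity is undefined for them.
    """
    true_positives = defaultdict(int)
    false_negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 1:
            if predicted == 1:
                true_positives[group] += 1
            else:
                false_negatives[group] += 1

    result = {}
    for group in set(true_positives) | set(false_negatives):
        caught = true_positives[group]
        missed = false_negatives[group]
        result[group] = caught / (caught + missed)
    return result


# Toy data: the model catches most events in group_a but few in group_b.
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 1, 0),
    ("group_a", 0, 0), ("group_b", 0, 0),
]

per_group = sensitivity_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group)
print(f"Sensitivity gap between groups: {gap:.2f}")
```

A model with acceptable overall accuracy can still hide a gap like this one, which is why aggregate metrics alone are not enough.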
Beyond individual-level prediction, data science enables population-level analysis that can reveal patterns invisible at the point of care. Cohort analytics — the systematic comparison of defined patient groups — allows health systems to answer questions that are critical to quality improvement and resource allocation.
Which patient populations account for the highest utilization of emergency services? What treatment pathways are associated with the best outcomes for specific conditions? Where are the gaps between recommended care and actual care delivery? How do outcomes vary across demographic groups, geographic regions, or insurance categories?
These questions cannot be answered by reviewing individual charts. They require the aggregation, standardization, and systematic analysis of data across thousands or millions of encounters. The OHDSI network, which now spans more than 800 million patient records across dozens of countries, demonstrates the power of this approach. Network-based studies have generated real-world evidence on drug safety, treatment effectiveness, and disease epidemiology that would be impossible to produce through traditional clinical trials alone.
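At its simplest, a cohort comparison is an aggregation over encounter records grouped by a cohort definition. The following sketch computes emergency department visits per patient for two cohorts. The cohort labels, visit types, and encounter tuples are invented for illustration; in practice the input would come from a standardized encounter table rather than an in-memory list.

```python
from collections import Counter, defaultdict


def ed_visits_per_patient(encounters):
    """Return average ED visits per distinct patient for each cohort.

    encounters is an iterable of (patient_id, cohort, visit_type) tuples.
    """
    patients_in_cohort = defaultdict(set)
    ed_visit_count = Counter()
    for patient_id, cohort, visit_type in encounters:
        patients_in_cohort[cohort].add(patient_id)
        if visit_type == "ED":
            ed_visit_count[cohort] += 1
    return {
        cohort: ed_visit_count[cohort] / len(patient_ids)
        for cohort, patient_ids in patients_in_cohort.items()
    }


# Toy encounter data with hypothetical cohort labels.
encounters = [
    (1, "cohort_x", "outpatient"),
    (1, "cohort_x", "ED"),
    (2, "cohort_x", "outpatient"),
    (3, "cohort_y", "ED"),
    (3, "cohort_y", "ED"),
    (4, "cohort_y", "ED"),
    (4, "cohort_y", "outpatient"),
]

rates = ed_visits_per_patient(encounters)
for cohort, rate in sorted(rates.items()):
    print(f"{cohort}: {rate:.1f} ED visits per patient")
```

Dividing by distinct patients rather than by encounters matters here: otherwise a cohort with many routine visits would appear to use emergency services less than it does.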
An estimated 80% of clinical data is unstructured — contained in physician notes, radiology reports, pathology results, and nursing assessments. This unstructured data contains rich clinical detail that is largely inaccessible to traditional analytics. A physician's narrative description of a patient's symptoms, for example, often contains information that is far more nuanced than anything captured in a structured diagnosis code.
Natural language processing (NLP) offers the ability to extract structured information from these unstructured sources. NLP pipelines can identify mentions of symptoms, medications, and diagnoses in clinical text; classify the sentiment and certainty of clinical assertions; detect negation and temporality; and map extracted concepts to standard vocabularies.
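Negation detection is a good example of why clinical text needs more than keyword search: "denies fever" mentions fever but asserts its absence. The sketch below uses a deliberately simplified rule, in the spirit of rule-based approaches such as NegEx, that marks a symptom as negated when a trigger word appears within a few tokens before it in the same sentence. The trigger list, the symptom lexicon, and the window size are assumptions for illustration, and real pipelines handle far more cases (scope termination, post-term negation, uncertainty, temporality).

```python
import re

# Small illustrative lists; real systems use much larger curated lexicons.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
SYMPTOM_TERMS = {"fever", "cough", "chest pain"}


def extract_symptoms(note: str, window: int = 3):
    """Return sorted (symptom, is_negated) pairs found in a clinical note.

    A symptom is treated as negated when a trigger word occurs within
    `window` tokens before it in the same sentence.
    """
    findings = []
    for sentence in re.split(r"[.;]\s*", note.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for term in SYMPTOM_TERMS:
            term_tokens = term.split()
            width = len(term_tokens)
            for start in range(len(tokens) - width + 1):
                if tokens[start:start + width] == term_tokens:
                    preceding = tokens[max(0, start - window):start]
                    negated = any(tok in NEGATION_TRIGGERS for tok in preceding)
                    findings.append((term, negated))
    return sorted(findings)


note = "Patient reports cough for three days. Denies fever or chest pain."
for symptom, negated in extract_symptoms(note):
    status = "absent" if negated else "present"
    print(f"{symptom}: {status}")
```

Once extracted, each finding could be mapped to a standard vocabulary concept and stored alongside structured data, with the negation flag preserved so that absent findings are not counted as present.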
When combined with structured data in a standardized model like OMOP, NLP-derived information can substantially enrich the analytical foundation available to clinicians and researchers. Phenotyping algorithms that combine structured codes with NLP-extracted features often outperform those based on structured data alone, because the notes capture findings that never reach a diagnosis code.
Technology alone is insufficient. Organizations that successfully leverage data science for improved outcomes share several characteristics: executive commitment to data as a strategic asset; investment in interdisciplinary teams that combine clinical, technical, and analytical expertise; governance structures that ensure data quality, privacy, and ethical use; and a culture that values evidence over intuition.
The interdisciplinary dimension is particularly important. Data science in healthcare cannot be the exclusive domain of technologists. Clinicians must be involved in defining the questions, interpreting the results, and validating the outputs. Informaticists must bridge the gap between clinical knowledge and technical implementation. Ethicists must ensure that analytical tools are deployed equitably. And operational leaders must create the organizational conditions that allow data-driven insights to translate into changed practice.
The journey from raw healthcare data to improved patient outcomes follows a well-established hierarchy: data becomes information when it is organized and contextualized; information becomes knowledge when it is analyzed and interpreted; and knowledge becomes wisdom when it is applied with judgment, empathy, and ethical awareness.
Data science provides the methods for the first two steps of this journey: turning data into information, and information into knowledge. But the final step, the application of knowledge with wisdom, remains irreducibly human. The clinician who uses a predictive model's output to have a difficult conversation with a patient about goals of care; the quality officer who interprets a cohort analysis to redesign a care pathway; the researcher who identifies a safety signal in real-world data and acts to protect patients: these are acts of wisdom that no algorithm can perform.
At Acumenus, our name reflects this aspiration. Acumen — the ability to make good judgments and quick decisions — is what we seek to enable through our platforms and services. We believe that the right data infrastructure, combined with the right analytical tools and the right clinical expertise, can make the gap between what we know and what we do in healthcare dramatically smaller. That is the promise of data science in healthcare — not to replace clinical wisdom, but to amplify it.
Dr. Udoshi is Medical Director of Informatics at Acumenus Data Sciences and a recognized authority on OMOP CDM implementation and healthcare data architecture.