Data Warehousing, Mining and Migration

Introduction

Pharmaceutical companies, large and small, struggle with how best to manage data in general, but image data brings its own set of challenges that increase the complexity, diversity and size of the data sets. In addition, warehousing this data so that it can be accessed as new uses are found, or migrated when it needs to be routed elsewhere, is a challenge for many organizations. Licensing, merging or divesting a drug or pipeline of drugs is one example of data requiring migration to a new company or location. In over 90% of the cases where imaging is used in a clinical study, it serves as an efficacy endpoint. This means that in licensing, acquisition or general due diligence on a drug’s efficacy results, the image analysis performed, and therefore the images themselves, is one of the most important endpoints for determining the value of the drug, since it discloses the effect the drug has on patient disease management and, in turn, the drug’s financial importance. Without proper warehousing and storage of these large and complex datasets, it can be nearly impossible to rebuild the datasets, provide access to the measurements made on the images (e.g. the longest diameter of a target tumor under the oncology criteria RECIST 1.1) and demonstrate the reproducibility of the evidence behind results that were likely provided to a regulatory agency. Without reproducibility, efficacy rests on faith instead of hard evidence.

Most imaging studies are conducted at third-party locations such as independent or university corelabs. These labs have processes and systems built to manage the data for the duration of a study and through the conduct of a trial. When the trial has completed enrollment and all the analysis from the blinded reads is finished, the data is returned to the pharmaceutical company. In most cases, the company receives a box full of digital media, possibly a hard drive holding the all-important source images used for the blinded read and screen captures of the measurements produced by the readers, along with an output file of the case report form (CRF) information collected from the readers. In some cases, the clinical team receiving the data has no idea what to do with it and sends it off in a box for long-term storage at facilities such as Iron Mountain, where it sits on fragile media and drives with limited shelf lives. If the corelab retains the information, it is likewise sent to long-term storage until recalled, which can take three to six months.

The clinical team often has no understanding of the usefulness and value of that data beyond the clinical study just completed and is usually oblivious to the importance of the image data in other areas of research within the organization. The financial impact is that an enormous amount of money has been paid to acquire that data, and there are groups, such as researchers in translational medicine, with great interest in accessing it. For example, an organization may want to find better and more accurate biomarkers for drugs in the pipeline, or to quickly provide decision makers with analysis affecting strategic decisions about a drug or set of drug candidates. Without access to, or even knowledge of, this data, those groups must resort to running additional studies to acquire the data they need. The findings can be less diverse (e.g. from a single center) than a similar dataset acquired in a large Phase III global study, which could yield much more accurate information on new biomarkers and lead to faster, more accurate assessment of a drug and faster go/no-go decisions. In addition, the budget of the translational medicine team is likely limited, allowing only smaller studies to be run and forcing greater assumptions about the accuracy of a new biomarker without additional data. The cost of the additional data, and the cost of retrieving and re-organizing the data from a prior clinical study, can be significant in both money and time. This could be mitigated by proper management and organization of image data in a central repository with tools designed to mine image data and organize the measurements and results derived from it.

Case Assessment:

To explain more clearly, let’s look at real-life examples of the financial cost of acquiring image data across a common oncology pipeline and the cost to maintain a warehouse archive. We will use this therapeutic segment because it is the largest in drug development, representing about one-third of all drug trials being conducted at any one time.

First, we need to understand the average variations in the cost of the imaging component of clinical trials by phase of research.

First Trial: Translational & Phase I based Oncology Clinical Trial

Studies run in Phase I that include imaging are similar in time and cost to those run by translational medicine groups; therefore, we will use this average sampling as a cost factor for both study types.

  • Phase I
  • Patients: 50
  • Sites: 2
  • Duration: 1 Year
  • 3 Timepoints/Patient
  • 1 Reader
  • 20% Data Clarifications/Queries Required to Clean Data

These Phase I studies can also significantly exceed this cost estimate in many cases. For instance, in several cases we have used Phase I imaging to determine the best possible imaging criteria for later-phase studies. In a recent oncology vaccine study, we assessed RECIST, mRECIST and CHOI for every patient, and where the tumor criteria overlapped, we used the same tumor measurements to help determine the best criteria for the next phase of clinical development. In these cases, the cost estimates can approach the cost of running a Phase II study, as seen below.

Second Trial: Oncology Trial Using RECIST 1.1 as the Criteria for Imaging Assessment

  • Phase II
  • Patients: 250
  • Sites: 10
  • Duration: 1 Year
  • 3 Timepoints/Patient
  • 2 Readers Plus Adjudicator
  • 20% Adjudication Rate
  • 20% Data Clarifications/Queries Required to Clean Data

Third Trial: Oncology Trial Using RECIST 1.1 as the Criteria for Imaging Assessment

  • Phase III
  • Patients: 600
  • Sites: 30
  • Duration: 2 Years
  • 5 Timepoints/Patient
  • 2 Readers Plus Adjudicator
  • 20% Adjudication Rate
  • 20% Data Clarifications/Queries Required to Clean Data

Average Cost of Imaging Study at a Site:

CT: $1,500
MRI: $2,600
Assume: 95% CT and 5% MRI

For Phase I and Phase II, insurance coverage is assumed at 75% because insurance deductibles will usually require out-of-pocket costs for at least the first imaging study, which would likely fall back on the trial sponsor to pay.

For Phase III, 50% of the cost of imaging may be paid by insurance, but additional studies may be imaged that could fall outside insurance coverage.
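The per-scan costs, modality mix and coverage assumptions above can be combined into a simple sponsor-cost model. A minimal sketch in Python, using only the figures stated in the text; the function names and the Phase II example parameters are illustrative:

```python
# Sketch of a per-trial imaging acquisition cost model.
# Modality costs, mix, and coverage rates come from the text above;
# the structure and names are illustrative.

CT_COST = 1500                     # average site cost per CT study ($)
MRI_COST = 2600                    # average site cost per MRI study ($)
CT_SHARE, MRI_SHARE = 0.95, 0.05   # assumed modality mix

def blended_scan_cost():
    """Average cost of one imaging study given the 95%/5% CT-MRI mix."""
    return CT_SHARE * CT_COST + MRI_SHARE * MRI_COST

def sponsor_imaging_cost(patients, timepoints, insurance_coverage):
    """Estimated sponsor-paid imaging acquisition cost for one trial.

    insurance_coverage: fraction of scan cost absorbed by insurance
    (0.75 assumed for Phase I/II, 0.50 for Phase III, per the text).
    """
    total_scans = patients * timepoints
    return total_scans * blended_scan_cost() * (1 - insurance_coverage)

# Example: the Phase II trial above (250 patients, 3 timepoints each).
phase2 = sponsor_imaging_cost(patients=250, timepoints=3,
                              insurance_coverage=0.75)
```

Under these assumptions the blended scan cost works out to about $1,555, so the Phase II example leaves roughly $290K of acquisition cost with the sponsor before read and corelab fees.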

Below we have taken the publicly disclosed pipeline of AbbVie from its website as a case example to see how the costs stack up.

AbbVie Pipeline from Website:

Phase 1                           | Phase 2                                            | Phase 3
ABBV-838 (Multiple Myeloma)       | ABT-199 (AML)                                      | ABT-199 (CLL; Relapsed/Refractory)
ABT-399 (Solid Tumors)            | ABT-199 (iNHL/DLBCL)                               | ABT-199 (CLL; Front-line, Unfit)
ABT-165 (Solid Tumors)            | ABT-199 (Multiple Myeloma)                         | Imbruvica (Pancreatic Cancer)
RTA-ABT 408 (Solid Tumors)        | Duvelisib (iNHL; R/R)                              | Imbruvica (DLBCL; TN)
ABBV-075 (Solid Tumors & Hem Onc) | ABT-414 (GBM)                                      | Imbruvica (FL; R/R)
ABBV-221 (Solid Tumors)           | Imbruvica (Multiple Myeloma)                       | Imbruvica (MCL; TN)
BTK Inhibitor (Autoimmune)        | Imbruvica (AML)                                    | Duvelisib (CLL; R/R)
Imbruvica (Solid Tumors)          | Imbruvica (ALL)                                    | Elotuzumab (Multiple Myeloma; TN)
ABBV-084 (SLE)                    | Imbruvica (MZL; R/R)                               | Veliparib (NSCLC; Squamous)
ABBV-672 (Alzheimer’s)            | Imbruvica (MZL; R/R)                               | Veliparib (NSCLC; Non-squamous)
ABT-957 (Alzheimer’s)             | Imbruvica (Graft V Host)                           | Veliparib (Breast Cancer; Neoadjuvant)
ABBV-8E12 (PSP & AD)              | ABT-122 (RA)                                       | Veliparib (Breast Cancer; BRCA)
ABBV-974 (Cystic Fibrosis)        | ABT-122 (PsA)                                      | Veliparib (Ovarian Cancer)
                                  | ABT-494 (Crohn’s Disease)                          | ABT-494 (RA)
                                  | ABT-494 (Crohn’s Disease)                          | ABT-494 (RA)
                                  | ABT-981 (Osteoarthritis)                           | ABT-493 / ABT-530 (HCV)
                                  | ALX-0061 (RA)                                      | Elagolix (Endometriosis)
                                  | Elagolix (Uterine Fibroids; Phase III start 1Q16)  | Atrasentan (Diabetic Nephropathy)

Based on the above pipeline, the estimated number of imaging studies by phase would be as follows:

Total costs to implement imaging in each of these studies, based on the per-phase estimates given above:

This total implies that for a pipeline this rich in oncology drugs, there would be approximately $46M in costs associated with using imaging. Of that, the total cost of outsourcing the imaging component of these studies to corelabs is estimated as follows:

The above table shows the estimated total of approximately $17M in outsourcing dollars going to external corelabs.

That is approximately $46M spent on imaging data alone, with $17M of it going to outsourcing the management and handling of that data without regard to how it can be warehoused or mined post-study. That is a tremendous loss of valuable data if it is not organized and made usable for research, with the potential to be mined to discover, qualify and validate biomarkers for other studies in the pipeline. Mining this data could also uncover additional indications for use, and identify the specific patient populations and tumor types most responsive to a drug. It could also help researchers better understand a drug’s mechanism of action, which could revive a promising drug that failed in late-stage trials by revealing additional uses; this matters especially after such a significant investment has been made in bringing a drug so far along only to have it fail in Phase II or Phase III.

The cost to warehouse such a vast repository of data is minimal compared to the cost of acquiring the image data and the time and effort needed to mine it manually without a tool to organize and manage it. The time needed to manually re-organize data can be significant enough that providing timely insight on important decisions about a drug or drug candidate becomes infeasible, potentially eliminating the most important data analysis available for strategic decisions affecting budgeting and future indications.

If we use the estimates above to determine the effort and time needed to sort, organize and analyze such a dataset, we can estimate the values below:

Warehousing and Mining Financial Costs:

Assumptions of Manual Management of Imaging Data:

  • Imaging Studies per Timepoint: Chest, Abdomen, Pelvis (for the sake of estimation, we leave out Brain)
  • Average Size of Imaging Study: an estimated 200 slices/images represents approximately 100MB per study, or approximately 300MB per timepoint
  • Assuming Only the Read Data is Organized
  • Data Checked/Cleaned and Organized into Folders/Directories
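The storage assumptions above translate directly into a raw-volume estimate. A minimal sketch, assuming the per-study size and three studies per timepoint stated in the bullets (the function and parameter names are illustrative):

```python
# Sketch of a raw image storage estimate from the assumptions above:
# ~100 MB per imaging study (~200 slices), and 3 studies per timepoint
# (chest, abdomen, pelvis; brain excluded).

MB_PER_STUDY = 100          # approximate size of one imaging study
STUDIES_PER_TIMEPOINT = 3   # chest, abdomen, pelvis

def trial_storage_gb(patients, timepoints):
    """Approximate raw image volume for one trial, in GB."""
    mb = patients * timepoints * STUDIES_PER_TIMEPOINT * MB_PER_STUDY
    return mb / 1024

# Example: the Phase III trial above (600 patients, 5 timepoints)
# yields 600 * 5 * 300 MB = 900,000 MB of source images alone.
phase3_gb = trial_storage_gb(600, 5)
```

Under these assumptions a single Phase III trial approaches a terabyte of source imagery before any derived analysis files are stored, which is why folder-level manual management scales poorly across a pipeline.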

We have estimated 15 to 30 minutes to review the image data to verify its content: that the dataset is accurate, complete and clean, that the data corresponds to the assessment files, and that it is stored with correct data identifiers. This is routinely needed even though the data has already been through a corelab QC process. This process was also the subject of an award-winning poster presented at the Society for Clinical Data Management (SCDM) in 2015.

Based on the estimates above, it would cost between $430,400 and $860,800, or approximately $645,600, to manually QC and organize the above pipeline of image data, not including hardware costs.

As a yearly cost, assuming the estimated 25 clinical studies from the AbbVie website pipeline: $197,600/Yr. – $496,000/Yr., or approximately $346,800/Yr., is the cost of manually managing the image data, which equates to approximately four full-time employees.
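The labor figures above follow from the per-timepoint review time. A sketch of that arithmetic; the hourly rate is an illustrative assumption, since the text does not state the rate behind its dollar figures:

```python
# Sketch of the manual QC labor estimate: each imaging timepoint is
# reviewed for 15-30 minutes (per the text). HOURLY_RATE is an
# assumption for illustration, not a figure from the text.

REVIEW_MINUTES = (15, 30)   # low/high review time per timepoint
HOURLY_RATE = 60            # assumed fully loaded labor cost ($/hr)

def qc_cost_range(total_timepoints):
    """(low, high) manual QC labor cost for a set of imaging timepoints."""
    return tuple(total_timepoints * minutes / 60 * HOURLY_RATE
                 for minutes in REVIEW_MINUTES)

# Example: the Phase III trial above contributes 600 patients x 5
# timepoints = 3,000 timepoints to review.
low, high = qc_cost_range(3000)
```

Summing such ranges across every study in a pipeline, plus the hours spent on folder organization, is how a manual-management estimate of this kind is built up.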

This estimate would also be incurred alongside the time needed for any due diligence performed for business purposes, such as licensing or divestiture of a drug or pipeline. Additionally, it may be necessary to have image data at the sponsor’s disposal at all times for review by a regulatory body, whether for audits or during the approval process. Also, as there is no information on the number of translational medicine studies performed, it is not possible to estimate a value; typically, however, it is in the range of 10–15 additional studies at costs similar to those of Phase I studies.

If the system is set up as the primary repository for an ongoing clinical study, these batch import costs can be eliminated, though a QA/QC process would still likely be needed. This can be done via contracted technologists or à la carte services from the system vendor, if needed. If an external corelab is still being used, either batch import of the data post-study or a DICOM push can be set up within the system to provide the data to the corelab through the system, giving the study team instant access to the image data and immediately organizing the data during the study.

Annual System Licensing costs: $100,000 – $250,000/Year or approximately $175,000/Yr.

This works out to about $4,000 – $10,000, or $7,000 on average, per study per year in licensing costs.

Total estimated cost of online warehouse and data mining tools: $142,240 – $292,240 or approximately $217,240/Yr. with the inclusion of the time cost estimate above.

These costs are based on the hosting and hardware being supplied by the sponsor.

In addition to the cost savings defined above, there is a time cost savings as well to gain access to the data. The data access time, when using a warehousing tool, is measured in days vs. weeks or months. The ability to quickly organize data into meaningful datasets for analysis and the ability to maintain analysis files with the data also equate to time savings. It also provides a method to allow controlled access to the data for a global research team. Such global access could have substantial additional costs if access is needed by remote teams as the data is very large in comparison to most other data types.

Utilizing systems and tools to organize and warehouse image data creates an easy-to-use access point for many powerful image analysis systems and minimizes the time and effort necessary to perform analysis. For example, a mining tool can automatically create a cohort of data, i.e. a dataset organized according to a set of search criteria applied to the associated metadata, and instantly expose it to researchers via a folder-based access point containing only the defined cohort. The researcher can then immediately run analyses using the most commonly available tools, such as MATLAB, which has a large number of toolkits specifically developed for image analysis (e.g. MIAKAT or SPM). In addition, any analysis files created can be stored back to the repository, which maintains the association between the analysis and the data used to produce it, keeping data and analysis together for easy management and reproducibility.
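The cohort-building step described above amounts to filtering an image-metadata index and exposing only the matching files. A minimal pure-Python sketch of the idea; the index records, field names and paths are hypothetical, and a production system would query a database of DICOM metadata rather than an in-memory list:

```python
# Sketch of metadata-driven cohort creation. All records, field names,
# and paths below are hypothetical examples.

index = [
    {"patient": "P001", "indication": "NSCLC", "modality": "CT",
     "timepoint": 1, "path": "/archive/P001/tp1/ct"},
    {"patient": "P002", "indication": "CLL", "modality": "CT",
     "timepoint": 1, "path": "/archive/P002/tp1/ct"},
    {"patient": "P003", "indication": "NSCLC", "modality": "MRI",
     "timepoint": 2, "path": "/archive/P003/tp2/mri"},
]

def build_cohort(index, **criteria):
    """Return records matching every criterion, e.g. indication='NSCLC'."""
    return [rec for rec in index
            if all(rec.get(key) == value for key, value in criteria.items())]

cohort = build_cohort(index, indication="NSCLC", modality="CT")
# The matching paths could then be exposed to analysts as a folder-based
# access point containing only this cohort.
paths = [rec["path"] for rec in cohort]
```

The design point is that the cohort is defined declaratively by criteria against metadata, so the same archive can serve many research questions without physically re-organizing the images.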

Additionally, a complete solution, i.e. data warehousing and mining systems linked with a workflow management system that manages image data through the clinical trial life cycle and integrates with EDC (electronic data capture) systems and read analysis tools, can be used to conduct central reviews of the image data stored in the repository/warehouse. With such a solution, full corelab functionality can be brought in-house, or smaller-scale blinded reviews can be conducted on cohorts of data, for example to define new indications for use or specific target populations. A system with this extended capability can also give site readers access to the image data to perform reads in place of a central review at an expensive corelab, as suggested in the widely discussed oncology audit methodologies presented at the FDA’s ODAC meeting in July 2012, which are being used in numerous clinical studies. The additional functionality would incur additional fees, but against the estimated $17M in external outsourcing, it is likely more economical than outsourcing entirely. In addition, many external contract groups specializing in imaging for clinical trials can provide all the extended resource needs; industry corelab resources, by contrast, are rarely à la carte or tailored to the study team’s resource needs.

Conclusion:

The ability to warehouse and mine image data from clinical studies is an economical, time-saving and valuable resource. It provides researchers with essential access to data and analysis tools from anywhere in the world. It also allows clinical sites to view and analyze the imaging endpoints as needed, and offers a means to quickly bring additional corelab functionality into an online environment, if and when needed. Furthermore, it allows greater flexibility in deciding when outsourcing of imaging studies is truly necessary versus insourcing, whether for small studies, for small cohorts of a study in which imaging is being used, or for organizations considering bringing the entire corelab in-house.

In addition, the ability to quickly provide the due diligence needed for impactful business decisions, as well as the potential to revive drugs in the pipeline with new sets of endpoints and patient populations, makes a strong case for maintaining and warehousing images. Using this vastly important data source to help define and build new and innovative biomarkers for efficacy and safety is another major benefit. Image data stored in a warehouse can also be mined to help determine the most efficient criteria to assess in later-phase research studies, giving a clearer picture of the efficacy or safety endpoints.

Overall, there is tremendous value in keeping imaging data at the disposal of researchers. Warehousing and mining tools have only recently been developed to specifically address the needs of the clinical research community. The trend of maintaining image data post-study and using it to further research and define better biomarkers is growing as researcher access increases. Through researchers’ use of the data, faster decisions are being made as a result of better knowledge and refinement of endpoints.