Toward a CRO Management Solution – Workflow Informatics Corp.

As a research informatics consultant, I have the pleasure of observing trends across both small and large customers. One undeniable trend is the outsourcing of synthetic chemistry and biological assays. In many cases, multiple outsourcing partners provide similar services. Not surprisingly, the transfer and management of scientific data to and from the CROs can be challenging. Here, I will try to summarize the workflows and some of the biggest challenges.

First, let’s examine the data workflows related to synthetic chemistry. Before synthesis begins, target chemical structures must be identified. Medicinal chemists generate a virtual library of molecules, which must be stored in some sort of chemical database. Next, a decision must be made as to which structures will be synthesized and by whom. This may be a combination of in-house chemistry and one or more CROs. In most cases, a suggested synthetic plan is devised and transferred to the CROs. This represents a challenge, because most chemical and reaction databases were not designed to store or manage experimental procedures. Chemical ELNs may be the closest equivalent to a solution for the management of this data, but they lack some important tracking capabilities.

Companies need to track the request and fulfillment of synthetic targets. There may be multiple IDs to track at this point as well, such as the request ID from the virtual library database and the external batch ID from the CRO. It can be very difficult to manage ideation, request, and chemical registration in a single database. In our experience, compounds that are synthesized and delivered are better off being registered in a distinct chemical database.

Next, there is the creation of samples from synthesized molecules. Often, samples of samples are created and transferred to the entity that will carry out the biological assays.
This introduces the need for a sample database, with information about the current volume, concentration, molecular identity, and lineage of each sample. In addition, there must be a mechanism for the request and fulfillment of samples, and potentially plate formats with dilution schemes. CROs will often maintain their own sample inventory. It’s not uncommon to find duplication of sample inventories between the CRO and the requesting company. It is also common for companies to track inventory in a spreadsheet for as long as possible. This usually becomes untenable at around 3-5K compounds.

The next issue is the transfer, preparation, and storage of assay data. Initial assay results almost universally find their way into Excel spreadsheets, with some sort of molecular batch ID associated with the biological results. Numerous conversations will occur about prefixes, the number of digits in compound IDs, and whether to attach the batch ID with a hyphen. Beyond the identifiers, there is a general lack of standardization for the assay data itself (both in format and nomenclature), so the data need to be transformed prior to loading into an assay database. Once data are loaded, batch IDs need to relate back to the chemical registration database to provide chemical awareness. Finally, SAR studies and modeling are applied to the results in order to guide the selection of new synthetic targets, which brings us back to the start of the iterative workflow.

Even within a single organization, it can be difficult to manage this workflow, but the transfer of information to and from multiple CROs adds a layer of complexity. One of the weakest links in this workflow is the transfer of synthetic schemes, the updating of these schemes, and reporting on the success and failure of syntheses. Typically, some sort of shared drive platform (e.g. Egnyte, SharePoint) is used to transfer synthetic schemes.
Tracking the progress of synthetic targets often involves annotation of individual molecules in the virtual library database (e.g. ordered, in progress, failed, completed). Email may also play a central role in request, fulfillment, and tracking. As actual batches are synthesized, they are often recorded in the chemical registration database, with an identifier to track them back to the virtual library database. QC data are often uploaded to the chemical registration database as well. Much of this data may also be duplicated in ELNs. The result is that scientists consult a variety of disconnected data sources in order to get a full understanding of a project’s progression.

Another weak link is the transfer of biological assay results. CROs often develop their own spreadsheet templates for reporting results. Invariably, these spreadsheet data need to be transformed before loading into a biological assay result database at the receiving company. We often see metadata (e.g. study IDs, report dates) incorporated into separate spreadsheet tabs or into the file name. Transforming the data into a loadable format is generally a tedious task. Furthermore, “standardized” spreadsheet formats often evolve, such that they are no longer standardized enough to be consumed by automated methods.

Currently, there is no single software solution that manages this entire workflow effectively. Below, we map out the workflow.

Chemical synthesis request and fulfillment statuses:

Ideation (creation/enumeration of synthesis candidates)
Chemical registration (loading of structure into chemical registration database)
Selection (selection of representative or specific synthetic targets)
Request (request synthesis)
- Electronic transfer of requested structure and experimental procedure
- CRO acknowledges receipt of request
Synthetic progress
- Assignment of synthesis to chemist(s)
- Synthesis initiated (in progress)
- Analysis/QC completed
- Synthesis failed
  - Abandon
  - Initiate new Request
- Synthesis completed
Transfer/storage of physical material
- Transfer to another department in existing CRO
- Transfer to another CRO
- Transfer to requesting company
Receipt (electronic acknowledgement that the synthesis is complete)
Chemical registration (loading of structure into chemical registration database)
Samples prepared (solid and/or solution)
Samples loaded into sample database
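
The statuses above lend themselves to a small state machine, which is one way a tracking tool could reject out-of-order updates (for example, marking a synthesis completed before the CRO has acknowledged the request). The following is a minimal sketch, not a description of any particular product: the names `SynthesisStatus`, `ALLOWED_TRANSITIONS`, and `SynthesisRequest` are hypothetical, the list is condensed to a subset of statuses, and the allowed transitions are our assumption about a typical project.

```python
from dataclasses import dataclass, field
from enum import Enum


class SynthesisStatus(Enum):
    # Condensed, hypothetical subset of the statuses listed above.
    IDEATED = "ideated"
    SELECTED = "selected"
    REQUESTED = "requested"
    ACKNOWLEDGED = "acknowledged by CRO"
    ASSIGNED = "assigned to chemist"
    IN_PROGRESS = "in progress"
    QC_COMPLETED = "analysis/QC completed"
    FAILED = "failed"
    ABANDONED = "abandoned"
    COMPLETED = "completed"
    TRANSFERRED = "material transferred"
    RECEIVED = "received"
    REGISTERED = "registered"


S = SynthesisStatus

# Assumed transition rules; a real project would agree these with its CROs.
ALLOWED_TRANSITIONS = {
    S.IDEATED: {S.SELECTED},
    S.SELECTED: {S.REQUESTED},
    S.REQUESTED: {S.ACKNOWLEDGED},
    S.ACKNOWLEDGED: {S.ASSIGNED},
    S.ASSIGNED: {S.IN_PROGRESS},
    S.IN_PROGRESS: {S.QC_COMPLETED, S.FAILED},
    S.QC_COMPLETED: {S.COMPLETED, S.FAILED},
    # Simplification: a re-request reuses the same record; a real system
    # might create a new request record instead.
    S.FAILED: {S.ABANDONED, S.REQUESTED},
    S.COMPLETED: {S.TRANSFERRED},
    # Material may move between departments or CROs more than once.
    S.TRANSFERRED: {S.TRANSFERRED, S.RECEIVED},
    S.RECEIVED: {S.REGISTERED},
}


@dataclass
class SynthesisRequest:
    request_id: str  # e.g. the ID from the ideation database
    status: SynthesisStatus = S.IDEATED
    history: list = field(default_factory=list)

    def advance(self, new_status: SynthesisStatus) -> None:
        """Move to new_status, or raise ValueError if the step is not allowed."""
        if new_status not in ALLOWED_TRANSITIONS.get(self.status, set()):
            raise ValueError(
                f"{self.request_id}: {self.status.value} -> {new_status.value} not allowed"
            )
        self.history.append(self.status)
        self.status = new_status


request = SynthesisRequest("REQ-0001")
request.advance(S.SELECTED)
request.advance(S.REQUESTED)
print(request.status.value)
```

Keeping the history on the record gives a simple audit trail, which is the information that otherwise tends to be scattered across email and shared drives.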

One scenario that can reduce confusion, as alluded to above, is to have two distinct chemical registration databases: one for ideation, and one for compounds that are successfully synthesized. IDs from the ideation database can double as request IDs. Furthermore, the statuses of these molecules can be tracked within the ideation database. The other database contains only molecules that have been successfully synthesized. The original request ID from the ideation database can be entered as a compound-level synonym in the synthesized molecule database. These molecules can then be requested for assays using the original synthesis request ID. The sample database is generally a separate entity that does not track chemical structure, but can be related back to the synthesized chemical database via a lot-level ID.

The request and fulfillment process for outsourced biological assays follows a very similar path. Below we map out this workflow.

Assay request and fulfillment statuses:

Ideation (creation of protocols for biological assays)
- Ontology based naming where possible
Assay/Protocol registration (registration/loading of protocol and result definitions into assay database)
- Consider data export and visualization when defining data formats and calculated results
Request (request assay)
- Electronic transfer of requested assay protocol
- Electronic transfer of sample list to be assayed
- Possible physical transfer of samples
  - Sample IDs need to be consistent
  - Plating may occur prior to or after transfer
    - Plate maps are needed
    - Tracking of stamping and dilutions occurs here
    - Sample database updated
- CRO acknowledges receipt of protocol and samples
Assay Progress
- Assignment to biologist
- Possible test runs
- Assay initiated
- Analysis/QC completed
- Assay Failed
  - Abandon
  - Re-run
- Assay completed
  - Electronic transfer of results and meta-data
Raw result data receipt
- QC of raw data
Result data transformation for loading
Result data loaded
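
The "Sample IDs need to be consistent" and "Result data transformation for loading" steps above are where inconsistent prefixes, digit counts, and hyphens usually surface. The sketch below shows one way to normalize batch IDs in CRO result rows and set aside anything that cannot be matched to the registration database. Everything specific here is an assumption for illustration: the house format `WIC-000123-01`, the column name `batch_id`, and the function names `normalize_batch_id` and `prepare_rows` are all hypothetical.

```python
import re

# Accepts variants such as "wic123-1", "WIC_000123.01", or " WIC 45/2".
# Groups: letters prefix, compound number, batch number.
ID_PATTERN = re.compile(r"^\s*([A-Za-z]+)[-_ ]?(\d+)[-_./ ](\d+)\s*$")


def normalize_batch_id(raw, prefix="WIC", compound_digits=6, batch_digits=2):
    """Return the ID in the assumed house format, or raise ValueError."""
    match = ID_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"unrecognized batch ID: {raw!r}")
    found_prefix, compound, batch = match.groups()
    if found_prefix.upper() != prefix:
        raise ValueError(f"unexpected prefix in batch ID: {raw!r}")
    # int() drops any existing zero padding before re-padding to a fixed width.
    return f"{prefix}-{int(compound):0{compound_digits}d}-{int(batch):0{batch_digits}d}"


def prepare_rows(rows, registered_ids):
    """Split CRO result rows into loadable rows and rejected (row, reason) pairs."""
    loadable, rejected = [], []
    for row in rows:
        try:
            batch_id = normalize_batch_id(row["batch_id"])
        except ValueError as err:
            rejected.append((row, str(err)))
            continue
        if batch_id not in registered_ids:
            rejected.append((row, "batch ID not found in registration database"))
            continue
        # Copy the row so the CRO's original data stays untouched.
        loadable.append({**row, "batch_id": batch_id})
    return loadable, rejected


print(normalize_batch_id("wic123-1"))  # WIC-000123-01
```

Returning the rejected rows with a reason, rather than silently dropping them, gives the QC step something concrete to send back to the CRO.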

The introduction of a rigid ontology for assay naming can help in a variety of ways. We recommend checking out the resources at http://bioassayontology.org/bioassayontology/ and https://www.bioassayexpress.com/. An ontology will also pay dividends if a data migration is needed. When setting up assays in your assay database, closely consider how to organize and group complex result data. Consider how results will be visualized, and format accordingly.

For the transfer, transformation, and loading of raw assay data, there are a variety of best practices. The maintenance of a rigid, ontology-based directory system (or database schema) is recommended. Data will need to be transformed prior to loading into your assay database. Push as much of the data transformation as possible back on your CRO. Be sure to QC the raw data rigorously prior to loading.

A robust sample management database system is recommended. The original sample is in pure, neat form, but for most testing, samples will be solubilized. Generally, the company and/or the CRO will maintain the solubilized samples in a standard format (tube type and concentration). Laboratories should consider purchasing a tube store as they accumulate thousands of samples. As compounds are picked from the tube store, they are often plated and diluted for assays. Consider tracking the lineage of samples throughout plating, dilutions, stamping, randomization, etc. All of these should be functions of the sample inventory database. While there are several commercially available sample databases, they often fall short of tracking request and fulfillment.

Both synthetic chemistry and biological assay CRO workflows share the need for a tool to track the request and fulfillment of orders. Generic tools generally fall short for these specific workflows. Workflow Informatics Corp. has experience implementing these types of custom solutions.
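
To make the sample lineage recommendation above more concrete, here is a minimal sketch of how a sample inventory might record parent/child relationships when a solubilized stock is diluted. It is an illustration under stated assumptions, not a description of any commercial sample database: the `Sample` and `SampleInventory` names, the field names, and the units (microliters, millimolar) are all hypothetical, and the starting point is an already solubilized stock rather than neat material.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    sample_id: str
    lot_id: str  # assumed link back to the synthesized-compound database
    volume_ul: float
    concentration_mm: float
    parent_id: Optional[str] = None  # None for the original stock


class SampleInventory:
    def __init__(self):
        self._samples = {}

    def add(self, sample: Sample) -> None:
        if sample.sample_id in self._samples:
            raise ValueError(f"duplicate sample ID: {sample.sample_id}")
        self._samples[sample.sample_id] = sample

    def get(self, sample_id: str) -> Sample:
        return self._samples[sample_id]

    def dilute(self, parent_id, child_id, transfer_ul, diluent_ul) -> Sample:
        """Create a child sample from part of a parent and update the parent's volume."""
        parent = self._samples[parent_id]
        if child_id in self._samples:
            raise ValueError(f"duplicate sample ID: {child_id}")
        if transfer_ul <= 0 or diluent_ul < 0:
            raise ValueError("transfer volume must be positive and diluent non-negative")
        if transfer_ul > parent.volume_ul:
            raise ValueError(f"not enough volume left in {parent_id}")
        parent.volume_ul -= transfer_ul
        total_ul = transfer_ul + diluent_ul
        child = Sample(
            sample_id=child_id,
            lot_id=parent.lot_id,
            volume_ul=total_ul,
            # C1 * V1 = C2 * V2
            concentration_mm=parent.concentration_mm * transfer_ul / total_ul,
            parent_id=parent_id,
        )
        self._samples[child_id] = child
        return child

    def lineage(self, sample_id):
        """Return sample IDs from the given sample back to the original stock."""
        chain = []
        current = sample_id
        while current is not None:
            chain.append(current)
            current = self._samples[current].parent_id
        return chain


inventory = SampleInventory()
inventory.add(Sample("S1", "LOT-1", volume_ul=100.0, concentration_mm=10.0))
inventory.dilute("S1", "S2", transfer_ul=10.0, diluent_ul=90.0)
print(inventory.lineage("S2"))  # ['S2', 'S1']
```

The same parent link could be extended to plate wells, so that stamping and serial dilutions remain traceable to the original lot.
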
This is certainly not a comprehensive overview of the CRO/biotech company workflow, but we hope that some of the weak links in the process have been exposed. The entire solution at almost every company involves multiple software packages. A solution that ameliorates these weak points will need to integrate with multiple software platforms via APIs.