As a research informatics consultant, I have the pleasure of observing trends across both small and large customers. One undeniable trend is the outsourcing of synthetic chemistry and biological assays to contract research organizations (CROs). In many cases, multiple outsourcing partners provide similar services. Not surprisingly, the transfer and management of scientific data to and from the CROs can be challenging. Here, I will try to summarize the workflows and some of the biggest challenges.
First, let’s examine the data workflows related to synthetic chemistry. Before synthesis begins, we must identify target chemical structures. Medicinal chemists generate a virtual library of molecules, which must be stored in some sort of chemical database. Next, a decision must be made as to which structures will be synthesized and by whom. This may be a combination of in-house chemistry and one or more CROs. In most cases, a suggested synthetic plan is devised and transferred to the CROs. This represents a challenge, because most chemical and reaction databases were not designed to store or manage experimentals. Chemical electronic lab notebooks (ELNs) come closest to a solution for managing this data, but they lack some important tracking capabilities. Companies need to track the request and fulfillment of synthetic targets. There may be multiple IDs to track at this point as well, such as the request ID from the virtual library database and the external batch ID from the CRO. It can be very difficult to manage ideation, request, and chemical registration in a single database. In our experience, compounds that are synthesized and delivered are better off being registered in a distinct chemical database.
Next, there is the creation of samples from synthesized molecules. Often, samples of samples are created and transferred to the entity that will carry out the biological assays. This introduces the need for a sample database, with information about the current volume, concentration, molecular identity, and lineage of a sample. In addition, there must be a mechanism for the request and fulfillment of samples, and potentially plate formats with dilution schemes. CROs will often maintain their own sample inventory. It’s not uncommon to find duplication of sample inventories between the CRO and the requesting company. It is also common for companies to track inventory in a spreadsheet for as long as possible. This usually becomes untenable at around 3-5K compounds.
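To make the idea of sample lineage concrete, here is a minimal sketch of a sample record in Python. The `Sample` class, its field names, and the `aliquot` and `lineage` helpers are all illustrative assumptions, not the schema of any particular inventory product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One physical sample; every field name here is an illustrative assumption."""
    sample_id: str
    batch_id: str                    # molecular identity, via the registered batch
    volume_ul: float                 # current volume in microliters
    concentration_mm: float          # concentration in millimolar
    parent_id: Optional[str] = None  # lineage: the sample this one was drawn from


def aliquot(parent: Sample, new_id: str, volume_ul: float) -> Sample:
    """Create a child sample and decrement the parent's remaining volume."""
    if volume_ul <= 0 or volume_ul > parent.volume_ul:
        raise ValueError("requested volume is not available in the parent sample")
    parent.volume_ul -= volume_ul
    return Sample(new_id, parent.batch_id, volume_ul,
                  parent.concentration_mm, parent_id=parent.sample_id)


def lineage(sample: Sample, inventory: Dict[str, Sample]) -> List[str]:
    """Walk parent links back to the original stock sample."""
    chain = [sample.sample_id]
    while sample.parent_id is not None:
        sample = inventory[sample.parent_id]
        chain.append(sample.sample_id)
    return chain
```

Keeping the parent link on every record is what lets a "sample of a sample" be traced back to the originally synthesized batch, something a flat spreadsheet rarely captures.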
The next issue is the transfer, preparation, and storage of assay data. Initial assay results almost universally find their way into Excel spreadsheets, with some sort of molecular batch ID associated with the biological results. Expect numerous conversations about prefixes, the number of digits in compound IDs, and whether to attach the batch ID with a hyphen. Beyond the identifiers, there is a general lack of standardization for the assay data itself (both in format and nomenclature), so the data need to be transformed prior to loading into an assay database. Once data are loaded, batch IDs need to relate back to the chemical registration database to provide chemical awareness. Finally, SAR studies and modeling are applied to the results in order to guide the selection of new synthetic targets, which brings us back to the start of the iterative workflow.
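As a rough illustration of the identifier problem, the sketch below normalizes the variants that tend to show up in spreadsheets into one canonical form. The canonical format (`PREFIX-000123-01`, with a six-digit compound number and a two-digit batch number) and the function name `normalize_id` are assumptions chosen for the example; each organization will have its own convention.

```python
import re

# Assumed input shape: letters, an optional separator, a compound number,
# then optionally a separator and a batch number. Leading zeros are ignored.
_ID_PATTERN = re.compile(
    r"^\s*([A-Za-z]+)[-_ ]?0*(\d+)(?:[-_/. ]0*(\d+))?\s*$"
)


def normalize_id(raw, compound_digits=6, batch_digits=2):
    """Return (compound_id, batch_id) in a hypothetical canonical format.

    batch_id is None when the raw identifier carries no batch suffix.
    """
    match = _ID_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"unrecognized identifier: {raw!r}")
    prefix, compound, batch = match.groups()
    compound_id = f"{prefix.upper()}-{int(compound):0{compound_digits}d}"
    if batch is None:
        return compound_id, None
    return compound_id, f"{compound_id}-{int(batch):0{batch_digits}d}"
```

For example, `normalize_id("cmp123-1")` and `normalize_id("CMP_0123/01")` both resolve to the same compound and batch, which is what allows assay results to join back to the chemical registration database.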
Even within a single organization, it can be difficult to manage this workflow, and the transfer of information to and from multiple CROs adds a layer of complexity. One of the weakest links in this workflow is the transfer of synthetic schemes, the updating of these schemes, and reporting on the success and failure of syntheses. Typically, some sort of shared drive platform (e.g. Egnyte, SharePoint) is used to transfer synthetic schemes. Tracking the progress of synthetic targets often involves annotating individual molecules in the virtual library database (e.g. ordered, in progress, failed, completed). Email may also play a central role in request, fulfillment, and tracking. As actual batches are synthesized, they are often recorded in the chemical registration database, with an identifier to track them back to the virtual library database. QC data are often uploaded to the chemical registration database as well. Much of this data may also be duplicated in ELNs. The result is that scientists must consult a variety of disconnected data sources in order to get a full understanding of a project’s progression.
Another weak link is the transfer of biological assay results. CROs often develop their own spreadsheet templates for reporting results. Invariably, these spreadsheet data need to be transformed before loading into a biological assay result database at the receiving company. We often see metadata (e.g. study IDs, report dates) incorporated into separate spreadsheet tabs or into the file name. Transforming the data into a loadable format is generally a tedious task. Furthermore, “standardized” spreadsheet formats often evolve, to the point that they are no longer standardized enough to be consumed by automated methods.
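A minimal sketch of such a transformation step follows. It assumes the spreadsheet rows have already been read into dictionaries (for example with `csv.DictReader`), and everything specific in it is hypothetical: the CRO names, the column headers, the canonical field names, and the file name convention `<study id>_<YYYY-MM-DD>_results.<ext>`.

```python
import re
from datetime import date

# Hypothetical per-CRO column headers mapped to canonical field names.
COLUMN_MAP = {
    "cro_a": {"Compound": "batch_id", "IC50 (uM)": "ic50_um"},
    "cro_b": {"Sample ID": "batch_id", "IC50_uM": "ic50_um"},
}

# Hypothetical file name convention: <study id>_<YYYY-MM-DD>_results.<ext>
_FILENAME = re.compile(
    r"^(?P<study>[A-Za-z0-9]+)_(?P<day>\d{4}-\d{2}-\d{2})_results\.\w+$"
)


def parse_filename(name):
    """Extract study ID and report date embedded in the file name."""
    m = _FILENAME.match(name)
    if m is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    return {"study_id": m["study"], "report_date": date.fromisoformat(m["day"])}


def to_canonical(rows, cro, filename):
    """Rename CRO-specific columns and attach file-name metadata to each row."""
    mapping = COLUMN_MAP[cro]
    meta = parse_filename(filename)
    records = []
    for row in rows:
        # Fail loudly when a "standardized" template has drifted.
        missing = [col for col in mapping if col not in row]
        if missing:
            raise KeyError(f"template drift, missing columns: {missing}")
        record = {canon: row[src] for src, canon in mapping.items()}
        record["ic50_um"] = float(record["ic50_um"])  # raises on non-numeric text
        record.update(meta)
        records.append(record)
    return records
```

The explicit check for missing columns is the important design choice here: when a template evolves, the load should stop with a clear message rather than silently produce incomplete records.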
Currently, there is no single software solution that manages this entire workflow effectively. Below, we map out the workflow.
Chemical synthesis request and fulfillment statuses:
- Ideation (creation/enumeration of synthesis candidates)
- Chemical registration (loading of structure into chemical registration database)
- Selection (selection of representative or specific synthetic targets)
- Request (request synthesis)
  - Electronic transfer of requested structure and experimental
  - CRO acknowledges receipt of request
- Synthetic progress
  - Assignment of synthesis to chemist(s)
  - Synthesis initiated (in progress)
  - Analysis/QC completed
  - Synthesis failed
    - Abandon
    - Initiate new request
  - Synthesis completed
- Transfer/storage of physical material
  - Transfer to another department in existing CRO
  - Transfer to another CRO
  - Transfer to requesting company
- Receipt (electronic acknowledgement that the synthesis is complete)
- Chemical registration (loading of structure into chemical registration database)
- Samples prepared (solid and/or solution)
- Samples loaded into sample database
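One way to keep the synthesis statuses above consistent across systems is to treat them as a small state machine. The sketch below is only an illustration: the `Status` names are a condensed, assumed subset of the list, and the allowed transitions reflect one plausible reading of it rather than a prescribed process.

```python
from enum import Enum


class Status(Enum):
    IDEATED = "ideated"
    SELECTED = "selected"
    REQUESTED = "requested"
    ACKNOWLEDGED = "acknowledged"
    IN_PROGRESS = "in progress"
    FAILED = "failed"
    ABANDONED = "abandoned"
    COMPLETED = "completed"
    TRANSFERRED = "transferred"
    RECEIVED = "received"
    REGISTERED = "registered"


# Assumed transition table; statuses absent from the keys are terminal.
ALLOWED = {
    Status.IDEATED: {Status.SELECTED},
    Status.SELECTED: {Status.REQUESTED},
    Status.REQUESTED: {Status.ACKNOWLEDGED},
    Status.ACKNOWLEDGED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.FAILED, Status.COMPLETED},
    Status.FAILED: {Status.ABANDONED, Status.REQUESTED},  # abandon or re-request
    Status.COMPLETED: {Status.TRANSFERRED},
    Status.TRANSFERRED: {Status.TRANSFERRED, Status.RECEIVED},  # CRO-to-CRO hops
    Status.RECEIVED: {Status.REGISTERED},
}


def advance(history, new_status):
    """Return a new history with new_status appended, or raise if not allowed."""
    current = history[-1]
    if new_status not in ALLOWED.get(current, set()):
        raise ValueError(
            f"cannot move from {current.value!r} to {new_status.value!r}"
        )
    return history + [new_status]
```

Storing the full history rather than only the latest status preserves the record of failed attempts and re-requests, which is exactly the information that tends to get lost in email threads and shared drives.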
Biological assay request and fulfillment statuses:
- Ideation (creation of protocols for biological assays)
  - Ontology-based naming where possible
- Assay/Protocol registration (registration/loading of protocol and result definitions into assay database)
  - Consider data export and visualization when defining data formats and calculated results
- Request (request assay)
  - Electronic transfer of requested assay protocol
  - Electronic transfer of sample list to be assayed
  - Possible physical transfer of samples
    - Sample IDs need to be consistent
    - Plating may occur prior to or after transfer
    - Plate maps are needed
    - Tracking of stamping and dilutions occurs here
    - Sample database updated
  - CRO acknowledges receipt of protocol and samples
- Assay progress
  - Assignment to biologist
  - Possible test runs
  - Assay initiated
  - Analysis/QC completed
  - Assay failed
    - Abandon
    - Re-run
  - Assay completed
- Electronic transfer of results and metadata
  - Raw result data receipt
  - QC of raw data
  - Result data transformation for loading
  - Result data loaded
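The plate map and dilution tracking steps in the list above lend themselves to a small worked example. The following sketch computes a serial dilution series and lays it out along one plate row. The function names, the default 3-fold and 10-point scheme, the micromolar units, and the `A1`-style well labels are all assumptions for illustration.

```python
def dilution_series(top_conc_um, fold, points):
    """Concentrations for a serial dilution, starting at top_conc_um.

    Each point is the previous one divided by `fold`.
    """
    if top_conc_um <= 0 or fold <= 1 or points < 1:
        raise ValueError("need top_conc_um > 0, fold > 1, and points >= 1")
    return [top_conc_um / fold ** i for i in range(points)]


def plate_row_map(sample_id, row, top_conc_um, fold=3.0, points=10, first_col=1):
    """Map well labels (e.g. 'A1') to (sample_id, concentration) along one row."""
    series = dilution_series(top_conc_um, fold, points)
    return {
        f"{row}{first_col + i}": (sample_id, conc)
        for i, conc in enumerate(series)
    }
```

For instance, `plate_row_map("S1", "A", 10.0, fold=2.0, points=3)` places 10, 5, and 2.5 µM of sample `S1` in wells `A1` to `A3`. Recording the map alongside the sample ID is what allows each raw well value to be tied back to a concentration and a registered batch when results come back.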