Workflow Informatics 10-year retrospective (Part 1 of 3)

This year Workflow Informatics Corp. celebrates 10 years in business. From humble beginnings as a one-man shop doing Pipeline Pilot consulting, Workflow Informatics has grown steadily every year. As we grew, we expanded our capabilities, building expertise in laboratory informatics, robotics, molecular modeling, cloud computing, DNA encoded libraries, bioinformatics, and most recently artificial intelligence and machine learning (AI/ML). Throughout the years, we have seen many common themes in our research informatics consulting work. In this blog series, we will attempt a retrospective analysis of the evolution of research informatics over the past decade. This is by no means an all-encompassing analysis, just a compilation of some of the common themes, in roughly chronological sequence:

    1. Laboratory Operations
    2. CRO data management
    3. Data warehouses
    4. Application integration
    5. Automation
    6. Data standards
    7. Cloud computing & software
    8. The rise of AI

Chapter 1: Laboratory Operations

Initially, Workflow Informatics concentrated on cheminformatics, primarily around medicinal chemistry, QSAR, predictive modeling, and data visualization challenges. We began by working with many start-up drug discovery companies, since few of these companies had dedicated research informatics personnel. This is understandable, as most start-ups begin with heads of Biology and Chemistry and a specific technological approach. They typically take 6-12 months to begin laboratory operations, so they do not start with much data.

Once laboratory operations commence and data starts flowing, the data sets usually end up in spreadsheets and documents. Eventually, these flat files become insufficient and ineffective data storage mechanisms: they are not readily searchable, many versions and copies of the same documents proliferate, they are difficult to analyze, and errors are often introduced. Entity registration, sample inventory, and request & fulfillment are the workflows most affected by this lack of proper data management. We have found that there are many excellent systems for entity registration and sample inventory, but these tools generally lack effective request and fulfillment tracking.

The standard request & fulfillment tracking process is familiar to most drug discovery scientists: 

    • A decision is made to test specific samples against specific assays. 

    • These samples are requested, prepared, and delivered to the assay scientist.

    • Samples are tested, and results are reported. 

Early-stage drug discovery organizations often have a hard time tracking everything from the request to the reporting of data, because this is typically handled by emails and spreadsheets. When mistakes happen, they are difficult to track down: 

    • Which samples were ordered for testing?

    • When was the order for these samples made?

    • Who ordered them?

    • When were they tested?

    • Against what assay were they tested?

    • What were the conditions of that assay? 

    • When was the order fulfilled?

Workflow Informatics' solution to this problem was Workflow Request and Tracking™ (WRT). This software offers a simple sample inventory system coupled with request and fulfillment workflows. Underlying the web-based user interface are relational database tables that capture all the metadata surrounding sample request and fulfillment workflows.
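To make the idea concrete, here is a minimal sketch of what a relational request-and-fulfillment tracker looks like, using SQLite from Python. The table and column names are purely illustrative assumptions for this post; they do not reflect WRT's actual database design.

```python
import sqlite3

# Hypothetical, simplified schema for a request & fulfillment tracker.
# Every request row records who asked for what, against which assay,
# and when -- answering the audit questions listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    location  TEXT
);
CREATE TABLE request (
    request_id   INTEGER PRIMARY KEY,
    sample_id    INTEGER NOT NULL REFERENCES sample(sample_id),
    assay_name   TEXT NOT NULL,
    requested_by TEXT NOT NULL,
    requested_at TEXT NOT NULL,  -- ISO 8601 timestamp
    status       TEXT NOT NULL DEFAULT 'requested'
                 CHECK (status IN ('requested', 'in_progress', 'fulfilled'))
);
""")

# Record a request, then mark it fulfilled -- each step leaves an audit trail.
conn.execute("INSERT INTO sample (name, location) VALUES ('CMPD-001-A', 'Freezer 2')")
conn.execute(
    "INSERT INTO request (sample_id, assay_name, requested_by, requested_at) "
    "VALUES (1, 'Kinase IC50', 'jsmith', '2024-05-01T09:00:00')"
)
conn.execute("UPDATE request SET status = 'fulfilled' WHERE request_id = 1")

row = conn.execute(
    "SELECT s.name, r.assay_name, r.requested_by, r.status "
    "FROM request r JOIN sample s ON s.sample_id = r.sample_id"
).fetchone()
print(row)  # ('CMPD-001-A', 'Kinase IC50', 'jsmith', 'fulfilled')
```

Because the metadata lives in queryable tables rather than emails and spreadsheets, questions like "who ordered this sample, and when?" become one-line queries.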

Chapter 2: CRO Data

The introduction of offshore CROs into typical drug discovery workflows has made data management even more complicated. Ten years ago, the use of CROs by start-ups was beginning to gain momentum, and very rapidly took hold as standard operating procedure. Over the past several years we have worked with multiple “virtual” drug discovery companies, for whom all laboratory operations were performed by external CROs. This has added another layer of complexity to the drug discovery process. 

There are two primary areas of research that are outsourced: synthetic chemistry and biological assays. The most commonly outsourced assays are ADME/PK. 

Synthetic Chemistry Outsourcing Workflow

Let's walk through a typical example of the request and fulfillment of outsourced synthetic chemistry. Generally, the process begins with ideation and identification of potential synthetic targets by a chemist at the virtual pharma company. Each proposed molecule is registered in a virtual database, assigned a virtual ID, and tracked by status: proposed, requested, synthesized, or abandoned. When it is determined that a proposed molecule should be synthesized, a request is submitted to the CRO. The molecule and a proposed synthesis must be delivered to the CRO, often via a file server system in combination with email. When a molecule is successfully synthesized, its status is updated to synthesized, and it is registered in a separate chemical registration (chemreg) database that contains only synthesized molecules. The virtual database identifier serves as a molecular synonym in chemreg. In addition to molecule-level information, chemreg also tracks batch- and sample-level information: batches are defined as synthetic instances of a molecule, and samples are defined as each containerization of a batch (or of another sample). For more detailed information on molecule, batch, and sample levels, see our BioChemUDM publication. This method is adequate if the process is adhered to strictly, but it often breaks down over time. As an alternative, we would recommend a request and fulfillment application (like WRT), possibly in combination with an ELN system. WRT is backed by a database that can track each stage of request and fulfillment, along with its metadata, providing accountability and metrics for the process.
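The molecule / batch / sample hierarchy described above can be sketched as a small object model. This is an illustrative assumption based on the text of this post, not the actual chemreg or BioChemUDM schema; class names, IDs, and the status vocabulary are hypothetical.

```python
from dataclasses import dataclass, field

# Status vocabulary for a proposed molecule, as described in the text.
STATUSES = {"proposed", "requested", "synthesized", "abandoned"}

@dataclass
class Sample:
    """A containerization of a batch (or of another sample)."""
    sample_id: str
    amount_mg: float

@dataclass
class Batch:
    """A synthetic instance of a molecule."""
    batch_id: str
    samples: list = field(default_factory=list)

@dataclass
class Molecule:
    virtual_id: str            # ID assigned at registration in the virtual DB
    status: str = "proposed"
    batches: list = field(default_factory=list)

    def set_status(self, status: str) -> None:
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.status = status

# Lifecycle walkthrough: propose, request, synthesize, then register a
# batch and a sample against the molecule.
mol = Molecule(virtual_id="VIRT-0001")
mol.set_status("requested")
mol.set_status("synthesized")
batch = Batch(batch_id="VIRT-0001-B1")
batch.samples.append(Sample(sample_id="VIRT-0001-B1-S1", amount_mg=25.0))
mol.batches.append(batch)
print(mol.status, len(mol.batches), len(batch.samples))  # synthesized 1 1
```

The point of the hierarchy is that results and inventory attach at the right level: purity belongs to a batch, location and amount to a sample, and the structure itself to the molecule.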

Assay Outsourcing Workflows

Most CROs deliver assay results in multi-sheet Excel workbooks attached to an email. As discussed, this is not an ideal medium for the delivery of data. Furthermore, data sets arrive in a wide variety of formats, with data often pivoted or split into multiple tables based on experimental conditions (e.g., species, time). Data types are not always consistent across result fields (e.g., "N/A" or "ND" in a numeric field). In short, bad data practices abound. Over the years, we have built a great many parsers for these types of results using Pipeline Pilot, KNIME, and Python. Parsers can comb through the different sheets of a workbook, extracting important data fields, checking for adherence to data types, unpivoting, etc., and ultimately delivering the data in a format that is easily loaded into assay databases. In many cases, we have leveraged the APIs of various assay data platforms (e.g., CDD Vault) to automate the upload of data. The lack of data standards seen across these ADME results is also addressed in our BioChemUDM publication. Workflow Informatics has implemented the BioChemUDM at several drug discovery organizations, and we believe the implementation of these standards greatly enhances the ability of organizations to perform data analysis and data migrations.
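The core of such a parser is straightforward: coerce sentinel strings in numeric fields and unpivot condition columns into long-form records. Here is a minimal standard-library sketch; real parsers read the Excel sheets directly, but for illustration the sheet is mocked as a list of rows, and the column names are hypothetical.

```python
# Sentinel strings CROs often place in numeric result fields.
MISSING = {"N/A", "ND", ""}

def coerce(value):
    """Turn CRO sentinel strings into None and numeric strings into floats."""
    if isinstance(value, str) and value.strip() in MISSING:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return value  # leave genuinely non-numeric values untouched

def unpivot(header, rows, id_cols):
    """Unpivot a wide table (one column per condition) into long records."""
    records = []
    for row in rows:
        base = dict(zip(header[:id_cols], row[:id_cols]))
        for col, value in zip(header[id_cols:], row[id_cols:]):
            records.append({**base, "condition": col, "result": coerce(value)})
    return records

# Example: half-life results pivoted by species, as often delivered.
header = ["Compound", "Human t1/2 (h)", "Rat t1/2 (h)"]
rows = [["CMPD-001", "2.4", "N/A"],
        ["CMPD-002", "ND", "0.9"]]
long_form = unpivot(header, rows, id_cols=1)
print(long_form[0])
# {'Compound': 'CMPD-001', 'condition': 'Human t1/2 (h)', 'result': 2.4}
```

Once the data is in this long, type-clean form, loading it into an assay database (or pushing it through a platform API) becomes a routine step rather than a manual cleanup exercise.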

Managing CROs and the data they produce remains a challenge for most organizations. We have a vision for addressing this, and we have begun implementing point solutions at various organizations. Organizations need a mechanism for ingesting CRO data that enforces data standards and automates the loading of data into assay data repositories. We can provide these types of solutions.