Original article was published by Sahar Mor on Artificial Intelligence on Medium
Screening for Ethics at Scale
How frontrunners such as OpenAI should leverage human-in-the-loop to screen for ethical use at scale
Last June, OpenAI released GPT-3, the most powerful language model created to date, and it quickly became the topic of much discussion among developers, researchers, and entrepreneurs. Its zero- and one-shot learning capabilities blew people’s minds, with GPT-3-powered applications going viral on Twitter every other day.
The API was released in an era when polarization and bias have never been more intense, built on technology that is powerful, scalable, and potentially dangerous: imagine a fake-news generator or a social-media bullying bot powered by the human-like GPT-3.
Understanding the harmful potential of its API technology, OpenAI took a unique go-to-market approach, strictly limiting access to a small number of vetted developers. By doing so, it became one of the first companies to voluntarily forfeit short-term profits in favor of being socially responsible.
As our understanding of AI evolves, other companies developing advanced AI technologies such as ScaleAI might follow a similar path.
The combination of a for-profit company, a powerful technology, and the decision to screen for access is novel, raising several questions:
- What are the challenges of scaling a human-in-the-loop workflow? What are the specifics when screening for ethical usage?
- How can you scale a subjective screening process which is based on human intuition and OpenAI’s guidelines?
- How can you scale these operations to cater to an ever-growing customer base?
- What are the implications of a screening process that doesn’t scale well?
In this article I’ll discuss these questions and explore different methods to mitigate negative outcomes, using OpenAI as an example.
OpenAI and the GPT-3 boom
OpenAI took precautionary measures with the release of GPT-3, screening for potential misuse and revoking access when it suspected unfair or risky usage. ‘Risky’, as defined by OpenAI, means malicious use of the API that causes physical, emotional, or psychological harm to people.
To get API access, one needs to apply via this form. Waiting times can stretch indefinitely; some developers who applied in late June are still waiting for a response.
Once you’ve built an app that is ready for production, you’ll be required to fill out another form. It might take up to 7 business days for OpenAI’s team to review a request. The review evaluates, among other axes:
- How risky is the app?
- How do you plan to address potential misuse?
- Who are the end-users of your app?
Based on the information provided, your app will be put into one of three buckets: Green Zone, Yellow Zone, or Red Zone, each implicitly conveying whether your app has been approved.
After your app has been approved, you’re free to go.
Given that OpenAI’s API serves millions of calls per day, it remains unclear whether, and to what extent, apps are monitored in production.
Pre-onboarding screening is already happening in other domains. In FinTech, the Know Your Customer (KYC) and Know Your Business (KYB) processes aim to fight money laundering, tax crimes, and similar abuses. These are mandatory processes imposed by regulators, and it is common practice for FinTech companies to outsource them to third parties.
In the absence of clear regulation, OpenAI has decided to build its own screening process, which on one hand provides tighter control and flexibility, yet on the other forces it to spend resources on areas that are not part of its core technology.
Automating the manual screening for ethics
The task at the heart of this multilayered filtering process is answering:
“According to our guidelines, is this a legitimate use of our API?”
Production-approved apps introduce an even greater challenge:
“According to our guidelines, is this *still* a legitimate use of our API?”
The current human-in-the-loop screening process comes across as semi-manual at best, with humans making decisions influenced by judgment and subjective interpretations of OpenAI’s guidelines. This introduces several challenges I’ve repeatedly witnessed when building and observing human-in-the-loop products, and which are key when executing tasks at scale:
Making the right decision, i.e. accuracy
Accuracy in OpenAI’s context is the ability to correctly classify a developer’s app as Green, Yellow, or Red, making it a three-class classification problem.
When attempting to make the right decision, one has to take the following into consideration:
- Implications of making a mistake. Approving a malicious app can have severe consequences. Imagine a GPT-3-powered app that provides medical advice based on user input.
- The very definition of “what’s a Green or a Red application?” is highly subjective, and that’s bad news for automation: when humans can’t unanimously agree on a result, there are multiple nuances at play, requiring abundant data for a machine to generalize accurately.
- Fraud — imagine a developer applying with a Green-like application, yet using it for a different purpose on production.
- Even if 99.99% accuracy were reached at the time of review, how would one deal with cases where the app changes post-deployment? This can happen either when the prompts being used change over time (similar to concept drift, call it prompt drift) or when the developer intentionally changes scope without applying again.
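Prompt drift of this kind could be surfaced with a simple ongoing check. Here is a minimal sketch that compares the token vocabulary of an app’s recent prompts against the baseline submitted at review time; the function and threshold are illustrative assumptions, not OpenAI’s actual tooling:

```python
def jaccard_drift(baseline_prompts, recent_prompts):
    """Estimate prompt drift as 1 minus the Jaccard similarity between
    the token vocabularies of a baseline window and a recent window."""
    base_vocab = {tok for p in baseline_prompts for tok in p.lower().split()}
    recent_vocab = {tok for p in recent_prompts for tok in p.lower().split()}
    if not base_vocab and not recent_vocab:
        return 0.0
    union = base_vocab | recent_vocab
    return 1.0 - len(base_vocab & recent_vocab) / len(union)

# A role-playing app whose recent prompts have turned medical
baseline = ["you are a wizard in a castle", "describe the dragon battle"]
recent = ["what dose of this medicine is safe", "diagnose my chest pain"]
drift = jaccard_drift(baseline, recent)  # close to 1.0: flag for review
```

A real system would likely compare embedding distributions rather than raw token overlap, but even this crude distance would catch a role-playing app drifting into medical territory.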
Low accuracy can lead to socially harmful apps and, among other things, bad PR, both externally and internally: no one wants to work for a company that, for example, fuels fake news at scale.
Instead, you’d rather take a hit on recall, accepting less revenue and fewer customers in favor of higher accuracy.
Making a cost-efficient decision, i.e. costs
Having a human review each application on a case-by-case basis means an operational cost that grows as the company scales:
- Execution per task – in OpenAI’s case, that means someone who can follow the guidelines and make a judgment call based on a subjective understanding of what an approved app looks like. Most probably an internal employee, given the need for secrecy and the pace of policy adjustments in GPT-3’s early stage.
- Escalation and a consensus mechanism – when a human reviewer isn’t confident in making a decision, the same task should be shown to another same-level reviewer or, if needed, escalated to an ‘expert’. Both incur extra costs.
- Hiring and training of human reviewers.
Different types of tasks require different types of qualifications, e.g. extracting information from documents will mainly require a pair of eyes, while the classification of benign tumors in CT scans requires deep radiological knowledge. The more proficiency required, the higher the operational costs are.
For example, assuming a fully-loaded rate of $25/hour and two hours per application, the operational cost of reviewing a single application would be $50.
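This cost structure can be made concrete with a back-of-the-envelope model. A minimal sketch, where the rates, review time, and escalation fraction are illustrative assumptions:

```python
def review_cost(hourly_rate, hours_per_review, escalation_rate=0.0, expert_rate=None):
    """Expected operational cost of reviewing one application.

    escalation_rate: fraction of reviews escalated to a second reviewer.
    expert_rate: the escalation reviewer's hourly rate (defaults to the base rate).
    """
    base = hourly_rate * hours_per_review
    expert = (expert_rate if expert_rate is not None else hourly_rate) * hours_per_review
    return base + escalation_rate * expert

# The article's example: a $25/hour fully-loaded reviewer, two hours per app
cost = review_cost(25, 2)  # -> 50

# With 20% of cases escalated to a $50/hour expert, the expected cost rises
cost_with_escalation = review_cost(25, 2, escalation_rate=0.2, expert_rate=50)  # -> 70.0
```

Even this toy model shows how escalation paths quietly inflate per-task costs as volume grows.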
Making a timely decision, i.e. turnaround time
Turnaround time is how long a developer waits until an application is reviewed. This was one of the most prominent characteristics of GPT-3’s recent launch: people doing everything possible to expedite their turn in the long GPT-3 access queue.
Unlike cloud services, humans are hard to scale at the push of a button, so turnaround time can become a scaling issue, especially if the same application has to be reviewed by multiple human reviewers due to its complexity.
In a world where it still takes days or weeks before an API application is approved, will developers turn to other solutions as competition and research catch up? It’s hard to imagine now, with OpenAI (and Microsoft) being the market leader, but that might change soon.
A recipe for scaling
Gradually automating a human-heavy process is not the outcome of a single initiative. Below, I outline a machine-human hybrid approach to screening for ethical use at scale.
#1 Cybersecurity to the rescue
During my time in cybersecurity, one of the most successful techniques for detecting malware was anomaly-based detection. Anomaly detection first trains the system on a normalized baseline, then compares new activities against it. Once an abnormal event appears, an alert is fired; for example, a user logging into their office computer at 2 am.
Unlike other detection methods, it scales well because it defines ‘malicious’ without having to manually characterize every entry in the ever-growing catalog of cyberattacks.
We can apply the same technique to detect malicious API usage. Here are a few examples:
- Having an ongoing app-classification engine based on prompts and completions, looking for discrepancies between what was submitted for review and the app’s actual use. Differences can occur due to concept drift or an intentional change of the app’s purpose. For example, the recurring monitoring logic could surface an app submitted as a role-playing game that is now starting to show medical topics. This is suspicious.
- Look for significant changes in the model’s configuration that might alter the app’s behavior. In OpenAI’s example, those could be temperature, completion length, etc.
- Monitor and alert about abnormal usage patterns such as a sudden spike in daily tokens used.
Training for a normalized baseline can be done within a monitored sandbox environment, where the applicant runs the app in an environment that is as close as possible to a production setting.
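The third example, alerting on abnormal usage patterns, can be sketched with a simple z-score over an app’s daily token counts. The baseline numbers and threshold below are illustrative:

```python
import statistics

def is_usage_anomaly(daily_tokens, today_tokens, z_threshold=3.0):
    """Flag today's token usage if it deviates from the app's baseline
    by more than z_threshold standard deviations."""
    mean = statistics.mean(daily_tokens)
    stdev = statistics.pstdev(daily_tokens)  # population stdev of the baseline
    if stdev == 0:
        return today_tokens != mean
    return abs(today_tokens - mean) / stdev > z_threshold

# A week of sandbox-observed baseline usage for one app
baseline = [10_000, 12_000, 11_500, 9_800, 10_700, 11_200, 10_400]

spike_flagged = is_usage_anomaly(baseline, 95_000)        # sudden spike: True
normal_ok = not is_usage_anomaly(baseline, 11_000)        # ordinary day: True
```

A production monitor would use a rolling window and per-app seasonality, but the principle, baseline first, compare second, is the same one that makes anomaly detection scale.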
#2 Policy detection
Rule-based policy enforcement is another safety net. Here are a couple of examples:
- A classifier to detect whether an app is prone to harmful bias and fairness issues based on its application form. The idea is to codify the same logic humans apply when reading an application that states “this is a bot to generate fake news” and concluding it shouldn’t be approved.
Feeding the application’s input fields as features to an ML algorithm could yield a strong classifier, given the thousands of already-reviewed applications.
- A classifier flagging apps that don’t follow the API’s policy. In OpenAI’s case, that could mean surfacing for further inspection apps discussing risky topics, such as suicide and politics, or containing toxic language. Research in these areas is picking up as language models become more robust and capable (Gehman et al. 2020, Kurita et al. 2019).
Both classifiers would run as part of the above-mentioned sandbox environment, as well as the ongoing monitoring engine to surface potentially-malicious apps.
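As a rough illustration of the second classifier, a rule-based first pass could flag completions touching risky topics for human inspection. The lexicon and categories below are hypothetical; a production system would use a trained toxicity or topic classifier rather than a keyword list:

```python
# Hypothetical risky-topic lexicon mapping trigger terms to policy categories.
RISKY_TERMS = {
    "suicide": "self-harm",
    "overdose": "self-harm",
    "election": "politics",
    "ballot": "politics",
}

def flag_for_review(completion_text):
    """Return the set of policy categories a generated completion touches,
    so flagged apps can be surfaced for human inspection."""
    tokens = completion_text.lower().split()
    return {category for term, category in RISKY_TERMS.items() if term in tokens}

flags = flag_for_review("Here is some advice about the election")  # {'politics'}
```

The keyword approach is brittle on its own, but as one layer in the sandbox and monitoring pipeline it cheaply routes the obvious cases to the human queue.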
#3 Alerts dashboard
Each alert should be processed by a team that is the equivalent of a Security Operations Center (SOC), receiving potential-misuse alerts and investigating them further.
#4 Developers feedback-loop
At my previous startup, we automatically extracted information from invoices, yet our precision for some fields was below 95%. Allowing our customers to flag wrongly extracted fields, and feeding those back into our ML pipeline, led to a significant increase in precision within weeks.
Having more examples of harmful, biased, or wrongly generated text would help fine-tune OpenAI’s API, as was recently reiterated in their paper Learning to Summarize with Human Feedback.
Capturing developers’ feedback as part of a streamlined API would enable building a massive dataset of unsafe generated completions, fast. This dataset can then be used to retrain GPT-3 or to train a robust safety detector.
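Such a feedback loop could start as a thin endpoint that accumulates flagged completions into a retraining dataset. A minimal sketch; the field names and JSONL export format are illustrative assumptions, not OpenAI’s actual API:

```python
import json
import time

class FeedbackStore:
    """Collect developer-flagged completions into a dataset that can later
    be used for fine-tuning or for training a safety detector."""

    def __init__(self):
        self.records = []

    def flag_completion(self, app_id, prompt, completion, reason):
        # 'reason' might be a closed vocabulary: "toxic", "biased", "wrong"
        self.records.append({
            "app_id": app_id,
            "prompt": prompt,
            "completion": completion,
            "reason": reason,
            "timestamp": time.time(),
        })

    def export_jsonl(self):
        # One JSON object per line, a common format for fine-tuning datasets
        return "\n".join(json.dumps(r) for r in self.records)

store = FeedbackStore()
store.flag_completion("app-123", "Summarize this article", "…", "biased")
```

The key design choice is making flagging a first-class, low-friction API call, so the dataset grows as a side effect of normal developer usage.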
#5 Automated qualification
Building a qualification questionnaire, with as many closed questions as possible, will allow a rule-based logic to automatically define if a given application is approved. This is a similar approach to what Superhuman did to qualify and prioritize prospects.
Besides reducing the human-reviewing load, this leads to a better UX, as developers get immediate feedback and, potentially, access.
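A rule-based qualification pass over closed questions might look like the sketch below. The form fields and rules are purely illustrative, not OpenAI’s actual criteria:

```python
def auto_qualify(answers):
    """Rule-based triage over a closed-question application form.
    Returns 'green', 'yellow', or 'red', mirroring the three zones."""
    # Unmoderated public generation is the riskiest pattern: reject outright
    if answers.get("generates_public_content") and not answers.get("human_review_step"):
        return "red"
    # Sensitive domains always go to a human reviewer
    if answers.get("domain") in {"medical", "legal", "financial"}:
        return "yellow"
    # Internal, rate-limited tools are low risk: approve automatically
    if answers.get("end_users") == "internal" and answers.get("output_rate_limited"):
        return "green"
    # Default to human review when the rules are inconclusive
    return "yellow"

zone = auto_qualify({"end_users": "internal", "output_rate_limited": True})  # 'green'
```

Only the green and red ends of the spectrum are decided automatically; everything ambiguous falls through to the same human-in-the-loop queue described above, which is exactly where reviewer time is best spent.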