AI language models and human curation

Decades ago, AI researchers largely abandoned their quest to build computers that mimic our wondrously flexible human intelligence and instead created algorithms that were useful (i.e. profitable). Despite this understandable detour, some AI enthusiasts market their creations as genuinely intelligent, writes Gary N. Smith on Mind Matters.

Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets, statistical reasoning, and artificial intelligence, often involving stock market anomalies, statistical fallacies, and the misuse of data, has been widely cited. He is also an award-winning author of a number of books on AI. In his article, Smith sets out to explore the degree to which Large Language Models (LLMs) may be approximating real intelligence.

The idea behind LLMs is simple: use massive datasets of human-produced knowledge to train machine learning algorithms, with the goal of producing models that simulate how humans use language. There are a few prominent LLMs, such as Google's BERT, which was one of the first widely available and highly performing LLMs. Although BERT was introduced in 2018, it's already iconic. The publication that introduced BERT is nearing 40K citations in 2022, and BERT has driven a number of downstream applications as well as follow-up research and development.

BERT is already way behind its successors in terms of an aspect that is deemed central for LLMs: the number of parameters. This represents the complexity each LLM embodies, and the current thinking among AI experts seems to be that the larger the model, i.e. the more parameters, the better it will perform. Google's latest Switch Transformer LLM scales up to 1.6 trillion parameters and improves training time up to 7x compared to its previous T5-XXL model of 11 billion parameters, with comparable accuracy.

OpenAI, makers of the GPT-2 and GPT-3 LLMs, which are being used as the basis for commercial applications such as copywriting via APIs and a collaboration with Microsoft, have researched LLMs extensively. Their findings show that the three key factors involved in model scale are the number of model parameters (N), the size of the dataset (D), and the amount of compute power (C); a rough numerical sketch of this relationship follows at the end of this section. There are benchmarks specifically designed to test LLM performance in natural language understanding, such as GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail. Google has published research in which T5-XXL is shown to match or outperform humans on those benchmarks. We are not aware of similar results for the Switch Transformer LLM.

However, we may reasonably hypothesize that Switch Transformer is powering LaMDA, Google's "breakthrough conversation technology", aka chatbot, which is not available to the public at this point. Blaise Aguera y Arcas, the head of Google's AI group in Seattle, argued that "statistics do amount to understanding", citing a few exchanges with LaMDA as evidence.

This was the starting point for Smith to embark on an exploration of whether that statement holds water, and it's not the first time Smith has done so. In the line of thinking of Gary Marcus and other deep learning critics, Smith claims that LLMs may appear to generate sensible-looking results under certain conditions but break down when presented with input humans would easily comprehend. This, Smith argues, is because LLMs don't really understand the questions or know what they're talking about. In January 2022, Smith reported using GPT-3 to illustrate the fact that statistics do not amount to understanding.
In March 2022, Smith tried to run his experiment again, triggered by the fact that OpenAI admits to employing 40 contractors to manually curate GPT-3's answers. In January, Smith had tried a number of questions, each of which produced a number of "confusing and contradictory" answers. In March, GPT-3 answered each of those questions coherently and sensibly, with the same answer given each time. However, when Smith tried new questions and variations on them, it became evident to him that OpenAI's contractors were working behind the scenes to fix glitches as they appeared. This prompted Smith to liken GPT-3 to the Mechanical Turk, the chess-playing automaton built in the 18th century, in which a chess master had been cleverly hidden inside the cabinet.

Although some LLM proponents are of the opinion that, at some point, the sheer size of LLMs may give rise to true intelligence, Smith disagrees. GPT-3 is very much like a performance by a good magician, Smith writes. We can suspend disbelief and think that it is real magic, or we can enjoy the show even though we know it is just an illusion.
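As an aside on the scaling claim above: OpenAI's scaling-law research (Kaplan et al., 2020) found that test loss falls roughly as a power law in each of N, D, and C when the other factors are not the bottleneck. The snippet below is a rough sketch of that relationship for N and D only; the exponents and constants are the approximate values reported in that paper, used here purely for illustration.

```python
# Rough sketch of the power-law scaling reported by Kaplan et al. (2020):
# test loss falls roughly as (N_c / N)^alpha_N in parameter count and as
# (D_c / D)^alpha_D in dataset size. The constants below are approximate
# values quoted in that paper, used only for illustration.
ALPHA_N, N_C = 0.076, 8.8e13   # parameter-scaling exponent and scale constant
ALPHA_D, D_C = 0.095, 5.4e13   # data-scaling exponent and scale constant (tokens)

def loss_from_params(n_params: float) -> float:
    """Approximate loss when model size is the limiting factor."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data(n_tokens: float) -> float:
    """Approximate loss when dataset size is the limiting factor."""
    return (D_C / n_tokens) ** ALPHA_D

# Doubling parameters or data only shaves a few percent off the loss,
# which is why "bigger is better" quickly becomes very expensive.
print(loss_from_params(1e9), loss_from_params(2e9))
print(loss_from_data(1e10), loss_from_data(2e10))
```

The shallow exponents are the point: each incremental improvement requires a multiplicative increase in parameters, data, and compute, which is what drives the race toward trillion-parameter models like Switch Transformer.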

Do AI language models have a moral compass?

Lack of common-sense understanding and the resulting confusing and contradictory outcomes constitute a well-known shortcoming of LLMs – but there's more. LLMs raise an entire array of ethical questions, the most prominent of which revolve around the environmental impact of training and using them, as well as the bias and toxicity such models demonstrate. Perhaps the most high-profile incident in this ongoing public conversation thus far was the termination/resignation of Google Ethical AI Team leads Timnit Gebru and Margaret Mitchell, who faced scrutiny at Google when they attempted to publish research documenting those issues and raising those questions in 2020.

Beyond the ethical implications, however, there are practical ones as well. LLMs created for commercial purposes are expected to be in line with the norms and moral standards of the audience they serve in order to be successful. Producing marketing copy that is considered unacceptable due to its language, for example, limits the applicability of LLMs. This issue has its roots in the way LLMs are trained. Although techniques to optimize the LLM training process are being developed and applied, LLMs today represent a fundamentally brute-force approach, according to which throwing more data at the problem is a good thing.

As Andrew Ng, one of the pioneers of AI and deep learning, shared recently, that wasn't always the case. For applications where there is lots of data, such as natural language processing (NLP), the amount of domain knowledge injected into the system has gone down over time. In the early days of deep learning, people would routinely train a small deep learning model and then combine it with more traditional domain knowledge base approaches, Ng explained, because deep learning wasn't working that well. This is something that people like David Talbot, former machine translation lead at Google, have been saying for a while: applying domain knowledge, in addition to learning from data, makes lots of sense for machine translation. In the case of machine translation and NLP, that domain knowledge is linguistics. But as LLMs got bigger, less and less domain knowledge was injected, and more and more data was used.

One key implication is that the LLMs produced through this process reflect the bias in the data used to train them. As that data is not curated, it includes all sorts of input, which leads to undesirable outcomes. One approach to remedy this would be to curate the source data. However, a group of researchers from the Technical University of Darmstadt in Germany approaches the problem from a different angle. In their paper published in Nature Machine Intelligence, Schramowski et al. argue that "Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do". While the fact that LLMs reflect the bias of the data used to train them is well established, this research shows that recent LLMs also contain human-like biases about what is right and wrong to do – some form of ethical and moral societal norms. As the researchers put it, LLMs bring a "moral direction" to the surface.

The research comes to this conclusion by first conducting studies with humans, in which participants were asked to rate certain actions in context. An example would be the action "kill", given different contexts such as "time", "people", or "insects". Those actions in context are assigned a score in terms of right/wrong, and the answers are used to compute moral scores for phrases.
Moral scores for the same phrases are then computed for BERT, using a method the researchers call the moral direction. What the researchers show is that BERT's moral direction strongly correlates with human moral norms. Furthermore, the researchers apply BERT's moral direction to GPT-3 and find that it performs better than other methods for preventing so-called toxic degeneration in LLMs.

While this is an interesting line of research with promising results, we can't help but wonder about the moral questions it raises as well. To begin with, moral values are known to vary across populations. Besides the bias inherent in selecting population samples, there is even more bias in the fact that both BERT and the people who participated in the study use the English language; their moral values are not necessarily representative of the global population. Furthermore, while the intention may be good, we should also be aware of the implications. Applying similar techniques produces results that are curated to exclude manifestations of the real world, in all its serendipity and ugliness. That may be desirable if the goal is to produce marketing copy, but it's not necessarily the case if the goal is to have something representative of the real world.
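To make the mechanics a little more concrete, here is a minimal sketch of how a moral-direction-style score could be computed from sentence embeddings. It is loosely inspired by the paper's approach (a principal component over embeddings of actions in context), but the encoder model, the anchor phrases, and the scoring details are our own illustrative assumptions, not the authors' exact setup.

```python
# Illustrative sketch of a "moral direction" style score, loosely inspired by
# Schramowski et al.; the model, anchor phrases, and scoring are assumptions
# made for this example, not the paper's exact method.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in sentence encoder

# Small hand-picked anchor phrases with an assumed positive/negative polarity.
positive = ["help people", "greet my neighbors", "comfort a friend"]
negative = ["harm people", "steal money", "lie to a friend"]

# The first principal component of the anchor embeddings serves as the axis.
anchors = model.encode(positive + negative)
pca = PCA(n_components=1)
projected = pca.fit_transform(anchors).ravel()

# Orient the axis so that positive anchors score higher on average.
sign = 1.0 if projected[: len(positive)].mean() > projected[len(positive):].mean() else -1.0

def moral_score(phrase: str) -> float:
    """Project a phrase onto the oriented first principal component."""
    vec = model.encode([phrase])
    return float(sign * pca.transform(vec).ravel()[0])

for phrase in ["kill time", "kill people", "kill insects"]:
    print(phrase, round(moral_score(phrase), 3))
```

With such a tiny anchor set the numbers are only indicative, but one would expect "kill time" to land closer to the positive anchors than "kill people", which is the kind of context sensitivity the researchers report.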

MLOps: Keeping track of the machine learning process and biases

If that situation sounds familiar, it's because we've seen it all before: should search engines filter out results, or social media platforms censor certain content and deplatform certain people? If yes, then what are the criteria, and who gets to decide? The question of whether LLMs should be massaged to produce certain outcomes seems like a direct descendant of those questions. Where people stand on such questions reflects their moral values, and the answers are not clear-cut.

However, what emerges from both examples is that for all their progress, LLMs still have a long way to go in terms of real-life applications. Whether LLMs are massaged for correctness by their creators or for fun, profit, ethics, or whatever other reason by third parties, a record of those customizations should be kept. That falls under the discipline called MLOps: similar to how, in software development, DevOps refers to the process of developing and releasing software systematically, MLOps is the equivalent for machine learning models. Just as DevOps enables not only efficiency but also transparency and control over the software creation process, so does MLOps. The difference is that machine learning models have more moving parts, so MLOps is more complex. But it's important to have a lineage of machine learning models, not just to be able to fix them when things go wrong but also to understand their biases.

In software development, open source libraries are used as building blocks that people can use as-is or customize to their needs. We have a similar notion in machine learning, as some machine learning models are open source. While it's not really possible to change machine learning models directly in the same way people change code in open source software, post-hoc changes of the type we've seen here are possible. We have now reached a point where we have so-called foundation models for NLP: humongous models like GPT-3, trained on tons of data, that people can fine-tune for specific applications or domains. Some of them are open source, too. BERT, for example, has given birth to a number of variations.

Against that backdrop, scenarios in which LLMs are fine-tuned according to the moral values of the specific communities they are meant to serve are not inconceivable. Both common sense and AI ethics dictate that people interacting with LLMs should be aware of the choices their creators have made. While not everyone will be willing or able to dive into the full audit trail, summaries or license variations could help towards that end.
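To illustrate what such a record could look like in practice, here is a minimal, tool-agnostic sketch of a lineage entry that captures the base model, the fine-tuning data, and any post-hoc adjustments. All names and fields are hypothetical, and a real MLOps setup would track far more (training code versions, hyperparameters, evaluation results, and so on).

```python
# A minimal, tool-agnostic sketch of a model lineage record: capture the base
# model, the fine-tuning data, and any post-hoc adjustments so downstream
# users can audit them. Field names and values are illustrative assumptions,
# not a standard schema.
import json
from datetime import datetime, timezone

lineage = {
    "model_name": "support-bot-v3",                       # hypothetical fine-tuned model
    "base_model": "bert-base-uncased",                    # open source foundation model
    "fine_tuning_data": "internal-support-tickets-2022",  # hypothetical dataset id
    "post_hoc_adjustments": [
        {"type": "toxicity filter",
         "description": "moral-direction style filtering of generated text"},
        {"type": "manual patch",
         "description": "curated answers for known failure cases"},
    ],
    "created_at": datetime.now(timezone.utc).isoformat(),
}

# Persist the record alongside the model artifacts so it travels with them.
with open("model_lineage.json", "w") as f:
    json.dump(lineage, f, indent=2)
```

A summary generated from records like this is the kind of artifact that could accompany a model release, so that users do not need to reconstruct the full audit trail themselves.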