From doomsday scenarios of malevolent robots to the real concerns around facial recognition software and automated decision-making systems, as illustrated in the film Coded Bias, the power of artificial intelligence has been accompanied by fears of its misuse. At the most recent NetIKX seminar, Ahren Lehnert of Synaptica addressed some of these issues from a knowledge and information management perspective.
As information professionals, the aspect of AI which we will encounter most frequently is machine learning (ML), the study of computer algorithms which improve automatically through experience and by the use of data. Machine learning requires ‘big data’, but this data must be clean, accurate and appropriate to the problem we are trying to solve. Many companies are yet to work out how to de-silo and clean their data, or how to analyse it once the preparatory work is complete. The main reasons why ML ‘fails’ are inaccurate, inconsistent or insufficient data – patterns are harder to find in inconsistent data – coupled with a tendency for teams to spend most of their time on data cleansing rather than on improving the machine learning models themselves.
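The point about inconsistency is easy to see in miniature. In this hypothetical sketch (the labels are invented for illustration), variant spellings of the same category fragment the counts a model would learn from, and a simple normalisation step collapses them back into consistent controlled values:

```python
# Hypothetical example: normalising inconsistent category labels before
# they reach a machine learning pipeline. Variant spellings of the same
# value fragment the patterns a model can learn from.
from collections import Counter

def normalise(label: str) -> str:
    """Trim whitespace, lowercase, and collapse internal spaces."""
    return " ".join(label.strip().lower().split())

raw_labels = ["Finance", "finance ", "FINANCE", "Recruitment", " recruitment"]
cleaned = [normalise(label) for label in raw_labels]
counts = Counter(cleaned)
# Five raw variants collapse into two consistent values:
# Counter({'finance': 3, 'recruitment': 2})
```

Real data cleansing is of course far messier than this, but the principle is the same: the fewer spurious variants in the data, the clearer the patterns available to the model.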
Bots and bias
Common use cases for machine learning include content tagging, recommendation engines, recruitment, crime prevention, finance and autonomous vehicles. However, ML algorithms rely on a dataset of training data in order to make predictions or decisions – and this is where many of the ethical issues arise. A recent example was Microsoft’s chatbot Tay, which ‘learned’ from the data of interactions with Twitter users: in less than 24 hours, the bot was repeating abusive and offensive language.
The main source of bias, as Ahren pointed out, is that content is generated by people, and people have biases. When we refer to an algorithm as ‘racist’, we actually mean that the human-generated data behind the algorithms reflects racist assumptions – for example, when identifying crime suspects or evaluating mortgage applications. A related issue which is of particular concern to information professionals is that taxonomies can also have an inbuilt bias, as they are built from content which usually reflects an organisational viewpoint and which may well be flawed both in content and coverage. The Library of Congress classification scheme has recently attracted attention because of its structure and terminology, both of which are increasingly viewed as problematic.
What is ethical and who decides?
So what is ‘ethical AI’ and how can we achieve this? A useful starting point is provided by the FAST Track Principles (Fairness, Accountability, Sustainability, Transparency) developed by the Alan Turing Institute. Transparency requires ML and AI outcomes to be explainable – too often, ML systems are ‘black boxes’ where the user has no insight into the reason behind the results delivered by an algorithm. What all these factors – most obviously accountability – have in common is the importance of keeping the human in the loop throughout the entire process, from design to implementation.
How do we reach a consensus on what is ethical? There are currently no codified standards, so who writes and enforces codes of conduct? Within an organisation, the emphasis needs to be on industry and geo-specific regulatory requirements and conducting appropriate risk assessments, as well as a willingness to invest in legal resources and to understand social concerns. There are some existing frameworks which can be used as guidelines: Ahren suggested finding something that works and adapting it for your own organisation.
A virtuous content cycle – KM to the rescue?
Knowledge organisation principles can help here by providing a controlled perspective. Many problems are solved by classification, so providing examples of the desired classifications will help to train the model. Labelling training data with predefined classes, using text analytics to process and extract new controlled values (and to cluster known and unknown values from unsupervised learning tasks) and using taxonomy and ontology management software to map these values are all vital steps in avoiding the pitfalls of statistical bias, ambiguity and incompleteness. Thus, we can create a ‘virtuous content cycle’: using content to build a taxonomy, then using the taxonomy to tag content with controlled values, and finally retrieving content in search based on the tags and text keywords.
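The tag-and-retrieve half of that cycle can be sketched in a few lines of Python. The taxonomy terms, synonyms and documents below are invented purely for illustration, and a real implementation would use taxonomy management and text analytics software rather than naive substring matching – but the shape of the cycle is the same: controlled terms tag the content, and the tags then drive retrieval.

```python
# Illustrative sketch of the 'virtuous content cycle': tag documents with
# controlled taxonomy values, then retrieve documents by those tags.
# All terms, synonyms and documents here are hypothetical.
taxonomy = {
    "machine learning": ["machine learning", "ml model", "training data"],
    "ethics": ["bias", "fairness", "accountability"],
}

documents = {
    "doc1": "Training data quality determines how well an ML model performs.",
    "doc2": "Fairness and accountability are central to responsible AI.",
}

def tag(text: str) -> set[str]:
    """Return the controlled taxonomy terms whose synonyms appear in the text."""
    lowered = text.lower()
    return {term for term, synonyms in taxonomy.items()
            if any(s in lowered for s in synonyms)}

# Tagging step: content is indexed under controlled values, not free text.
index = {doc_id: tag(text) for doc_id, text in documents.items()}

def search(term: str) -> list[str]:
    """Retrieve the documents tagged with a given controlled term."""
    return [doc_id for doc_id, tags in index.items() if term in tags]
```

Because every document is tagged from the same controlled vocabulary, a search for ‘ethics’ finds the relevant document even though the word ‘ethics’ never appears in its text – which is precisely the benefit of tagging with controlled values rather than relying on keywords alone.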
One crucial step to success is to create trust in the data and systems, which can be achieved by following the FAST Track principles as detailed above and by following good data governance, content and knowledge management practice. Time invested in setting up an information architecture which prepares, cleanses and governs data appropriately pays off handsomely compared with cleansing data afresh for each new machine learning activity. Clearly communicating the source and nature of data will also help to increase trust in the system and its resulting analytics and reporting.
Ahren left us with three discussion questions: Whose responsibility is AI? What can we do in our roles as information professionals to work toward ethical AI? And is Star Wars or Star Trek better, and can we even make such a claim? – the final question relating specifically to issues of ethics, artificial intelligence and superintelligent AI. Although many of us were not working directly with AI or machine learning in a professional capacity, we were all able to identify ethical aspects of AI which impacted on our everyday lives and work tasks. Thanks to Ahren and all who participated for a stimulating and entertaining seminar!