"Academia has developed an amazing tree of knowledge which is arguably the most important data for Large Language Models to be trained on," writes Stuart Leitch. "Where does the scholarly communication community fit in?" It's a good question, and one we need to answer with care. We know that the quality of the data given to an AI impacts the quality of the results. I saw a LinkedIn post a day ago (now unfindable because it's LinkedIn) complaining about the representation of "Irish professors" as uniformly older white men. Do a search on Google for that subject, though, and you'll get pretty much the same result. And it could be that the majority of Irish professors actually are older white men (I can't say one way or another). AI tends to reflect back the patterns it sees in the data.
AI will get better the more it becomes 'multi-modal'. We'll be hearing this term a lot. But it has two separate meanings. In one sense, it means that AI will be able to use things like images, videos, websites, and even real life as input data. The other sense of multi-modal AI is the one connoted in this article: governing the statistical inferences an AI makes with solid factual data as produced by, say, the scholarly communication community. I agree with Stuart Leitch: "rather than trying to get premium scholarly information out of LLM training sets, we should fight to get it in there, on terms that are economically sustainable." But that does not mean, as suggested here (and previously), combining it with symbolic model-driven AI (see the illustration). AI can learn from symbolic models, but it should not be controlled by them.