Thinking about LLMs and Multi-modality
In this post, I would discuss some of my thoughts on Multi-modality and future LLMs directions. Welcome to post your thoughts in the comments and discuss with me.
Thoughts on Multi-modality
- Since OpenAI (GPT-4) and Baidu (Wen Xin Yi Yan) released LLMs with different design for multi-modality. GPT-4 unifies the embedding representation between text and images, while WXYY seems to unify the ability to generate different medium content (text, image, voice and video).
- In my point of view, multi-modality should support machine's comprehension by encoding different medium into a unified latent space, bridging the information between different modality. It is a progressive simulation of human's learning approach, from reading only to read and see.
- In Xiaotong Fei's perspective (Chinese anthropologist and sociologist) from nearly 80 years ago, written language has intrinsic deficiency which could convey only incomplete and ambiguous meanings compare to real life situation. For example, we could tell from voice who is visiting us, or from body language to understand other's emotion. It is therefore natural to gradually integrate the ability to take in and understand different medium as AI develops.
- By quoting Xiaotong Fei's ideas years ago, I want to emphasize that we could to draw inspirations from studies which tries to explain how human learns and perceives. Artificial Intelligence, should always be rooted in the study of human intelligence.
Current LLMs Situation
Not hitting human-level intelligence yet. Though it perform well in exams and chatting, it is still not good tasks like few-shot learning, long text comprehension and factual generation.
GPT-4 performed well in many exams. Though it's not human-level intelligence yet, it is time when we start to think about the essence of human intelligence, or more specific, what makes us human unique compared with LLMs.
Human’s unique competence in AI’s perspective
Here is the answer GPT-4 gave me (which I used my holistic understanding to summarize them into the below key points). OpenAI also published a relevant study about LLMs' impact at the labor market. #### Empathy and Emotion intelligence
- Ability to understand, respond emotions.
Collaboration
- Collaborate and work as a team with others, combining individual strengths.
Creativity and imagination
- Think divergently, create original ideas and innovative plans,
imagine novel concepts.
- In AI generated art for instance, we could see the potentials of human imagination empowered by a drawing tool that understands human prompts.
Adaptability and learning
- Adaptability is kind of few-shot learning capability. Humans could adjust fast to new situations and environments.
Intuition and decision-making
- Making decisions with incomplete information. Similar to the concept tacit knowledge.
Morality and ethics
- AI lack a true understanding of moral and ethical values.
Holistic understanding
- Ability to integrate information from various sources and modality to form a holistic understanding.
Critical thinking
- Ability to evaluate information based on logic and intuition.
Context and common sense
- Understanding of the problem situation and make reasonable judgements.
Future LLMs directions
Guesses based on current LLM's deficiency.
Multimodal comprehension
- to enable different interaction methods.
- Multi-modal comprehension could enable more training data for models. Imagine AI learns how to do carpenter work by watching a 3D video of a master carpenter.
Explainability
- Understanding different part of LLMs, why and how they draw certain conclusions, could help eliminating hallucination.
Safety
- Fact-based, combined with thought chain for inference
- Generate more accurate content could prepare AI for more responsible positions.
- Detecting AI generated content.
- Being able to judge AI generated content is essential when the generated content contains hallucination.
- Related Research - Guo et al., 2023
- Ensure data privacy during training / fine-tuning
- Debates about what data could be used during training. Consensus needs to be drawn.
- Online GPT models may be attacked
- Indirect prompt injection by human invisible content on web.
Accessibility
- Model distill technology could be used to lower the cost of use.
- We could also see a lot of contributions from the academia and tech
companies aiming for democratic training and inferencing for LLMs.
- Meta proposed a series of LLMs trained based on publicly accessible data only - LLaMA Official Repo, which uses a lot fewer parameters to achieve similar results compared with large ones. More detail in the paper
- Stanford research group follows the previous research and fine-tuned a better performing model Alpaca.
- A new optimisation of LLaMA was proposed in this repo which allows Mac to run the large model.
Bionic Structure?
- For now, LLMs seems to be stacking transformers (or say GPUs) to achieve better performance. I suppose we could improve the performance by using different networks combinations (change of architecture) to mimic human brain's architecture. This reverse-engineering process may be hard, but is worthwhile as it could also bring better understanding of the nature of human intelligence.
- Here is an inspiring research - spiking neural network helps prevent catastrophic forgetting
- We had long followed the mode to draw inspirations from bionics in tech development, I suppose it's necessary to collaborate with intelligence-related domains (such as neuroscience) for future study.
Advanced Reinforcement learning?
- With the improved multi-modality comprehension, we could expect more intuitive learning mechanism for the RLHF proposed.
What could down-stream companies do?
Model As A Service (MaaS)
- Integrate LLM API as additional service - help the company focus on
solving domain specific tasks.
- Advantage
- Controllable response in specific domains
- Ability to integrate with customized services and products.
- Disadvantage
- Extra efforts needed to develop models that distinguishes LLMs' job and own product’s.
- sense of fragmentation between own service & API service if the performance gap is huge.
- Advantage
- Or just use the basic models as the entire product
- Advantage
- Quick to deploy
- Disadvantage
- restricted controllability & steerability (no exposure for own product)
- Potential risks with unsafe AI generated answers.
- Advantage
Fine-tuning domain-specific model
- Use domain specific data (as it's protected)
- For example
- with psychological counseling conversation data to mimic the style.
- with git messages & comments & stackoverflow data as a coding copilot
- with structured financial statements & annotated analytical report as a finance advisor
- Haven't seen much examples in practice