Thinking about LLMs and Multi-modality

In this post, I would discuss some of my thoughts on Multi-modality and future LLMs directions. Welcome to post your thoughts in the comments and discuss with me.

Thoughts on Multi-modality

Since OpenAI (GPT-4) and Baidu (Wen Xin Yi Yan) released LLMs with different design for multi-modality. GPT-4 unifies the embedding representation between text and images, while WXYY seems to unify the ability to generate different medium content (text, image, voice and video).
In my point of view, multi-modality should support machine's comprehension by encoding different medium into a unified latent space, bridging the information between different modality. It is a progressive simulation of human's learning approach, from reading only to read and see.
In Xiaotong Fei's perspective (Chinese anthropologist and sociologist) from nearly 80 years ago, written language has intrinsic deficiency which could convey only incomplete and ambiguous meanings compare to real life situation. For example, we could tell from voice who is visiting us, or from body language to understand other's emotion. It is therefore natural to gradually integrate the ability to take in and understand different medium as AI develops.
By quoting Xiaotong Fei's ideas years ago, I want to emphasize that we could to draw inspirations from studies which tries to explain how human learns and perceives. Artificial Intelligence, should always be rooted in the study of human intelligence.

Current LLMs Situation

Not hitting human-level intelligence yet. Though it perform well in exams and chatting, it is still not good tasks like few-shot learning, long text comprehension and factual generation.

GPT-4 performed well in many exams. Though it's not human-level intelligence yet, it is time when we start to think about the essence of human intelligence, or more specific, what makes us human unique compared with LLMs.

Human’s unique competence in AI’s perspective

Here is the answer GPT-4 gave me (which I used my holistic understanding to summarize them into the below key points). OpenAI also published a relevant study about LLMs' impact at the labor market. #### Empathy and Emotion intelligence

Ability to understand, respond emotions.

Collaboration

Collaborate and work as a team with others, combining individual strengths.

Creativity and imagination

Think divergently, create original ideas and innovative plans, imagine novel concepts.
- In AI generated art for instance, we could see the potentials of human imagination empowered by a drawing tool that understands human prompts.

Adaptability and learning

Adaptability is kind of few-shot learning capability. Humans could adjust fast to new situations and environments.

Intuition and decision-making

Making decisions with incomplete information. Similar to the concept tacit knowledge.

Morality and ethics

AI lack a true understanding of moral and ethical values.

Holistic understanding

Ability to integrate information from various sources and modality to form a holistic understanding.

Critical thinking

Ability to evaluate information based on logic and intuition.

Context and common sense

Understanding of the problem situation and make reasonable judgements.

Future LLMs directions

Guesses based on current LLM's deficiency.

Multimodal comprehension

to enable different interaction methods.
Multi-modal comprehension could enable more training data for models. Imagine AI learns how to do carpenter work by watching a 3D video of a master carpenter.

Explainability

Understanding different part of LLMs, why and how they draw certain conclusions, could help eliminating hallucination.

Safety

Fact-based, combined with thought chain for inference
- Generate more accurate content could prepare AI for more responsible positions.
Detecting AI generated content.
- Being able to judge AI generated content is essential when the generated content contains hallucination.
- Related Research - Guo et al., 2023
Ensure data privacy during training / fine-tuning
- Debates about what data could be used during training. Consensus needs to be drawn.
Online GPT models may be attacked
- Indirect prompt injection by human invisible content on web.

Accessibility

Model distill technology could be used to lower the cost of use.
We could also see a lot of contributions from the academia and tech companies aiming for democratic training and inferencing for LLMs.
- Meta proposed a series of LLMs trained based on publicly accessible data only - LLaMA Official Repo, which uses a lot fewer parameters to achieve similar results compared with large ones. More detail in the paper
- Stanford research group follows the previous research and fine-tuned a better performing model Alpaca.
- A new optimisation of LLaMA was proposed in this repo which allows Mac to run the large model.

Bionic Structure?

For now, LLMs seems to be stacking transformers (or say GPUs) to achieve better performance. I suppose we could improve the performance by using different networks combinations (change of architecture) to mimic human brain's architecture. This reverse-engineering process may be hard, but is worthwhile as it could also bring better understanding of the nature of human intelligence.
Here is an inspiring research - spiking neural network helps prevent catastrophic forgetting
We had long followed the mode to draw inspirations from bionics in tech development, I suppose it's necessary to collaborate with intelligence-related domains (such as neuroscience) for future study.

Advanced Reinforcement learning?

With the improved multi-modality comprehension, we could expect more intuitive learning mechanism for the RLHF proposed.

What could down-stream companies do?

Model As A Service (MaaS)

Integrate LLM API as additional service - help the company focus on solving domain specific tasks.
- Advantage
  - Controllable response in specific domains
  - Ability to integrate with customized services and products.
- Disadvantage
  - Extra efforts needed to develop models that distinguishes LLMs' job and own product’s.
  - sense of fragmentation between own service & API service if the performance gap is huge.
Or just use the basic models as the entire product
- Advantage
  - Quick to deploy
- Disadvantage
  - restricted controllability & steerability (no exposure for own product)
  - Potential risks with unsafe AI generated answers.

Fine-tuning domain-specific model

Use domain specific data (as it's protected)
For example
- with psychological counseling conversation data to mimic the style.
- with git messages & comments & stackoverflow data as a coding copilot
- with structured financial statements & annotated analytical report as a finance advisor
Haven't seen much examples in practice

Informal Essays

#Artificial Intelligence #Generative AI

Thinking about LLMs and Multi-modality

https://delusion4013.github.io/2023/03/21/Thinking-about-LLMs-and-Multi-modality/

Author

Chenkai

Posted on

March 21, 2023

Licensed under

SD - 2.How to use diffusion web ui? - img2img Previous

SD - 1. How to use diffusion web ui? - txt2img Next