Although artificial intelligence (AI) and machine learning did not enter the mainstream until around 2023, rapid advancements have quietly been made behind the scenes at a pace outstripping even Moore’s law. The most groundbreaking discoveries have concentrated on natural language processing (NLP), which opened the door for large language models (LLMs). This in turn has created a hide-and-seek narrative between generative models that create content and detection models that aim to distinguish human-made content from synthetically generated content.
Alongside the most popular generation models produced by AI giants like OpenAI, Google, and Meta, collaborative projects from institutes like the BigScience workshop have also worked to make LLMs more democratic and accessible. Surveying the recent literature, it seems clear that the field of AI is in a quasi arms race between AI applications that can mimic human interaction and detection algorithms that work to ensure such applications are used responsibly.
Below is a deeper dive into the major LLMs produced in the last few years, as well as the detection algorithms created in response to them.
Developed by a group of senior researchers at Google, the “Text-to-Text Transfer Transformer”, or “T5”, presents one of the most influential applications of transfer learning to NLP. Transfer learning here means that a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task of interest. This approach has proven effective for NLP tasks, leading the team to create a unified framework that converts all such language-based problems into a text-to-text format. Training the model requires both an input sequence and a corresponding target sequence; during training, the decoder is fed the ground-truth target tokens at each step in a process called teacher forcing.
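To make the text-to-text idea concrete, below is a minimal sketch using the open-source Hugging Face transformers library; the "t5-small" checkpoint and the translation prefix are illustrative choices rather than details taken from the paper itself.

```python
# Minimal sketch of T5's text-to-text format (illustrative; assumes the
# transformers and sentencepiece packages are installed).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as "text in, text out": a task prefix states the task,
# and the target is simply another string.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
targets = tokenizer("Das Haus ist wunderbar.", return_tensors="pt")

# Passing labels makes the decoder receive the ground-truth target tokens at
# each step during training -- this is the teacher forcing mentioned above.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=targets.input_ids).loss
print(float(loss))

# At inference time the same model simply generates the output text.
generated = model.generate(inputs.input_ids, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```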
Developed by the team at OpenAI, GPT-3 represents the third iteration of their generative pre-trained transformer. The model is highly significant for the company and the AI industry as a whole: it acted as the precursor to ChatGPT, the first such model made broadly available to the public, and marks one of the first identifiable instances of artificial intelligence entering the mainstream lexicon. GPT-3 was inspired by the benchmark improvements achieved by pre-training models/transformers on large volumes of text and subsequently “fine-tuning” output and performance for specific tasks.
As such, OpenAI built GPT-3 as an autoregressive language model with 175 billion parameters (at the time of its release, roughly ten times more than any comparable model) and showed that it could perform many tasks from a few in-context examples, without any gradient updates or task-specific “fine-tuning”.
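As a rough illustration of what “no gradient updates” means in practice, the sketch below builds a few-shot prompt and lets a frozen causal language model complete it. GPT-3 itself is served through OpenAI’s API, so the small open GPT-2 checkpoint here is only a stand-in.

```python
# Illustrative few-shot ("in-context") prompting with a frozen model.
# GPT-2 is used as a stand-in for GPT-3; no weights are updated.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "house =>"
)

# The examples in the prompt alone condition the completion.
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```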
Announced by Chinese tech giant Huawei back in April 2021, the PanGu cloud AI suite included what was then China’s largest pre-trained NLP model. It also marks the first major shift from the market-standard focus on AI solutions for the everyday consumer toward a more industrial approach.
At the time of release, the PanGu suite offered four distinct models, each optimized for one domain: NLP, computer vision (CV), multimodal, or scientific computing applications that are commonplace in the workplace. Throughout the PanGu cloud suite, each model was continuously trained to achieve high generalizability and accuracy while requiring relatively small sample sizes to learn.
Introduced by a group of researchers at Meta AI (formerly known as Facebook), the OPT (Open Pre-trained Transformer) models represent the democratization of access to LLMs. As described by the team at Meta, such models normally carry exorbitantly high computational costs, making them exceedingly difficult to replicate and test without significant capital and effectively limiting their understanding and development to large corporations. As such, the Meta team developed OPT as a suite of open-source pre-trained transformers that resemble and perform similarly to OpenAI’s GPT-3.
Another addition from the research team at Google, the Pathways Language Model (PaLM) represents the first iteration of Google’s in-house machine learning system called “Pathways”. The model was developed to demonstrate the scaling capability of the Pathways system and to further the understanding of the role scale plays in few-shot learning. PaLM is a 540-billion-parameter model trained across 6,144 chips using data parallelism, successfully setting a new benchmark for scaling and efficiency. The model and its derivatives were kept private until Google released an API for developers around its Google I/O conference in May 2023.
A group of researchers from Beijing, China, highlighted reduced performance when applying pre-trained architectures such as autoregressive models (GPT) and encoder-decoder models (T5) to tasks spanning natural language understanding (NLU), unconditional generation, and conditional generation. To overcome this problem, the researchers developed another type of LLM called a “General Language Model” (GLM). This type of model uses a technique called autoregressive blank infilling: spans of text are blanked out of the input and then regenerated autoregressively, with 2D positional encodings and a random prediction order over the spans, which results in improved performance on NLU-based applications.
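The toy snippet below illustrates the blank-infilling setup in a simplified form; the sentinel tokens, span choices, and shuffling are illustrative, and the paper’s actual corruption scheme is more involved.

```python
# Toy illustration of autoregressive blank infilling: spans are blanked out
# of the input and must be regenerated, in a random order, as the target.
import random

tokens = "the quick brown fox jumps over the lazy dog".split()
spans = [(1, 3), (6, 8)]                      # (start, end) indices to blank

corrupted = list(tokens)
targets = []
for i, (s, e) in enumerate(spans):
    targets.append((f"[MASK{i}]", tokens[s:e]))
    corrupted[s:e] = [f"[MASK{i}]"] + [None] * (e - s - 1)
corrupted = [t for t in corrupted if t is not None]

random.shuffle(targets)                       # spans are predicted in random order
print("input :", " ".join(corrupted))
print("target:", " ".join(f"{m} {' '.join(words)}" for m, words in targets))
```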
Funded in part by the French government and Hugging Face, the BLOOM model was developed by hundreds of machine learning researchers under the BigScience workshop.
At the time of its release, the BLOOM model was heavily touted as the world’s largest open-source multilingual language model. Much like the OPT model discussed previously, BLOOM is a monumental progression in the AI space as it reduces the barriers to access for individuals without academic or institutional backing who may not have the resources necessary to test and experiment with large language models.
BLOOM furthers the democratization of this technology: its 176 billion parameters cover 46 natural languages and 13 programming languages, and it is the first model of its kind available for many of these languages.
Much like the BLOOM model, mT0 was developed under the BigScience umbrella. mT0 differs from BLOOM in that it was created by applying multitask finetuning (MTF) to the mT5 model developed by Google. With this technique, pre-trained multilingual models produce better zero-shot results in English and other languages, for both machine-translated and human-written prompts. In this way, this model, along with other MTF-based models, points toward improved capabilities that are both language- and task-agnostic.
Perhaps the most influential addition to this timeline, ChatGPT marks the first time everyday consumers could interact conversationally with large language models. Built on the GPT-3 line of models, ChatGPT was developed as a sister model to another OpenAI creation, InstructGPT. Both models were trained using a technique called “Reinforcement Learning from Human Feedback” (RLHF), in which human AI trainers supervised fine-tuning of the model by providing it with conversations where they played the roles of both the human and the AI assistant.
The LLaMA (Large Language Model Meta AI) model was developed and introduced by Meta (formerly known as Facebook). The main objective behind the creation of this model was to produce a high-performing AI model that was entirely trained on publicly available datasets. At the time of its release, LLaMA was able to outperform models trained on exclusive and proprietary datasets, such as GPT-3. This notion downplayed the need to train state-of-the-art models with private datasets, further encouraging the free and public exploration of large language models.
Although trained on over a trillion tokens, LLaMA was offered by Meta in several parameter sizes (7B, 13B, 33B, 65B) to allow people with resource constraints to interact with the model.
When it was released, Bard, announced by the AI team at Google, was the first major competitor to OpenAI’s ChatGPT. Much like ChatGPT, Bard was slowly rolled out to select beta users before becoming fully available to the general public. Bard is a conversational AI chatbot based on Google’s previously released “LaMDA” (Language Model for Dialogue Applications). A year after Bard was announced, in February 2024, it was rebranded as Gemini when Google announced iOS and Android applications for its AI model.
Building on the success of ChatGPT and GPT-3, OpenAI announced its latest and most advanced model, GPT-4. One of the most impressive advancements in this iteration was its benchmark performance relative to past versions: GPT-4 notably achieved a score in the top 10% of test takers on a simulated bar exam. Another monumental takeaway is that the model was built with optimization methods that greatly improved predictability; engineers were able to accurately forecast aspects of GPT-4’s performance from models trained with only a tiny fraction of its size and computational cost.
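To illustrate the idea of predicting a large run from small ones, one can fit a simple power law to the losses of small training runs and extrapolate. The sketch below is only a toy fit with invented numbers, not OpenAI’s actual methodology.

```python
# Toy illustration of "predictable scaling": fit a power law to small-run
# losses and extrapolate to the full compute budget. All numbers are invented.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e-7, 1e-6, 1e-5, 1e-4, 1e-3])   # fraction of target compute
loss = np.array([4.10, 3.55, 3.12, 2.78, 2.50])       # observed final losses

def scaling_law(c, a, b, irreducible):
    # loss falls off as a power of compute, plus an irreducible floor
    return a * c ** (-b) + irreducible

params, _ = curve_fit(scaling_law, compute, loss, p0=(0.5, 0.1, 2.0))
print(f"predicted loss at full compute: {scaling_law(1.0, *params):.2f}")
```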
Published by researchers at the Pennsylvania State University, “Authorship Attribution for Neural Text Generation” presents one of the first attempts at detecting AI-generated text content. This seminal work by the Penn State team focuses on discerning the “author” of a given text.
Authorship is determined by testing whether a single “author” is likely to have written the multiple texts provided and, if so, whether those texts were likely produced by a neural method such as CTRL, GPT/GPT-2, GROVER, XLM, or XLNet.
Although the researchers were able to confidently find consistent differences between human-authored text and AI-generated content for most neural models, they singled out the GPT-2, GROVER, and FAIR methods as the most obfuscating and hardest to discern reliably.
Building off previous works, “Through the Looking Glass: Learning to Attribute Synthetic Text Generated by Language Models” furthers the understanding of AI content detection through authorship attribution. Developed at the Lahore University of Management Sciences in Pakistan, XLNet-FT is a fine-tuned version of XLNet, a generalized autoregressive pretraining method introduced by Carnegie Mellon University.
The team behind XLNet-FT found that authorship attribution is possible by leveraging machine learning algorithms to pick up on marks inherent to the pre-trained language models that produce synthetic texts. It should be noted that this work was still preliminary at the time and was not applied to more diverse LLMs such as GPT-3, which had not yet been made publicly available.
Developed by Naver Labs Europe, “Unsupervised and Distributional Detection of Machine-Generated Text” marks a shift in approach for detecting AI-generated text content. The detection methods most common at this point consisted of pretraining classifiers on labeled data to predict the authorship origin of documents not seen during training. This work diverges from that method as it focuses on detection through distributional and unsupervised techniques.
Instead of providing the model with annotated synthetic texts, this detection method provides the model with a large collection of unannotated documents under the assumption that a large portion of them were in fact produced synthetically by AI.
In another addition to the field of AI content detection, the Pennsylvania State University introduces TuringBench in “TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation”. Unlike the other works discussed previously, TuringBench is less a detection model and more a benchmark environment used to determine how “human-like” a synthetic text is when generated by a given language model.
This benchmark environment comprises a dataset of 200K labelled document samples that are either human-authored or generated by a variety of different LLMs. The benchmark tests each LLM using the “Authorship Attribution” method noted earlier and another method referred to as the “Turing Test”. Much like the attribution work introduced by Penn State, the TuringBench environment also singled out GPT-3 as generating the most human-like content in preliminary testing.
During the 13th Conference on Language Resources and Evaluation held in Marseille, France, a research team from Hangzhou, China, representing the Fuxi AI Lab presented “Unraveling the Mystery of Artifacts in Machine Generated Text”. This work differed from others in the field in that the team aimed to explain how and why text generation models (TGMs) leave behind artifacts that aid in detecting their synthesis, rather than to detect the content itself. The work points toward key areas that can reliably hint at the use of AI generation.
For example, it was discovered that artifacts of AI generation are much more likely to be found in contextual words than in stopwords (common words that carry little standalone meaning, such as “the”, “is”, and “and”).
Continuing from the previous works and findings on the authorship attribution approach to detecting AI-generated content, the National University of Singapore presented “Whodunit? Learning to Contrast for Authorship Attribution”. This team of researchers differentiates its findings from preceding works by showing that past detection methods depend heavily on the datasets used to train the resulting models: the features used to attribute authorship were manually designed for the specific dataset used in training.
To remove this dependency on training datasets, the work proposes attributing authorship by fine-tuning generic pre-trained language representations with a contrastive objective. Using the contrastive objective, this type of model produces language clusters that reliably differentiate between authors.
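A minimal sketch of what a contrastive objective over author embeddings can look like is shown below; this is a generic supervised contrastive loss written in PyTorch for illustration, not the paper’s exact formulation.

```python
# Generic supervised contrastive loss: texts by the same author are pulled
# together in embedding space, texts by different authors are pushed apart.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, author_ids, temperature=0.1):
    """embeddings: (N, D) text representations; author_ids: (N,) labels."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                      # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))  # ignore self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (author_ids.unsqueeze(0) == author_ids.unsqueeze(1)) & ~self_mask
    # negative mean log-probability of same-author pairs
    per_sample = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_sample.mean()

# Illustrative call with random stand-ins for encoder outputs.
emb = torch.randn(8, 128)
authors = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(emb, authors))
```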
Stanford University introduced DetectGPT with the publication of “DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature”. The paper keys in on a property commonly found in the structure of most LLMs’ probability functions that aids in detecting AI-generated content: machine-generated text tends to lie in regions of negative curvature of the model’s log probability function. DetectGPT tests for this by comparing a passage’s log probability to the log probabilities of minor perturbations of that passage.
Notably, DetectGPT does not need to be trained with specific classifiers or particular datasets, nor does it require annotating text already confirmed to be synthetically derived. The Stanford research team found that DetectGPT was more discriminative in detecting fake news articles than other zero-shot detection methods.
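Below is a heavily simplified sketch of the perturbation-discrepancy idea behind DetectGPT. The paper perturbs passages with a T5 mask-filling model and scores them with the source model; here, random word dropout stands in for the perturbation step and GPT-2 stands in for the scoring model, so treat it as an illustration of the statistic rather than a faithful reimplementation.

```python
# Simplified DetectGPT-style perturbation discrepancy: if a passage is much
# more likely under the model than small perturbations of it, that suggests
# the passage sits near a local maximum of log-probability, which the paper
# associates with machine-generated text.
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_log_likelihood(text):
    # average log-probability per token (a simplification of the paper's scoring)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item()

def perturb(text, drop_rate=0.15):
    # crude stand-in for T5 mask-filling: randomly drop a fraction of the words
    words = text.split()
    kept = [w for w in words if random.random() > drop_rate]
    return " ".join(kept) if kept else text

def perturbation_discrepancy(text, n_perturbations=20):
    original = avg_log_likelihood(text)
    perturbed = [avg_log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)

sample = "The quick brown fox jumps over the lazy dog near the old river bank."
print(perturbation_discrepancy(sample))   # larger values suggest machine-generated
```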
RADAR (Robust AI-Text Detection via Adversarial Learning) is a concept in the field of AI content detection introduced by the Chinese University of Hong Kong. It presents a framework for training a detection model through adversarial learning so that it remains robust against LLM-based paraphrasing. The RADAR model is trained by first creating an AI paraphraser that is specifically instructed to rewrite content so that it evades detection.
The detector’s predictions then serve as feedback for the paraphraser, which uses them to become better at evading detection; the detector, in turn, is retrained against the paraphraser’s evasive rewrites, progressively improving RADAR’s ability to robustly detect AI-paraphrased text.
The OUTFOX model, proposed by the Tokyo Institute of Technology in the paper “OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples”, is similar to RADAR in its focus on adversarial learning. The research team contends that preceding works in AI content detection lack robustness against attacks, tending to degrade in accuracy as LLMs paraphrase synthetically authored texts.
To counteract this, the OUTFOX framework improves detection robustness against paraphrasing by letting the detector consider the attacking LLM’s output and vice versa. In doing so, the detector learns what iterative improvements the evading LLM is making to fool it, and in turn learns to correctly label the LLM’s output as synthetically authored.
One of the latest advancements in the field of AI content detection comes from the Lawrence Livermore National Laboratory. In the work “Few-Shot Detection of Machine-Generated Text Using Style Representations”, this team of researchers contends that many, if not all, previously presented detection models are largely unsuccessful at maintaining their detection accuracy against LLMs once those models are updated or improved.
They surmise that an effective detection approach should not require access to the models generating synthetic documents. Instead, an effective and long-lasting detection approach should be trained on representations of writing styles that are solely derived from human-authored text. Not only does this approach confidently distinguish amongst human authors, but the same framework can also distinguish between humans and LLMs and even amongst specific LLMs in select cases.
Much like the text-based models offered by OpenAI, GLIDE was introduced much earlier than many of its competitors. The model is based on diffusion models specifically geared toward text-conditional image synthesis; in layman’s terms, it was designed to produce images depicting input text descriptions.
Another key takeaway from GLIDE was its comparison to another image-based model developed by OpenAI, DALL-E. When the two models were compared by human evaluators, GLIDE samples generated with classifier-free guidance were preferred over DALL-E samples, even when DALL-E’s outputs were reranked using an expensive CLIP model.
Developed at the Ludwig Maximilian University of Munich, Germany, latent diffusion models (LDMs) decompose image formation into a sequential application of denoising autoencoders. Constructing models in this way allows a guiding mechanism to control image formation without retraining the model itself. The main drawback of these models is their cost.
Training powerful versions of such models requires hundreds of GPUs (graphics processing units) running for days, and inference is also expensive because the denoising steps must be evaluated sequentially. To work around this constraint, the team of researchers instead applied diffusion in the latent space of powerful pre-trained autoencoders, resulting in field-leading inpainting accuracy.
Presented by the “Brain Team” at Google Research, the Imagen model is a text-to-image diffusion model that focuses on photorealism and language understanding. When developing this model, the team at Google discovered that generic text transformers (e.g. T5), trained only on text datasets, were surprisingly effective at encoding text for image synthesis.
They also found that scale was correlated with image quality and alignment: the larger the text encoder, the more closely the produced images matched their input text descriptions. Along with the Imagen model, the Google team also introduced “DrawBench”, a challenging benchmark used to compare the relative performance of text-to-image models.
Based and built on the prior research published by the Ludwig Maximilian University of Munich on LDMs (latent diffusion models), Stability.ai released its Stable Diffusion model for image generation in August 2022. A major breakthrough stemming from the Stable Diffusion model is the relative ease of running it, allowing users with consumer-grade GPUs to access the model.
Researchers and users have been able to generate 512x512-pixel images while using less than 10GB of VRAM on consumer-grade GPUs, further democratizing image generation. The model was trained on a LAION dataset containing roughly 5 billion image-text pairs and aims to foster an open ecosystem that can easily be experimented with and built upon. Before its release, the Stable Diffusion model was tested by 10,000 beta testers, who created nearly 2 million images daily.
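Assuming the open-source diffusers library and a CUDA-capable consumer GPU, a minimal sketch of generating a 512x512 image looks roughly like this; the checkpoint id and memory behavior shown are common defaults, not guarantees.

```python
# Minimal Stable Diffusion inference sketch with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint id
    torch_dtype=torch.float16,          # half precision to fit consumer-grade VRAM
)
pipe = pipe.to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    height=512,
    width=512,
).images[0]
image.save("lighthouse.png")
```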
Developed and released by OpenAI, DALL-E 2 is the updated iteration of the DALL-E image synthesis model. The main structure and framework of the second DALL-E model were discussed in the published paper “Hierarchical Text-Conditional Image Generation with CLIP Latents” released by OpenAI in Q2 2022.
The DALL-E 2 model is constructed in two parts: one generates a CLIP (Contrastive Language–Image Pre-training) image embedding based on a given text description, and the next is a decoder that generates an image based on the embedding. This model is also based on diffusion techniques used in other models discussed earlier and is trained on hundreds of millions of captioned images sourced from the internet.
Stability.ai produced another text-to-image model with the release of its DeepFloyd IF model. The model consists of a frozen modular text encoder based on Google’s T5 transformer combined with three cascaded pixel diffusion models. The cascaded models operate in stages of increasing resolution (Stage 1 = 64x64 pixels, Stage 2 = 256x256 pixels, and Stage 3 = 1024x1024 pixels).
Compared to the Stable Diffusion model, IF is more computationally taxing, needing at least 16GB of VRAM to run properly instead of the single-digit figures demanded by Stable Diffusion. Much like Stable Diffusion and other AI offerings from Stability.ai, the DeepFloyd IF model is entirely open source, allowing users to manipulate and experiment with the model.
OpenAI released DALL-E 3 as a continuation of and upgrade over its previous DALL-E 2 text-to-image generator. OpenAI focused this iteration of DALL-E on reducing noisy outputs by building a bespoke image captioner that was used to recaption the dataset used to train the model.
The resulting model benchmarks admirably and shows significant improvement in understanding nuance and detail from text descriptions, leading to more accurate image outputs. Unlike DALL-E 2 before it, DALL-E 3 was directly embedded into OpenAI’s conversational AI chatbot, ChatGPT, allowing users with a ChatGPT Plus subscription to create images directly from ChatGPT.
Now produced by Google DeepMind and the Google AI research division, Imagen 2 is the latest text-to-image diffusion model offered by Google. Imagen 2 was rolled out on various Google platforms such as Bard, ImageFX, Google Search, and Vertex AI. Google differentiates this version of Imagen from its predecessor through a new “natural distribution” technique applied to its training data, instead of more pre-programmed techniques.
Moreover, Google developed enhanced image captions to improve Imagen’s understanding of the relationship between words and images. Google notes that this version of Imagen takes into account aesthetic stylings such as framing, exposure, and sharpness that help to make an image more preferable to the human eye.
Midjourney is an independent research lab located in San Francisco, California, that focuses its efforts on developing state-of-the-art text-to-image generators. As an independent lab, Midjourney does not follow the open-source philosophy of its competitors and does not allow consumers to review any documentation or code that would reveal how the model works. The lab follows a paid-service business model, offering a range of subscription plans that vary in performance and access depending on how much the consumer is willing to spend.
Amazon released Titan on its AWS Bedrock platform as a mixture of LLMs that allow consumers to generate text, summarize provided text, perform semantic search, and generate images from text. For its text-to-image generator, Amazon offers features like iterative prompting to further refine image outputs and provide multiple variations of the same scene. Users are also able to fine-tune and tailor the model with their own data/images to stylize the output to their preference.
Another differentiator of the Titan model is the invisible watermark that Amazon embeds in generated content in order to encourage responsible use and deter users from passing off synthetic images as human-derived work.
Building on the work carried out by Meta (formerly known as Facebook) in the paper “Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack”, which introduced its text-to-image model Emu, Meta quickly released an updated model, Emu Edit, before the end of 2023. Emu Edit differs slightly from Emu in that it adds editing and computer-vision robustness to the original model. It does so by allowing users to provide Emu Edit with multi-task prompts that instruct the model to make a wide breadth of changes to any of the images it has generated. This feature set was trained by formulating generative tasks that emulate the work normally carried out by a human editor.
Furthermore, alongside Emu Edit, Meta also released a benchmark environment that tests generative models on several image editing tasks.
As a joint project between Hugging Face and Stability AI, the text-to-image model aMUSEd was made available to the public. aMUSEd is an open-source text-to-image model based on the MUSE architecture. This derivative of MUSE is billed as a lightweight masked image model (MIM) that focuses on fast image generation. The MIM approach proves beneficial as it uses only about 10% of MUSE’s parameters and requires fewer inference steps, making it faster and more interpretable.
Most notably, the MIM nature of aMUSEd allows it to be fine-tuned to “learn” additional vision styles from a single image. Currently, the aMUSEd model can produce images at 256x256 and 512x512 resolutions.
At the University of Bonn, a team of researchers introduced the wavelet-packets method for detecting AI-generated images with the paper “Wavelet-Packets for Deep Fake Image Analysis and Detection”. With this seminal work, the researchers present a novel way to distinguish naturally derived images from images generated with GAN (generative adversarial network) models.
Before the publication of this paper, AI-generated image detection relied largely on the Fourier transform, which decomposes images into their sine/cosine components. The wavelet-packet approach instead trains classifiers on wavelet coefficients, which capture both spatial and frequency information, to detect differences between naturally derived and AI-generated images.
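A minimal sketch of the wavelet-packet idea, using the PyWavelets package, is shown below; the Haar wavelet, the decomposition level, and the mean/std features are illustrative choices rather than the paper’s exact configuration, and the downstream real-vs-GAN classifier is assumed to be trained separately.

```python
# Decompose an image into wavelet-packet sub-bands and summarize each band;
# the resulting feature vector would feed a real-vs-GAN classifier.
import numpy as np
import pywt

def wavelet_packet_features(gray_image, wavelet="haar", level=3):
    wp = pywt.WaveletPacket2D(data=gray_image, wavelet=wavelet, maxlevel=level)
    feats = []
    for node in wp.get_level(level):          # all sub-bands at the final level
        coeffs = np.abs(node.data)
        feats.extend([coeffs.mean(), coeffs.std()])
    return np.array(feats)

# Illustrative call on a random stand-in "image"; real usage would compute
# features for batches of natural and GAN-generated images.
image = np.random.rand(128, 128)
print(wavelet_packet_features(image).shape)
```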
The research paper titled “Towards the Detection of Diffusion Model Deepfakes” highlights the current state of AI detection when it comes to image data. This team of researchers from the Ruhr University Bochum contends that a majority of the current efforts and literature focus on detecting the use of GANs (generative adversarial networks) to synthesize images, leaving diffusion model synthetic images largely undetected.
It was determined that current detection models trained to detect GAN-generated images are unsuccessful in detecting AI involvement when the same images are generated using diffusion models. The team also found that those same detection models can be retrained on datasets of diffusion model-generated images to improve detection, with one caveat: DM-generated images produce fewer detectable artifacts than GAN-generated images, making them more difficult to detect.
In the paper “TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization”, researchers from Google and the University Federico II of Naples present a detection framework for more sophisticated, deep learning-based forms of AI-generated and manipulated images. The TruFor framework uses real (natural) image data to learn the artifacts left by a camera’s internal and external processing, establishing, through self-supervised training, a pattern that genuine images should follow.
Any data that falls outside those patterns is flagged as forged or manipulated. Experiments outlined in this work show that TruFor is able to reliably detect and localize artificially derived images formed using deep learning techniques.
Diffusion Reconstruction Error (DIRE) was conceived at the University of Science and Technology of China as a detection model that specifically targets artificial images generated by diffusion models. Presented in the paper “DIRE for Diffusion-Generated Image Detection”, DIRE measures the error between an input image and its reconstruction by a pre-trained diffusion model. The researchers find that diffusion-generated images can be largely recreated by diffusion models while naturally derived images cannot.
This supports the idea that DIRE can serve as an effective detection model for diffusion-generated images. Furthermore, testing DIRE on a benchmark environment containing eight separate diffusion models showed significant improvements over existing detection models.
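The core statistic itself is simple once the diffusion reconstruction is available; the sketch below only computes the error term and assumes the inversion and reconstruction with a pre-trained diffusion model happen upstream (not shown here).

```python
# DIRE-style reconstruction error: a low error suggests the input image was
# itself produced by a diffusion model, since such images reconstruct well.
import numpy as np

def dire(image, reconstruction):
    """Mean absolute per-pixel error; both arrays in [0, 1] with shape (H, W, C)."""
    return float(np.abs(image - reconstruction).mean())

# Illustrative call with random stand-in arrays in place of a real image and
# its diffusion-based reconstruction.
x = np.random.rand(256, 256, 3)
x_reconstructed = np.clip(x + 0.01 * np.random.randn(256, 256, 3), 0.0, 1.0)
print(dire(x, x_reconstructed))
```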
Stepwise Error for Diffusion-Generated Image Detection (SeDID) aims to detect artificial images generated through the use of Denoising Diffusion Probabilistic Models (DDPMs). Introduced in the paper “Exposing the Fake: Effective Diffusion-Generated Images Detection”, published by the University of Science and Technology of China, the SeDID model continues the work outlined in the DIRE model above and targets one of the most effective generation models, the DDPM.
The SeDID model combines statistical and neural network-based approaches into one model that detects denoising computation errors and other artifacts generally present in DDPM-generated images. Using a benchmark environment similar to the one outlined in the DIRE paper, SeDID was found to be superior to other existing detection methods.
WaveGrad 2 is a joint project between the Center for Language and Speech Processing at Johns Hopkins University and the Google Research “Brain Team”, announced in the paper “WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis”. The model is a non-autoregressive generative model for text-to-speech synthesis. It works by taking a phoneme sequence and refining it iteratively in order to generate an audio waveform.
The iterative process starts from Gaussian noise and generally refines over 50 iterations to progressively produce the audio sequence. Experiments conducted by the research team show that WaveGrad 2 is successful in generating high-fidelity audio.
Detect vocoder is an artificial audio detection model introduced by a research team from the University at Buffalo in the paper “AI-Synthesized Voice Detection Using Neural Vocoder Artifacts”. Much like detection models aimed at synthetically derived images, detect vocoder works to identify synthetically derived audio by searching for artifacts present in the given audio signals.
The team found that most audio generation models use a neural vocoder, a type of neural network that generates waveforms from time-frequency representations. Like other detection models, detect vocoder comprises a learning framework with a binary classifier trained on a dedicated dataset.
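A toy sketch of the general recipe (not the paper’s model) is given below: waveforms are mapped to log-mel spectrogram features and a small binary classifier is trained to separate the two classes. The sine-wave “data” and the logistic-regression classifier are purely illustrative.

```python
# Toy real-vs-synthetic audio classifier: log-mel spectrogram features plus a
# binary classifier (illustrative pipeline only).
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mel_features(waveform, sr=16000):
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
    return librosa.power_to_db(mel).mean(axis=1)   # average each band over time

# Invented toy data: "real" = noisy sine waves, "synthetic" = clean sine waves.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
real = [np.sin(2 * np.pi * f * t) + 0.1 * np.random.randn(sr) for f in (220, 330, 440)]
fake = [np.sin(2 * np.pi * f * t) for f in (220, 330, 440)]

X = np.stack([mel_features(w.astype(np.float32)) for w in real + fake])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```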
Make-A-Video was developed by Meta (formerly known as Facebook) and builds on the progress the company made with its text-to-image generation models. In the paper “Make-A-Video: Text-to-Video Generation Without Text-Video Data”, Meta proposes an approach for directly translating text-to-image generation techniques to text-to-video. The approach consists of decomposing and approximating temporal modules and applying them to a spatial-temporal pipeline that generates video through a video decoder and an interpolation model.
The main advantages of this approach are the reduced training time and the fact that the model does not require paired text-video data.
Research lab Runway ML presents its video generation model in the paper “Structure and Content-Guided Video Synthesis with Diffusion Models”. Named Gen-1, this video generation model uses latent video diffusion models to synthesize videos guided by text descriptions or reference images. This approach differs from past models that extended image generation techniques to video editing, which required prohibitively expensive pretraining of the model for each input.
Instead, the Gen-1 model is trained on both images and videos jointly while being designed to give users control over output characteristics. Furthermore, the team presents an additional factorizing approach that successfully animates images directly from user-provided text prompts.
Stability.ai added to its suite of AI generation models with the release of Stable Video, highlighted by the paper “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets”. The model mirrors Stability.ai’s text-to-image model in that it is based on diffusion models, this time focused on latent video. The paper presents a multistage approach for training latent video diffusion models: text-to-image pretraining, followed by video pretraining, and finally video finetuning.
Moreover, Stability.ai proposes a specific curation process, including image captioning and filtering strategies, designed to build pre-training datasets that train strong base models and generate high-quality videos.
Meta (formerly known as Facebook) continues its work toward accurate and efficient video generation, building off Make-A-Video with the release of the paper “Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning”. Unlike Meta’s previous video generation model, Emu Video works in multiple stages: it first uses a text-to-image model to generate an image from the text description, and then conditions on both that image and the text to synthesize a video.
The Emu Video model also allows multi-stage training and adjustments to the noise schedules used for diffusion.
Coming from the University of California, Berkeley, the paper “Revisiting Generalizability in Deepfake Detection: Improving Metrics and Stabilizing Transfer” looks to deepen the understanding of “generalizability” in the detection of AI-generated content. In the context of detection models, generalizability refers to a model’s ability to be successfully applied to datasets other than the one used to train it. The research team at Berkeley finds that current AI-generated video fails to be detected by existing diffusion-based detection models.
Thus, the team proposes a novel randomized algorithm that synthesizes videos with visually apparent artifacts within the human face. When detection models are trained on these compromised videos, it is shown that the models begin to learn based on “reality” rather than looking for a pattern of artifacts.
In the paper “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face”, a research team from Zhejiang University addresses the issue of getting LLMs to handle complicated tasks autonomously. Their solution, named HuggingGPT, is an LLM-powered agent that uses an LLM such as ChatGPT as a controller to connect the various AI models hosted in machine learning communities like Hugging Face in order to solve AI tasks.
More specifically, HuggingGPT takes in user requests, plans which models are best fit to fulfill them, executes the subtasks using the selected AI models, and returns a summarization of the output responses. In this way, HuggingGPT is able to leverage the wide-ranging capabilities of ChatGPT and the models on the Hugging Face platform across a variety of modalities such as vision, speech, and language.
Presented by the CISPA Helmholtz Center for Information Security in the paper “DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models”, DeFake is a machine learning classifier that detects synthetically derived images. Once those images are correctly identified as synthetic, they are attributed to the source AI model that originally generated them.
The CISPA team then conducted a variety of experiments on various text-to-image generation models such as DALL-E, Stable Diffusion, Latent Diffusion and others. The major finding from the experiments was that image generation prompts of a character length between 25 and 75 or with the topic of “person” were found to generate the most “authentic” synthetic images.
Hierarchical Multi-modal Manipulation Reasoning Transformer (HAMMER) specifically targets forgery traces or artifacts across multiple modalities. Announced by the Harbin Institute of Technology in Shenzhen, China, in the paper titled “Detecting and Grounding Multi-Modal Media Manipulation”, HAMMER is primarily trained using the DGM (Detecting and Grounding Multi-Modal Media Manipulation) dataset to fully capture the interaction between different modalities.
HAMMER works by learning from separate uni-modal encoders that are manipulation-aware. The team also introduced a benchmark environment that specifically evaluates manipulation in media.
As we have seen, over the past five years, technological advancements have run rampant not only in the generation of artificially derived audio, video, text, and images, but also in the detection models used to distinguish between “natural” and “fake” data. Although much of the limelight has been centered on large language models (LLMs) like ChatGPT or Google Gemini, the field of artificial intelligence (AI) has been equally influenced by revolutionary detection models like DetectGPT or TuringBench.
The constant hide-and-seek game currently happening as detection models furiously attempt to detect generated content from new LLMs underscores the need to balance the delicate relationship between creation and moderation when it comes to AI-generated content.