TML-Interaction-Small announced: Artificial intelligence will now listen and respond at the same time

Thinking Machines Lab has announced a research preview of new interaction models that move human-AI interaction beyond the classic turn-based chat structure. The company’s model receives audio, video and text at the same time, keeps listening while the user is talking, can interject when necessary, reacts to visual cues and carries on longer operations through tool use in the background.

AI chat is moving from a wait-your-turn structure to real-time collaboration. Thinking Machines Lab describes the main problem in current AI systems as a collaboration bottleneck. In many of today’s models, the user first finishes speaking or typing, and only then does the model start producing a response. The model cannot take in new information while it responds; if the user cuts in, the process is interrupted.

Although AI built on this structure is powerful, it is limited in its ability to act as a colleague working alongside humans in real time. The company’s new approach addresses this problem within the model’s own architecture. The system, called an interaction model, treats interaction as the model’s basic way of working rather than as a feature added later with external tools. As a result, the model works not only as an assistant waiting for commands but also as a collaborator that follows the user’s work in real time and reacts to the flow of the conversation.

The highlight of the new model is that it processes audio, video and text as continuously flowing data. Thinking Machines Lab used a multi-stream, micro-turn-based structure for this. Instead of waiting for the user to finish a full sentence, as classical systems do, the model processes input and output together in small chunks of 200 milliseconds. In this way, silence, interruptions, overlapping speech and visual changes become part of the model’s context.
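
The micro-turn idea can be illustrated with a minimal sketch. It is not the company’s implementation: the `MicroTurn` class, the `bucket_events` function and the event format are hypothetical names chosen for illustration, and only the 200-millisecond chunk length comes from the announcement. The sketch groups timestamped input and output events into fixed windows and keeps empty windows, so silence and overlap remain visible in the sequence.

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # chunk length stated in the announcement


@dataclass
class MicroTurn:
    index: int                                   # chunk number since session start
    inputs: list = field(default_factory=list)   # user activity in this window
    outputs: list = field(default_factory=list)  # model output in this window


def bucket_events(events, chunk_ms=CHUNK_MS):
    """Group (timestamp_ms, direction, payload) events into fixed windows.

    direction is "in" (user audio/video/text) or "out" (model output).
    Windows without events are kept, so silence stays part of the sequence.
    """
    if not events:
        return []
    last_index = max(t for t, _, _ in events) // chunk_ms
    turns = [MicroTurn(i) for i in range(last_index + 1)]
    for t, direction, payload in sorted(events, key=lambda e: e[0]):
        target = turns[t // chunk_ms]
        (target.inputs if direction == "in" else target.outputs).append(payload)
    return turns


events = [
    (0, "in", "user: so the plan is"),
    (250, "out", "model: mm-hm"),   # brief feedback while the user keeps talking
    (300, "in", "user: to ship on"),
    (900, "in", "user: Friday"),    # windows 2 and 3 contain only silence
]
turns = bucket_events(events)
for turn in turns:
    print(turn.index, turn.inputs, turn.outputs)
```

In this toy run, window 1 holds both user speech and model feedback, which is the overlapping-speech case a strictly turn-based structure cannot represent.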

This architecture allows the AI to behave more naturally in real-time conversations. Without a separate dialogue management component, the model tracks whether the user is still thinking, is handing the conversation over, is correcting themselves or is waiting for a response. End-of-turn prediction, which voice assistants commonly handle as a separate step, thus becomes a behavior of the model itself.

The interaction model can act before the user has finished speaking. In the examples given by the company, the model interjects verbally or visually depending on the context, gives brief feedback without interrupting the user, speaks at the same time as the user, and works simultaneously with the user in scenarios such as live translation. The model can also directly perceive elapsed time. This makes a significant difference in scenarios that depend on timing, such as exercise, practice, live explanation and guidance.

Another important aspect of the system is that it can call tools during a real-time conversation. The model can simultaneously search, browse the web, drive, or generate an interface while continuing to talk and listen to the user. As soon as the results are ready, they are passed to the user at an appropriate moment in the conversation. The AI thus becomes part of an ongoing work session rather than producing a one-off response.

Thinking Machines Lab states that the system consists of two main parts. The interaction model manages the real-time interaction, while longer reasoning, planning, tool use and background operations are handled by the background model. The interaction model keeps the conversation going without losing contact with the user; it also weaves the results from the background model into the natural flow of the conversation. This structure narrows the gap between fast-response models and strong reasoning models.

The user gets low-latency interaction, as if talking to a real-time assistant; for more complex tasks, the background model performs deeper processing. The two systems share the same context. The background model therefore does not receive just an isolated query, but works with a richer package that includes the entire conversation. Thinking Machines Lab announced that the model uses an “encoder-free early fusion” approach, which does not rely on separate large encoders on the audio and video sides.
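
The division of labor between the two models can be sketched as follows. This is an illustration under stated assumptions, not the company’s design: `background_model` and `interaction_loop` are hypothetical stand-ins, the sleep calls merely simulate elapsed time, and the "look up" trigger is a toy rule. The point it shows is that the fast loop keeps responding while a slower task runs, and that the slower task receives the conversation so far rather than an isolated query.

```python
import asyncio


async def background_model(context, task):
    """Hypothetical stand-in for the slower reasoning and tool-use model."""
    await asyncio.sleep(0.05)  # pretend this is a long search or plan
    # It is given the conversation so far, not only the task string.
    return f"result for {task!r} (saw {len(context)} context items)"


async def interaction_loop(user_chunks):
    context = []    # conversation shared with the background model
    spoken = []     # what the interaction model says, in order
    pending = None  # at most one background task in this toy version
    for chunk in user_chunks:
        context.append(("user", chunk))
        if "look up" in chunk and pending is None:
            # Hand off without blocking the conversation.
            pending = asyncio.create_task(background_model(list(context), chunk))
            spoken.append("Sure, checking that now.")
        else:
            spoken.append("mm-hm")
        await asyncio.sleep(0.02)  # stands in for one real-time chunk
        if pending is not None and pending.done():
            spoken.append(pending.result())  # deliver the result mid-conversation
            pending = None
    if pending is not None:
        spoken.append(await pending)  # deliver late results at the end
    return spoken


spoken = asyncio.run(interaction_loop([
    "I'm planning a trip",
    "to Ankara tomorrow,",
    "please look up the train times",
    "thanks",
    "okay",
    "still there",
]))
print("\n".join(spoken))
```

Exactly when the result appears in the output depends on timing, which mirrors the article’s description of results being delivered once they are ready rather than at a fixed turn boundary.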

Audio is represented in dMel format and processed with a lightweight embedding layer. Images are divided into 40×40 patches and encoded with an hMLP. A flow head is used for audio output. All components are trained from scratch together with the transformer. Special optimizations were also made on the inference side for real-time operation. Since 200-millisecond chunks require very frequent and small operations, the standard request handling in existing LLM inference libraries is not sufficient.

That’s why Thinking Machines Lab used a method called streaming sessions. In this method, the client sends each 200-millisecond chunk as a separate request, and the server appends these chunks to a persistent sequence in GPU memory. The company stated that it has also contributed a version of this approach to SGLang. On the safety side, the distinct risks of a real-time interface were also taken into account. Thinking Machines Lab used spoken-refusal and over-refusal training data so that the model gives refusals that sound natural but remain clear during conversation.
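
The streaming-session method described above can be sketched in a few lines. This is a toy model of the idea, not the company’s server or the SGLang API: `StreamingSessionServer` and its methods are hypothetical, a Python list stands in for the persistent sequence in GPU memory, and the token lists are made-up data. It shows why keeping state on the server matters: each request carries only the new chunk instead of the whole history.

```python
class StreamingSessionServer:
    """Toy server that keeps per-session state between requests."""

    def __init__(self):
        self._sessions = {}        # session_id -> accumulated sequence (stands in for GPU memory)
        self.tokens_processed = 0  # total tokens received across all requests

    def open(self, session_id):
        self._sessions[session_id] = []

    def append_chunk(self, session_id, chunk_tokens):
        """Handle one request: only the new chunk is sent and appended."""
        sequence = self._sessions[session_id]
        sequence.extend(chunk_tokens)
        self.tokens_processed += len(chunk_tokens)
        return len(sequence)  # current length of the persistent sequence

    def close(self, session_id):
        return self._sessions.pop(session_id)


server = StreamingSessionServer()
server.open("demo")
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # one small token list per 200 ms chunk
lengths = [server.append_chunk("demo", c) for c in chunks]

# Without a persistent session, each request would resend the full history so far.
resend_cost = sum(sum(len(c) for c in chunks[: i + 1]) for i in range(len(chunks)))
print(lengths, server.tokens_processed, resend_cost)
```

With five requests per second, the gap between sending only new chunks and resending the history widens with every chunk, which is consistent with the article’s point that standard request handling is a poor fit for this workload.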

Multi-turn refusal data was generated with an automated red-teaming system to maintain safety in long conversations. The company has tried to keep the model’s voice-based behavior consistent with its text-based safety limits. The model used in the research preview is called TML-Interaction-Small. Thinking Machines Lab states that this model delivers strong results on interaction quality as well as on intelligence and instruction following.

On the FD-bench v1.5 test, the model’s average interaction quality was reported as 77.8. On FD-bench v1, a turn-taking latency of 0.40 seconds was reported. The company said that these results show the model stands out for both high interaction quality and low latency. It also said that the model differs significantly from existing real-time systems on some new interaction tests. On the TimeSpeak test, TML-Interaction-Small achieved 64.7 percent macro accuracy, while the figure listed for GPT realtime-2.0 minimal was 4.3 percent.

On the CueSpeak test, TML-Interaction-Small scored 81.7 percent and GPT realtime-2.0 minimal scored 2.9 percent. On tests that involve following visual cues, Thinking Machines Lab’s model scored higher than systems that stayed silent or responded at the wrong time. TML-Interaction-Small is a mixture-of-experts (MoE) model with 276 billion parameters, of which 12 billion are active. Thinking Machines Lab says that interaction quality will improve with model scale, but that larger pre-trained models are currently too slow for this real-time structure.

The company announced that it will release larger models later in the year. It also shared the limitations of the research openly. Continuous audio and video make the context grow quickly. The current streaming-session design works for short to medium-length interactions, but context management for very long sessions is still an area for improvement. Low-latency audio and video streaming also requires a reliable connection.

When connection quality drops, the experience deteriorates significantly. Thinking Machines Lab will open the interaction models research preview with limited access in the coming months. Wider availability will follow later in the year. The company will also provide research support for developing new evaluation methods for human-AI collaboration and interaction quality.