
Voice and Gesture Control on a Cobot: IIT Indore × Addverb Syncro 5

Syncro | Research | Physical AI | 5 min read | 28 Apr 2026


Eight students at IIT Indore built a dual-control framework on the Addverb Syncro 5, combining autonomous, LLM-driven voice commands with real-time glove teleoperation on a single collaborative robot. 

In the first demonstration, a student says aloud, "Pick up the bottle." No buttons. No code. No setup. The Syncro 5 cobot hears the words, interprets them, sweeps the table, finds the bottle, aligns to it, descends, grasps, retracts. The student never touches the robot. 


In the second, that same student slips on a glove. Each finger bend, each squeeze of the thumb, becomes a command. Move the hand forward and the arm moves forward. Tilt it up and the arm moves up. There's no joystick, no keyboard. Just a gesture, and the robot follows. 


Same robot. Same table. Two completely different ways of being in command. 


This is what a team of eight students at IIT Indore built in collaboration with Addverb: a complete dual-control framework on the Addverb Syncro 5 that lets an operator switch between full autonomy and intuitive human-in-the-loop teleoperation on the same hardware, through the same backend, inside the same session. 

Voice Control and Gesture Control: Two Ways to Command the Cobot


The IIT Indore team didn't pick the easy path. They built every layer themselves and offered the operator a choice of how to engage with it. 



In Voice Control mode, you simply speak. A large language model interprets the intent, computer vision finds the object in the frame, and a depth camera confirms it's within reach. The cobot closes the loop on its own and completes the grasp. From the operator's side, the only act of "control" is a sentence in plain English. It's the closest a collaborative robot has come to behaving like an assistant rather than a tool. 


In Gesture Control mode, you wear a custom wearable the team designed and assembled, fitted with sensors that read finger position and thumb pressure in real time. A machine-learning classifier maps your gestures to motion, and the arm moves with you. There's no calibration step before you start; the model learns as you use it and adapts to whoever's wearing the glove. 


The two modes share everything underneath: the same SDK, the same control server, the same safety configuration. The operator chooses the mode that fits the moment. 

Why Dual-Control Cobots Matter for Industry


The most interesting result here isn't either mode taken in isolation. Plenty of labs have built voice-controlled robotic arms. Plenty have built teleoperated ones. What's rare is the dual-control paradigm itself: the seamless switch between full autonomy and human-in-the-loop control, on a single shared infrastructure. 


That switch matters in the real world. Warehouses, manufacturing lines, and research labs don't run on one mode. A pick-and-place in a structured location should be autonomous; a delicate handover, an unfamiliar object, an exception that wasn't anticipated — those need a human in the loop. Most current automation forces a hard architectural choice between the two. This project demonstrates it doesn't have to. 


There's a deeper signal here too. When a cobot can interpret intent, find an object, confirm it's reachable, and read your hand in real time, the operator's job stops looking like programming a machine. It starts looking like collaborating with one. That shift is what the next generation of cobot deployments in intralogistics, in advanced manufacturing, and in research will be built on. 


It's also the kind of work that's only possible when students get their hands on a real industrial-grade collaborative robot. The dual-control paradigm only holds together because the underlying SDK, control loop, gripper drivers, and safety configuration all behave the way Addverb engineered them to. Take any of those layers away and the abstraction collapses. 

Syncro 5 in Academic Robotics Research


This is the kind of work the Addverb Syncro 5 cobot was designed to enable. Across the Addverb Academic Series so far, we've seen the platform support vision-based motion retargeting at IISc Bangalore, bimanual handover research at IIT Gandhinagar, and now LLM-powered voice and gesture control at IIT Indore. The pattern is consistent: students and faculty take a real industrial cobot, treat it as a research surface, and push it into territory the spec sheet doesn't cover. 


Each project is different. The platform underneath is the same. 



Build on Syncro 5 


If you're a faculty member, researcher, or student building on the Addverb Syncro 5 cobot or thinking about it, we want to hear from you. The team's repositories are open and live (links below), and we're actively supporting the next set of academic robotics projects. 


Write to us at automate@addverb.com, and explore open-source libraries, project templates, and academic resources at community.addverb.ai.  

Technical Appendix: Architecture & Implementation Details 

This section is intended for readers interested in the technical depth of the project. All four open-source repositories are linked at the end. 


Autonomous Voice Control Pipeline 


  • Natural-language command interpreted by Llama 3 via the Groq inference API 


  • YOLO object detection running on aligned RGB frames from an Intel RealSense D435i depth camera 


  • Depth-guided closed-loop visual servoing — pixel-space error driven to zero before approach 


  • Depth verification at the object centroid before the grasp is triggered 


  • Grasp, lift, and retract executed via the Cobot Python SDK 
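
The listing below is a minimal Python sketch of how these stages could fit together. The Groq and Ultralytics calls follow those libraries' public APIs; the camera capture and the `cobot.grasp_at(...)` call are placeholders, since the actual Cobot Python SDK interface and the team's closed-loop servoing logic live in the repositories, not here.

```python
# Hypothetical sketch of the voice-to-grasp decision logic.
# Assumes GROQ_API_KEY is set and YOLO weights are available locally;
# camera capture and the Cobot Python SDK grasp call are stubbed out.
import os
import json
import numpy as np
from groq import Groq
from ultralytics import YOLO

def parse_intent(command: str) -> str:
    """Ask Llama 3 (via the Groq API) which object the operator wants."""
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama3-70b-8192",  # assumed model id
        messages=[
            {"role": "system",
             "content": 'Extract the target object from the pick command. '
                        'Reply only with JSON: {"object": "<name>"}.'},
            {"role": "user", "content": command},
        ],
    )
    return json.loads(resp.choices[0].message.content)["object"]

def locate_target(color_image: np.ndarray, target: str, model: YOLO):
    """Run YOLO on the aligned RGB frame; return the target's pixel centroid."""
    result = model(color_image)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] == target:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            return int((x1 + x2) / 2), int((y1 + y2) / 2)
    return None

def try_grasp(color_image, depth_at, cobot, command, model):
    """One cycle: intent -> detection -> depth check -> grasp trigger.
    The real system servoes the pixel-space error to zero before approaching;
    this single-shot version only checks the centroid once."""
    target = parse_intent(command)
    centroid = locate_target(color_image, target, model)
    if centroid is None:
        return False
    depth_m = depth_at(*centroid)          # metres at the object centroid
    if not 0.0 < depth_m < 0.8:            # assumed reachability threshold
        return False
    cobot.grasp_at(centroid, depth_m)      # placeholder for the SDK call
    return True
```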


Glove Teleoperation Pipeline 


  • Custom wearable built on an ESP32 microcontroller with flex sensors (finger bend) and a thumb pressure sensor 


  • Sensor windows streamed to a Python service over MQTT 


  • Statistical features (80th percentile, 20th percentile, mean) extracted per channel 


  • Online Hoeffding Tree classifier maps features to one of seven gesture classes — Neutral, Forward, Backward, Left, Right, Up, Down 


  • Online learning removes the need for offline dataset collection and lets the classifier adapt continuously to each operator 
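
A minimal sketch of the streaming side follows, assuming the glove publishes comma-separated sensor readings to an MQTT topic. The topic name, payload format, window size, and broker address are illustrative stand-ins; the river calls follow that library's online-learning API.

```python
# Hypothetical sketch of the glove-to-gesture classifier.
# Assumes the ESP32 publishes comma-separated flex/pressure readings to
# MQTT topic "glove/raw"; topic and payload format are illustrative only.
from collections import deque

import numpy as np
import paho.mqtt.client as mqtt
from river import tree

GESTURES = ["Neutral", "Forward", "Backward", "Left", "Right", "Up", "Down"]
WINDOW = 25                                   # samples per feature window (assumed)

clf = tree.HoeffdingTreeClassifier()          # online learner: no offline dataset
window = deque(maxlen=WINDOW)

def extract_features(samples: np.ndarray) -> dict:
    """Per-channel 80th percentile, 20th percentile, and mean, as river features."""
    feats = {}
    for ch in range(samples.shape[1]):
        col = samples[:, ch]
        feats[f"p80_{ch}"] = float(np.percentile(col, 80))
        feats[f"p20_{ch}"] = float(np.percentile(col, 20))
        feats[f"mean_{ch}"] = float(col.mean())
    return feats

def on_message(client, userdata, msg):
    window.append([float(v) for v in msg.payload.decode().split(",")])
    if len(window) < WINDOW:
        return
    x = extract_features(np.array(window))
    gesture = clf.predict_one(x) or "Neutral"
    print("predicted:", gesture)
    # During adaptation the operator confirms or corrects the label and the
    # tree keeps learning from each window:
    # clf.learn_one(x, confirmed_label)

client = mqtt.Client()
client.on_message = on_message
client.connect("192.168.1.50")                # assumed broker address
client.subscribe("glove/raw")
client.loop_forever()
```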


System Architecture (Five Layers) 


  1. Human Interface — natural-language input and sensor glove 


  2. Processing & Decision — LLM intent parsing, vision pipeline, gesture classifier 


  3. Communication — custom ASCII-over-TCP protocol on port 5000, MQTT, optional Bluetooth (rfcomm) 


  4. Backend / Real-Time Control — C++17 dual-threaded TCP server inside Docker on the cobot controller, motors driven over EtherCAT (SOEM) 


  5. Hardware & Simulation — physical Syncro 5 plus a complete ROS 2 + Gazebo digital twin 
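
To make the communication layer concrete, here is a minimal Python client for the ASCII-over-TCP link on port 5000. The command grammar shown (`MOVE dx dy dz`) is purely illustrative; the actual protocol strings are defined by the team's C++17 backend server.

```python
# Hypothetical client for the ASCII-over-TCP control link on port 5000.
# The command grammar below ("MOVE dx dy dz\n") is illustrative only;
# the real protocol is defined by the project's backend server.
import socket

class CobotLink:
    def __init__(self, host: str = "192.168.1.10", port: int = 5000):
        self.sock = socket.create_connection((host, port), timeout=2.0)

    def send(self, command: str) -> str:
        """Send one newline-terminated ASCII command and read one reply line."""
        self.sock.sendall((command + "\n").encode("ascii"))
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = self.sock.recv(256)
            if not chunk:
                break
            reply += chunk
        return reply.decode("ascii").strip()

    def jog(self, dx: float, dy: float, dz: float) -> str:
        return self.send(f"MOVE {dx:.4f} {dy:.4f} {dz:.4f}")

if __name__ == "__main__":
    link = CobotLink()
    print(link.jog(0.01, 0.0, 0.0))   # nudge 1 cm along x (assumed units: metres)
```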


Software Stack 


  • LLM / NLU: Llama 3 via Groq API 


  • Computer Vision: Ultralytics YOLO, OpenCV, pyrealsense2 


  • Streaming ML: river.tree.HoeffdingTreeClassifier 


  • Motion Planning & Control: MoveIt 2 (Servo), JointTrajectoryController, GripperActionController 


  • Simulation: ROS 2 Humble, Gazebo Ignition, gz_ros2_control, ros_gz_bridge 


  • Backend Libraries: Addverb system_manager, Orocos KDL, Eigen3, SOEM (EtherCAT) 


  • Languages: Python 3.10+ (client, vision, ML), C++17 (backend server) 
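
As an illustration of how a recognised gesture could become arm motion through MoveIt 2 Servo, here is a small rclpy sketch that publishes Cartesian twists. The topic name and frame are common Servo defaults and the velocity values are arbitrary; the project's actual mapping and configuration may differ.

```python
# Hypothetical rclpy node turning gesture classes into MoveIt 2 Servo twists.
# The topic name and frame below are typical Servo defaults, not the
# project's verified configuration.
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import TwistStamped

# Assumed mapping from the seven gesture classes to Cartesian velocities (m/s).
GESTURE_TO_TWIST = {
    "Neutral":  (0.0, 0.0, 0.0),
    "Forward":  (0.05, 0.0, 0.0),
    "Backward": (-0.05, 0.0, 0.0),
    "Left":     (0.0, 0.05, 0.0),
    "Right":    (0.0, -0.05, 0.0),
    "Up":       (0.0, 0.0, 0.05),
    "Down":     (0.0, 0.0, -0.05),
}

class GestureServoBridge(Node):
    def __init__(self):
        super().__init__("gesture_servo_bridge")
        self.pub = self.create_publisher(
            TwistStamped, "/servo_node/delta_twist_cmds", 10)

    def publish_gesture(self, gesture: str):
        vx, vy, vz = GESTURE_TO_TWIST.get(gesture, (0.0, 0.0, 0.0))
        msg = TwistStamped()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "base_link"      # assumed planning frame
        msg.twist.linear.x = vx
        msg.twist.linear.y = vy
        msg.twist.linear.z = vz
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = GestureServoBridge()
    node.publish_gesture("Forward")            # e.g. output of the glove classifier
    rclpy.spin_once(node, timeout_sec=0.1)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```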


Hardware Platform 


  • Addverb Syncro 5 — 6-DOF industrial collaborative robot 


  • Intel RealSense D435i — aligned RGB-D streams at up to 90 fps, mounted at the end-effector 


  • Interchangeable grippers — Feetech, Dynamixel, DH (driven through Addverb's gripper framework) 


  • Custom ESP32 sensor glove — flex + thumb pressure sensors, MQTT over Wi-Fi 
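
For the camera, a short pyrealsense2 sketch of the aligned RGB-D capture the voice pipeline relies on, including deprojection of a detected pixel into a 3D point. The 640×480 at 30 fps configuration is illustrative; the D435i's 90 fps mode requires a lower depth resolution.

```python
# Sketch of aligned RGB-D capture from the D435i with pyrealsense2,
# plus deprojection of a pixel (e.g. a YOLO centroid) into a 3D point.
import numpy as np
import pyrealsense2 as rs

pipe, cfg = rs.pipeline(), rs.config()
cfg.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
cfg.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipe.start(cfg)
align = rs.align(rs.stream.color)             # re-project depth onto the RGB frame

try:
    frames = align.process(pipe.wait_for_frames())
    depth_frame = frames.get_depth_frame()
    color = np.asanyarray(frames.get_color_frame().get_data())

    u, v = 320, 240                           # stand-in for a detected centroid
    depth_m = depth_frame.get_distance(u, v)  # metres at that pixel
    intrin = depth_frame.profile.as_video_stream_profile().get_intrinsics()
    point = rs.rs2_deproject_pixel_to_point(intrin, [u, v], depth_m)
    print(f"3D point in the camera frame: {point}")  # [x, y, z] in metres
finally:
    pipe.stop()
```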


Validation 


  • Full pipeline validated on the physical Addverb Syncro 5 


  • Parallel validation in the ROS 2 + Gazebo digital twin, with detachable-joint grasping and an end-effector camera plugin 


  • Both modes operate through a single shared SDK and control server, allowing seamless switching mid-session 


Roadmap 


  • Voice-to-text front-end so the operator can speak directly to the cobot rather than type 


  • Two-handed glove operation for richer six-axis control 


  • Tighter integration of voice and gesture modes within a single task, e.g. autonomous approach with human-corrected final placement 


  • Extension of the LLM intent layer to multi-step tasks beyond single-object retrieval 


Open-Source Repositories 





 

By Varad Pendse, Yash Bhamare, Satyam Ashtikar, Hrishab Mittal, Keshav N., Dhananjay Dhumal, Sinam J., Atharva Chavan (IIT Indore), in collaboration with Addverb Technologies

Part of the Addverb Academic Series. Earlier articles featured RBCCPS / IISc Bangalore (vision-based motion retargeting) and IIT Gandhinagar (bimanual handover via the USAC-DS framework). Read the full series at community.addverb.ai. 

