(the post is automatically translated by AI)
Introduction
MediaPipe is a collection of ML solutions developed and maintained by Google for computer vision tasks, including detection of multiple body parts [1]: face, hands, torso, hair, and more (Figure 1).
![]() |
|---|
| Figure 1 |
One of the most common applications is using the Holistic [2] solution for full-body skeleton detection. Because it captures both facial and body landmark data, it can substitute for traditional motion capture hardware (e.g., MoCap suits), which is especially useful for VTuber applications that don’t require high-precision motion.
Once we have the MediaPipe skeleton data, we can map it to a personal character model (e.g., a VRM model) and drive its skeleton in real time through Unity or other engines.
However, while there are YouTube demos showing this mapping in action, the underlying code and approach are rarely shared — often because of commercial software constraints.
In this article, we explain how we map a MediaPipe skeleton to a Unity humanoid skeleton, and we open-source the conversion code in hopes that the community can help improve it.
GanniPiece/MetU: A mapping from Mediapipe skeleton to Unity humanoid skeleton. (github.com)
If you’re already familiar with coordinate system conversions and 3D rotation, you can jump directly to the Implementation section.
Coordinate System Conversion
MediaPipe uses a right-handed coordinate system [3], while Unity uses a left-handed one [4]. Additionally, MediaPipe’s image-space origin is at the top-left of the image, whereas Unity’s screen-space origin is at the bottom-left.
![]() |
|---|
| Figure 2 |
Left: MediaPipe coordinate system. Right: Unity world coordinate system.
Note: MediaPipe’s `world pose landmark` and `pose landmark` outputs are both right-handed, but differ slightly. The former uses the midpoint between the person’s hips as the origin; the latter uses the top-left corner of the image.
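The difference between a top-left and a bottom-left image origin can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `image_to_bottom_left` and the 640×480 image size are made up for the example, and the input is assumed to be a normalized image-space point with x and y in the range 0 to 1.

```python
def image_to_bottom_left(x_norm, y_norm, width, height):
    """Map a normalized top-left-origin point to pixels with a bottom-left origin."""
    # x keeps its direction; only the scale changes.
    x_px = x_norm * width
    # y is flipped: the top row (y_norm = 0) ends up `height` pixels above the origin.
    y_px = (1.0 - y_norm) * height
    return (x_px, y_px)


# Hypothetical 640x480 image.
top_left = image_to_bottom_left(0.0, 0.0, 640, 480)
centre = image_to_bottom_left(0.5, 0.5, 640, 480)
```

The image centre stays where it is, while the top-left corner moves from y = 0 to y = 480, which is the vertical flip between the two conventions.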
3D Rotation
Computing rotations with dynamic Euler angles in 3D space is prone to Gimbal Lock [5]. Using static (world-space) Euler angles avoids this problem.
What’s the difference? In dynamic Euler angles, three rotation matrices $R_x$, $R_y$, $R_z$ are applied sequentially, each about an axis that has already been moved by the previous rotations. This means that when point $p$ is rotated about $y$ by 90°, the $z$ rotation axis lines up with the $x$ rotation axis, so two of the three rotations act about the same axis and one degree of freedom is lost. This is Gimbal Lock.
![]() |
|---|
| Figure 3 [5] |
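The lost degree of freedom can be checked numerically. The sketch below is plain Python rather than Unity code, and the helper names (`rot_x`, `euler_xyz`, `max_abs_diff`) are made up for this example. It composes $R = R_z R_y R_x$ and compares two different angle triples: once with the $y$ angle at 90° and once at 45°.

```python
import math


def rot_x(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_xyz(x_deg, y_deg, z_deg):
    """Compose R = Rz * Ry * Rx (the x rotation is applied to a point first)."""
    return matmul(rot_z(z_deg), matmul(rot_y(y_deg), rot_x(x_deg)))


def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# With y at 90 degrees, only the difference between the x and z angles matters:
# (30, 10) and (50, 30) share the same difference, so the matrices coincide.
locked_a = euler_xyz(30, 90, 10)
locked_b = euler_xyz(50, 90, 30)

# With y at 45 degrees, the same two triples give different rotations.
free_a = euler_xyz(30, 45, 10)
free_b = euler_xyz(50, 45, 30)
```

Here `max_abs_diff(locked_a, locked_b)` is zero up to floating-point error, while `max_abs_diff(free_a, free_b)` is clearly non-zero: at 90° two of the three angles control the same rotation.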
The solution is to use a fixed world coordinate system for all rotations. In Unity, this can be done with:
```csharp
using UnityEngine;

// Quaternion that rotates direction a onto direction b.
Quaternion rotation = Quaternion.FromToRotation(fromDirection_a, toDirection_b);
```
FromToRotation returns a Quaternion representing the rotation from vector $a$ to vector $b$. Append .eulerAngles to get Euler angles.
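To make the idea of a from-to rotation concrete, here is a minimal sketch in plain Python of the standard shortest-arc construction. It is not Unity’s implementation, and the names `from_to_rotation` and `rotate` are made up for this example; quaternions are stored as `(x, y, z, w)` tuples.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def from_to_rotation(a, b):
    """Unit quaternion (x, y, z, w) for the shortest rotation turning a onto b."""
    a, b = normalize(a), normalize(b)
    w = 1.0 + dot(a, b)
    if w < 1e-9:
        # a and b point in opposite directions: turn 180 degrees about
        # any axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = normalize(cross(a, helper))
        return (axis[0], axis[1], axis[2], 0.0)
    x, y, z = cross(a, b)
    norm = math.sqrt(x * x + y * y + z * z + w * w)
    return (x / norm, y / norm, z / norm, w / norm)


def rotate(q, v):
    """Rotate vector v by unit quaternion q."""
    qv, w = q[:3], q[3]
    t = tuple(2.0 * c for c in cross(qv, v))
    u = cross(qv, t)
    return tuple(v[i] + w * t[i] + u[i] for i in range(3))


q = from_to_rotation((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
rotated = rotate(q, (1.0, 0.0, 0.0))  # approximately (0, 1, 0)
```

Because the rotation is built directly from the two directions, no intermediate Euler angles are involved until the result is converted at the end.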
Implementation
With these two prerequisites understood, here’s how we map the two skeletons. The main steps are:
- Convert the coordinate axes
- Set the rotation angle
From Figure 2, the Unity world coordinate system and MediaPipe’s world pose landmark differ by a sign flip on every axis. So after retrieving MediaPipe landmark positions, we negate all coordinates to convert to Unity’s world space:
$$P_{Unity}(x,y,z) = -\alpha \cdot P_{MediaPipe}(x,y,z) = \alpha \cdot P_{MediaPipe}(-x, -y, -z)$$
where $\alpha$ is a positive constant representing the scale ratio between the two coordinate systems.
For example, Unity’s positive $y$-axis (green arrow) corresponds to MediaPipe’s negative $y$-axis, and similarly for $x$ and $z$. So a MediaPipe point at $(-1, -1, -1)$ maps to $\alpha(1, 1, 1)$ in Unity.
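The conversion is small enough to write out directly. The following plain-Python sketch assumes a hypothetical helper name `mediapipe_to_unity` and treats $\alpha$ as a keyword argument; the scale value 2.0 is only an example.

```python
def mediapipe_to_unity(point, alpha=1.0):
    """Negate every axis and apply the positive scale factor alpha."""
    x, y, z = point
    return (-alpha * x, -alpha * y, -alpha * z)


# The example from the text, with an assumed scale of alpha = 2.
converted = mediapipe_to_unity((-1.0, -1.0, -1.0), alpha=2.0)  # (2.0, 2.0, 2.0)
```

Applying the function twice with `alpha=1.0` returns the original point, since negating all three axes is its own inverse.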
After coordinate conversion, we calculate the rotation between the Unity skeleton vector and the corresponding MediaPipe vector. For instance, to set the upper arm rotation, we compute the (shoulder → elbow) vector in MediaPipe, compute the same vector in the Unity model, apply FromToRotation to get the rotation, and rotate the Unity bone accordingly.
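The mapping for each joint, written as a C# method (it belongs inside a class such as a MonoBehaviour):
```csharp
using UnityEngine;
void MapBoneFromMediaPipeToUnity(Vector3 mediaPipeVec, Vector3 unityVec, Transform target)
{
    // Negate every axis to express the MediaPipe direction in Unity world space.
    Vector3 mediaPipeVecInUnity = -mediaPipeVec;
    Quaternion rotation = Quaternion.FromToRotation(unityVec, mediaPipeVecInUnity);
    target.Rotate(rotation.eulerAngles, Space.World); // Rotate takes Euler angles
}
```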
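As a worked example of the per-joint step, the plain-Python sketch below computes an upper-arm direction from two landmarks, converts it to Unity space, and measures how far the bone has to turn. All numbers are hypothetical: the landmark positions are invented, and the rest direction assumes a T-pose arm lying along the negative $x$-axis.

```python
import math


def subtract(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def negate(v):
    return tuple(-c for c in v)


def angle_between_deg(a, b):
    """Unsigned angle between two vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical MediaPipe world landmarks in metres (y grows downwards).
mp_shoulder = (0.20, -0.40, 0.00)
mp_elbow = (0.20, -0.10, 0.00)

# Shoulder -> elbow direction in MediaPipe space, then in Unity space.
mp_bone = subtract(mp_elbow, mp_shoulder)
unity_bone = negate(mp_bone)

# Assumed rest direction of the Unity upper arm (T-pose).
unity_rest = (-1.0, 0.0, 0.0)
turn_deg = angle_between_deg(unity_rest, unity_bone)  # about 90 degrees
```

In this example the arm hangs straight down in the captured pose, so the converted direction points along Unity’s negative $y$-axis and the bone has to turn by roughly 90° from its rest pose.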
Conclusion
In this article, we introduced MediaPipe as a real-time skeleton detection solution and provided a simple, understandable method for mapping its skeleton to Unity’s humanoid skeleton. We’ve also open-sourced the conversion code at GanniPiece/MetU — feel free to explore and contribute.
References
1. ML Solutions in MediaPipe: https://google.github.io/mediapipe/#ml-solutions-in-mediapipe
2. Holistic Solution: https://google.github.io/mediapipe/solutions/holistic.html
3. Right-handed coordinate system: https://learn.microsoft.com/en-us/windows/uwp/graphics-concepts/coordinate-systems
4. Left-handed coordinate system: https://learn.microsoft.com/en-us/windows/uwp/graphics-concepts/coordinate-systems
5. Gimbal Lock: https://en.wikipedia.org/wiki/Gimbal_lock


