Warning: MPS on Intel macOS with an Intel iGPU can produce incorrect results (because of Apple's MPS library).
A reproduction is included at the end of this post.
For the record, you don't need to read this post just to use an iGPU, because Intel iGPUs on a Mac are very slow.
You can check Intel iGPU performance numbers on the official OpenVINO website.
On Windows and Linux they can reportedly deliver up to a 2x speedup, but on a Mac they end up about 2x slower instead.
This post is for users who want to use a dedicated GPU.
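If you're not sure which GPUs your machine has, system_profiler lists them; the chipset names below are just examples, not what you'll necessarily see:
system_profiler SPDisplaysDataType
# Chipset Model: Intel Iris Plus Graphics 655   (integrated)
# Chipset Model: Radeon Pro 555X                (dedicated)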
For some reason, PyTorch does not support Metal acceleration on Intel Macs out of the box.
To use MPS, you need to make a few small changes to the PyTorch source and build it yourself.
First, run the following commands:
git clone https://github.com/pytorch/pytorch.git ~/ws/pytorch
cd ~/ws/pytorch
git checkout v2.1.1
git submodule update --init --recursive
Then apply the same changes to the PyTorch source as in the commit linked below:
https://github.com/chengzeyi/pytorch-intel-mps/commit/16f1e0389145be8460f243d279b7e5642eb214d7
Note that on v2.1.1 the edits look slightly different; the following changes work:
// https://github.com/pytorch/pytorch/blob/4c55dc50355d5e923642c59ad2a23d6ad54711e7/aten/src/ATen/mps/MPSAllocator.h
// #L244
// m_scalar_pool(m_device, UsageFlags::SMALL | UsageFlags::SHARED | UsageFlags::SCALAR),
m_scalar_pool(m_device, UsageFlags::SMALL | UsageFlags::PRIVATE | UsageFlags::SCALAR),
// https://github.com/pytorch/pytorch/blob/4c55dc50355d5e923642c59ad2a23d6ad54711e7/aten/src/ATen/mps/MPSDevice.mm
// #L96
// if (![device isLowPower]) { // exclude Intel GPUs
if (true) { // include Intel GPUs
  _mtl_device = [device retain];
  break;
}
// https://github.com/pytorch/pytorch/blob/4c55dc50355d5e923642c59ad2a23d6ad54711e7/aten/src/ATen/native/mps/OperationUtils.mm
// #L391
// Scalar pools are only supported on devices with unified memory
// if (mpsStream->device().hasUnifiedMemory) {
if (false) { // force the fallback path, since Intel GPUs have no unified memory
  scalar.buffer = getIMPSAllocator()->allocScalarBufferWithValue(&scalar.value, scalar.size);
  result = [[[MPSGraphTensorData alloc] initWithMTLBuffer:scalar.getMTLBuffer()
                                                    shape:@[ @1 ]
                                                 dataType:getMPSScalarType(scalar.type)] autorelease];
For the third change: on an Intel Mac the MTLStorageMode option must not be set to private, so a different path has to be taken. This part wasn't fixed in the linked repository, so I changed it myself on top of the commit.
Then run the following commands to build a wheel:
conda create -n pytorch-2.1.1-py11 python=3.11
conda activate pytorch-2.1.1-py11
pip install -r requirements.txt
conda install -y mkl mkl-include
conda install -y pkg-config libuv
CPLUS_INCLUDE_PATH=$(python -c "import numpy as np; print(np.get_include())") \
MAX_JOBS=8 \
MACOSX_DEPLOYMENT_TARGET=14.1 \
USE_MPS=1 \
python setup.py bdist_wheel -p macosx_14_0_x86_64
For MACOSX_DEPLOYMENT_TARGET, run sw_vers in a terminal and use the number listed under ProductVersion.
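For example (your numbers will differ):
sw_vers
# ProductName:    macOS
# ProductVersion: 14.1
# BuildVersion:   23B74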
Install the generated wheel with pip.
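The wheel lands in dist/, and its exact filename depends on the version and platform tag, so a glob is the simplest way to install it (path shown is illustrative):
pip install dist/torch-*.whl
Then run the command below to confirm that MPS works without issues: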
python -c "import torch; print(torch.backends.mps.is_available(), torch.backends.mps.is_built()); torch.randn(1, 3).to('mps')"
# True True
If the above command fails, run the command below and check that the installed Python targets macOS 12.0 or higher:
python -c "import platform; print(platform.platform())"
# macOS-14.1-x86_64-i386-64bit
Note that even if your Mac is on macOS 12.0 or later, the output may still read macOS-10.9-x86_64-i386-64bit. That means the Python binary you're running was built with macOS 10.9 as its minimum version.
In that case, it's easiest to just build Python yourself.
On a Mac, Python is built with the commands below.
For the record, I deleted every existing Python binary before proceeding because I didn't want anything getting in the way. 👻
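The configure flags below pull gdbm and openssl@3.0 from Homebrew; if they're missing, install them first (this assumes Homebrew is already set up):
brew install gdbm openssl@3.0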
# Install Python on a Mac
# Builds Python 3.11 from source, to use PyTorch MPS on an old Intel Mac
mkdir -p ~/ws
git clone https://github.com/python/cpython ~/ws/cpython
cd ~/ws/cpython
git checkout 3.11
GDBM_CFLAGS="-I$(brew --prefix gdbm)/include" \
GDBM_LIBS="-L$(brew --prefix gdbm)/lib -lgdbm" \
ac_cv_working_openssl_ssl=yes \
ac_cv_working_openssl_hashlib=yes \
CPPFLAGS="-I$(brew --prefix openssl@3.0)/include" \
LDFLAGS="-L$(brew --prefix openssl@3.0)/lib" \
PKG_CONFIG_PATH="$(brew --prefix openssl@3.0)/lib/pkgconfig" \
./configure \
--with-openssl="$(brew --prefix openssl@3.0)" \
--enable-framework \
--enable-optimizations \
--with-lto
echo "export CPLUS_INCLUDE_PATH=/Library/Frameworks/Python.framework/Versions/3.11/Headers" >> ~/.setting
make -s -j8
sudo make install
sudo ln -s -f /usr/local/bin/python3.11 /usr/local/bin/python
sudo ln -s -f /usr/local/bin/python3.11 /usr/local/bin/python3
sudo ln -s -f /usr/local/bin/pip3.11 /usr/local/bin/pip
sudo ln -s -f /usr/local/bin/pip3.11 /usr/local/bin/pip3
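Afterwards, a quick sanity check on the new interpreter should show the current OS as the deployment target (14.1 here is my machine's version; yours will differ):
python3 --version
# Python 3.11.x
python3 -c "import platform; print(platform.platform())"
# macOS-14.1-x86_64-i386-64bit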
Comparing MPS against the CPU with a simple benchmark gives the following. For reference, I'm on a 13-inch MacBook Pro that only has an iGPU.
import torch
import timeit
x = torch.randn(1000, 1000)
print(timeit.timeit("x @ x", number=10, globals=globals()))
# 0.0907975767040625
y = x.to("mps")
print(timeit.timeit("y @ y; torch.mps.synchronize()", number=10, globals=globals()))
# 4.1007534249999935 (very slow)
As you can see, it's extremely slow. Don't use it...
Worse, if you run pytest test/test_mps.py, you'll find that inference doesn't even produce correct results, because Apple's MPS library itself misbehaves…
matrixMultiplicationWithPrimaryTensor spits out wrong values. You can verify this with the following code:
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
#import <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

@interface MatMul : NSObject {
    MPSGraph *_graph;
    MPSGraphTensor *_aTensor;
    MPSGraphTensor *_bTensor;
    MPSGraphTensor *_resultTensor;
}
- (instancetype)initWithDevice:(id<MTLDevice>)device;
- (void)matmul:(id<MTLBuffer>)A B:(id<MTLBuffer>)B C:(id<MTLBuffer>)C aShape:(MPSShape *)aShape bShape:(MPSShape *)bShape cShape:(MPSShape *)cShape;
@end

@implementation MatMul

- (instancetype)initWithDevice:(id<MTLDevice>)device
{
    if (self = [super init]) {
        _graph = [[MPSGraph alloc] init];
    }
    return self;
}

- (void)matmul:(id<MTLBuffer>)A B:(id<MTLBuffer>)B C:(id<MTLBuffer>)C aShape:(MPSShape *)aShape bShape:(MPSShape *)bShape cShape:(MPSShape *)cShape
{
    @autoreleasepool {
        MPSDataType dataType = MPSDataTypeFloat32;
        // Wrap the input Metal buffers as MPSGraph tensor data.
        MPSGraphTensorData *aTensorData = [[MPSGraphTensorData alloc] initWithMTLBuffer:A shape:aShape dataType:dataType];
        MPSGraphTensorData *bTensorData = [[MPSGraphTensorData alloc] initWithMTLBuffer:B shape:bShape dataType:dataType];
        // Build the graph: result = A @ B.
        _aTensor = [_graph placeholderWithShape:aShape dataType:dataType name:NULL];
        _bTensor = [_graph placeholderWithShape:bShape dataType:dataType name:NULL];
        _resultTensor = [_graph matrixMultiplicationWithPrimaryTensor:_aTensor secondaryTensor:_bTensor name:NULL];
        MPSGraphTensorDataDictionary *inputs = @{_aTensor: aTensorData, _bTensor: bTensorData};
        MPSGraphTensorDataDictionary *result = [_graph runWithFeeds:inputs targetTensors:@[_resultTensor] targetOperations:@[[_resultTensor operation]]];
        // Read the result back into the C buffer.
        MPSGraphTensorData *resultData = result[_resultTensor];
        [resultData.mpsndarray readBytes:C.contents strideBytes:nil];
    }
}

@end

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();

        // (1, 128) @ (128, 1) -> (1, 1); with all-ones inputs the true result is 128.
        MPSShape *aShape = @[@1, @128];
        MPSShape *bShape = @[@128, @1];
        MPSShape *cShape = @[@1, @1];
        unsigned long aSize = (unsigned long)[aShape[0] intValue] * (unsigned long)[aShape[1] intValue];
        unsigned long bSize = (unsigned long)[bShape[0] intValue] * (unsigned long)[bShape[1] intValue];
        unsigned long cSize = (unsigned long)[cShape[0] intValue] * (unsigned long)[cShape[1] intValue];
        float *A = (float *)malloc(sizeof(float) * aSize);
        float *B = (float *)malloc(sizeof(float) * bSize);
        float *C = (float *)malloc(sizeof(float) * cSize);
        for (int i = 0; i < aSize; i++) {
            A[i] = 1;
        }
        for (int i = 0; i < bSize; i++) {
            B[i] = 1;
        }

        id<MTLBuffer> A_buffer = [device newBufferWithBytes:A length:sizeof(float) * aSize options:MTLResourceStorageModeShared];
        id<MTLBuffer> B_buffer = [device newBufferWithBytes:B length:sizeof(float) * bSize options:MTLResourceStorageModeShared];
        id<MTLBuffer> C_buffer = [device newBufferWithBytes:C length:sizeof(float) * cSize options:MTLResourceStorageModeShared];

        MatMul *matmul = [[MatMul alloc] initWithDevice:device];
        [matmul matmul:A_buffer B:B_buffer C:C_buffer aShape:aShape bShape:bShape cShape:cShape];

        // Copy everything back to host memory and print A, B, and the result C.
        memcpy(A, A_buffer.contents, sizeof(float) * aSize);
        memcpy(B, B_buffer.contents, sizeof(float) * bSize);
        memcpy(C, C_buffer.contents, sizeof(float) * cSize);
        for (int i = 0; i < aSize; i++) {
            printf("%f ", A[i]);
        }
        printf("\n");
        for (int i = 0; i < bSize; i++) {
            printf("%f ", B[i]);
        }
        printf("\n");
        for (int i = 0; i < cSize; i++) {
            printf("%f ", C[i]);
        }
        printf("\n");
        free(A);
        free(B);
        free(C);
    }
    return 0;
}
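To reproduce it, save the code as e.g. matmul.m (any filename works) and compile it against the Metal frameworks:
clang -fobjc-arc matmul.m -o matmul -framework Foundation -framework Metal -framework MetalPerformanceShaders -framework MetalPerformanceShadersGraph
./matmul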
Output: 32, whereas the true result is 128 (a length-128 dot product of ones).
Implementing code like the one below makes it more or less usable, but the Intel iGPU + MPS combination is so slow anyway that it's not worth using.