SPEECH - Audio-to-Text Transcription with AI

What is this Node?

SPEECH is the node that transcribes audio to text using speech recognition (Speech-to-Text) models. It supports three providers: OpenAI Whisper, Google Cloud Speech-to-Text, and Groq Whisper (ultra-fast).

Why does this Node exist?

Audio is the most natural form of communication, but systems need to process text. SPEECH exists for:

Universal Accessibility: Lets the system process voice messages automatically
Content Analysis: Transcribes audio for sentiment analysis, entity extraction, and other operations
Service Automation: Understands voice commands without human intervention
Provider Flexibility: Lets you choose between speed (Groq), accuracy (Google), or versatility (OpenAI)

How does it work internally?

When SPEECH runs, the system:

Validates input: Checks that audioUrl or audioBase64 was provided; if both are present, audioBase64 takes precedence
Obtains the audio: If a URL, downloads it; if base64, decodes it into a Buffer
Selects the provider: Uses the configured provider (default: OpenAI)
Sends to the API: Makes a request to the chosen service (OpenAI/Google/Groq)
Processes the response: Extracts the transcribed text, confidence, language, and metadata; for OpenAI and Groq the confidence is a fixed placeholder (0.95 and 0.93), because those APIs do not return one
On error: Throws an exception with the provider's detailed message
On success: Returns the full transcription with statistics
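
The shape of the success result can be sketched as TypeScript types. The type names (SpeechNodeOutput, TranscriptionResult) and the sample values are hypothetical and only illustrative; the field names mirror the objects returned by the internal code, which itself is typed as any.

```typescript
// Hypothetical type names; field names follow the objects the node returns.
interface TranscriptionResult {
  text: string;
  confidence: number;          // fixed placeholder for OpenAI (0.95) and Groq (0.93)
  language: string;
  duration: number | null;     // null when the provider does not report it
  wordsCount: number;
  provider: 'openai' | 'google' | 'groq';
  alternatives?: unknown[];    // only set by the Google branch
  segments?: unknown[] | null; // only set by the Groq branch
}

interface SpeechNodeOutput {
  success: true;
  action: 'speech_transcribed';
  provider: string;            // the provider value exactly as configured
  audio: {
    source: 'base64' | 'url';
    url: string | null;        // null when the audio arrived as base64
    language: string;
    sizeBytes: number;
  };
  transcription: TranscriptionResult;
  timestamp: string;           // ISO 8601
}

// Illustrative values only, not captured from a real run.
const exampleOutput: SpeechNodeOutput = {
  success: true,
  action: 'speech_transcribed',
  provider: 'groq',
  audio: { source: 'base64', url: null, language: 'pt-BR', sizeBytes: 48213 },
  transcription: {
    text: 'Quero consultar meu saldo',
    confidence: 0.93,
    language: 'pt',
    duration: 2.4,
    wordsCount: 4,
    provider: 'groq',
    segments: null,
  },
  timestamp: '2024-01-01T12:00:00.000Z',
};

// A downstream step would usually read only the transcribed text.
function transcribedText(output: SpeechNodeOutput): string {
  return output.transcription.text.trim();
}
```

A later node (for example a sentiment step) would typically consume transcription.text and ignore the rest.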

Internal code (ai-processing-executor.service.ts:240-524):

private async executeSpeech(parameters: any, context: any): Promise<any> {
  const {
    audioSource,
    audioUrl,
    audioBase64,
    language,
    provider,
    apiKey
  } = parameters;

  this.logger.log(`🎤 SPEECH - Transcribing audio using ${provider || 'OpenAI Whisper'}`);

  // Determine audio source
  let audioBuffer: Buffer;
  let audioUrlToUse: string | null = null;

  if (audioBase64) {
    // Audio provided as base64 (from Evolution API)
    audioBuffer = Buffer.from(audioBase64, 'base64');
    this.logger.log(`📦 Audio received as base64 (${audioBuffer.length} bytes)`);
  } else if (audioUrl) {
    // Audio provided as URL - download it
    audioUrlToUse = audioUrl;
    this.logger.log(`🔗 Downloading audio from URL: ${audioUrl}`);
    audioBuffer = await this.downloadAudio(audioUrl);
  } else {
    throw new Error('Either audioUrl or audioBase64 is required for speech transcription');
  }

  // Select transcription provider
  const transcriptionProvider = provider || 'openai'; // Default: OpenAI Whisper

  let transcriptionResult: any;

  try {
    switch (transcriptionProvider.toLowerCase()) {
      case 'openai':
        transcriptionResult = await this.transcribeWithOpenAI(audioBuffer, language, apiKey);
        break;

      case 'google':
      case 'google-cloud':
        transcriptionResult = await this.transcribeWithGoogle(audioBuffer, language, apiKey);
        break;

      case 'groq':
        transcriptionResult = await this.transcribeWithGroq(audioBuffer, language, apiKey);
        break;

      default:
        throw new Error(`Unsupported transcription provider: ${transcriptionProvider}`);
    }

    return {
      success: true,
      action: 'speech_transcribed',
      provider: transcriptionProvider,
      audio: {
        source: audioBase64 ? 'base64' : 'url',
        url: audioUrlToUse,
        language: language || 'pt-BR',
        sizeBytes: audioBuffer.length
      },
      transcription: transcriptionResult,
      timestamp: new Date().toISOString()
    };

  } catch (error) {
    this.logger.error(`❌ Speech transcription failed with ${transcriptionProvider}:`, error.message);
    throw new Error(`Speech transcription failed: ${error.message}`);
  }
}

/**
 * Transcribe audio using OpenAI Whisper API
 * Docs: https://platform.openai.com/docs/guides/speech-to-text
 */
private async transcribeWithOpenAI(
  audioBuffer: Buffer,
  language?: string,
  apiKey?: string
): Promise<any> {
  if (!apiKey) {
    throw new Error('OpenAI API key is required. Please configure it in the node properties.');
  }

  const openaiKey = apiKey;

  this.logger.log(`🤖 Transcribing with OpenAI Whisper (${audioBuffer.length} bytes)`);

  try {
    // Create form data for OpenAI API
    const FormData = require('form-data');
    const formData = new FormData();
    formData.append('file', audioBuffer, {
      filename: 'audio.ogg', // Evolution API usually sends OGG
      contentType: 'audio/ogg',
    });
    formData.append('model', 'whisper-1');

    if (language) {
      formData.append('language', language === 'pt-BR' ? 'pt' : language);
    }

    const response = await firstValueFrom(
      this.httpService.post(
        'https://api.openai.com/v1/audio/transcriptions',
        formData,
        {
          headers: {
            ...formData.getHeaders(),
            'Authorization': `Bearer ${openaiKey}`,
          },
          timeout: 60000, // 60 seconds
        }
      )
    );

    const text = response.data.text || '';

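    return {
      text: text,
      confidence: 0.95, // Fixed placeholder: the Whisper API returns no confidence score
      // `language` and `duration` are only present in verbose_json responses; with the
      // default response format requested here, the fallbacks below apply
      language: response.data.language || language || 'pt',
      duration: response.data.duration || null,
      wordsCount: text.split(/\s+/).filter(w => w.length > 0).length,
      provider: 'openai'
    };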

  } catch (error) {
    this.logger.error(`❌ OpenAI Whisper error:`, error.response?.data || error.message);
    throw new Error(`OpenAI transcription failed: ${error.response?.data?.error?.message || error.message}`);
  }
}

/**
 * Transcribe audio using Google Cloud Speech-to-Text API
 * Docs: https://cloud.google.com/speech-to-text/docs/reference/rest
 */
private async transcribeWithGoogle(
  audioBuffer: Buffer,
  language?: string,
  apiKey?: string
): Promise<any> {
  if (!apiKey) {
    throw new Error('Google Cloud API key is required. Please configure it in the node properties.');
  }

  const googleKey = apiKey;

  this.logger.log(`☁️ Transcribing with Google Cloud Speech (${audioBuffer.length} bytes)`);

  try {
    const audioBase64 = audioBuffer.toString('base64');

    const requestBody = {
      config: {
        encoding: 'OGG_OPUS', // Evolution API format
        languageCode: language || 'pt-BR',
        enableAutomaticPunctuation: true,
        model: 'default',
      },
      audio: {
        content: audioBase64,
      },
    };

    const response = await firstValueFrom(
      this.httpService.post(
        `https://speech.googleapis.com/v1/speech:recognize?key=${googleKey}`,
        requestBody,
        {
          timeout: 60000,
        }
      )
    );

    const results = response.data.results || [];
    if (results.length === 0) {
      throw new Error('No transcription results from Google Cloud');
    }

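    // Join the top alternative of each result, skipping results without alternatives
    const transcript = results
      .map((result: any) => result.alternatives?.[0]?.transcript ?? '')
      .filter((part: string) => part.length > 0)
      .join(' ');

    // Confidence is optional in the response; 0.9 is a fallback placeholder
    const confidence = results[0]?.alternatives?.[0]?.confidence || 0.9;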

    return {
      text: transcript,
      confidence: confidence,
      language: language || 'pt-BR',
      duration: null,
      wordsCount: transcript.split(/\s+/).filter(w => w.length > 0).length,
      provider: 'google',
      alternatives: results.flatMap((r: any) => r.alternatives ?? []),
    };

  } catch (error) {
    this.logger.error(`❌ Google Cloud Speech error:`, error.response?.data || error.message);
    throw new Error(`Google transcription failed: ${error.response?.data?.error?.message || error.message}`);
  }
}

/**
 * Transcribe audio using Groq Whisper API (ultra-fast!)
 * Docs: https://console.groq.com/docs/speech-text
 */
private async transcribeWithGroq(
  audioBuffer: Buffer,
  language?: string,
  apiKey?: string
): Promise<any> {
  if (!apiKey) {
    throw new Error('Groq API key is required. Please configure it in the node properties.');
  }

  const groqKey = apiKey;

  this.logger.log(`⚡ Transcribing with Groq Whisper (${audioBuffer.length} bytes) - ULTRA FAST!`);

  try {
    // Create form data for Groq API
    const FormData = require('form-data');
    const formData = new FormData();
    formData.append('file', audioBuffer, {
      filename: 'audio.ogg',
      contentType: 'audio/ogg',
    });
    formData.append('model', 'whisper-large-v3'); // Whisper large-v3 as hosted by Groq

    if (language) {
      formData.append('language', language === 'pt-BR' ? 'pt' : language);
    }

    formData.append('response_format', 'verbose_json'); // Get detailed info

    const response = await firstValueFrom(
      this.httpService.post(
        'https://api.groq.com/openai/v1/audio/transcriptions',
        formData,
        {
          headers: {
            ...formData.getHeaders(),
            'Authorization': `Bearer ${groqKey}`,
          },
          timeout: 30000, // Groq is MUCH faster than OpenAI
        }
      )
    );

    const text = response.data.text || '';

    return {
      text: text,
      confidence: 0.93, // Groq doesn't provide confidence
      language: response.data.language || language || 'pt',
      duration: response.data.duration || null,
      wordsCount: text.split(/\s+/).filter(w => w.length > 0).length,
      provider: 'groq',
      segments: response.data.segments || null, // Segment-level timestamps (verbose_json)
    };

  } catch (error) {
    this.logger.error(`❌ Groq Whisper error:`, error.response?.data || error.message);
    throw new Error(`Groq transcription failed: ${error.response?.data?.error?.message || error.message}`);
  }
}
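
The OpenAI and Groq branches above convert only pt-BR to pt; any other regional tag (for example en-US) is forwarded unchanged, although the Whisper language parameter is documented as an ISO-639-1 code. A minimal sketch of a more general conversion, assuming a hypothetical helper named toWhisperLanguage that is not part of the original service, together with the word-count expression the three branches repeat:

```typescript
// Hypothetical helper, not in the original service: keep only the primary
// language subtag of a tag such as "pt-BR" or "en_US" and lowercase it.
// Note: a primary subtag can be three letters (e.g. "yue"), which is not an
// ISO-639-1 code; this sketch passes such values through unchanged.
function toWhisperLanguage(tag: string): string {
  const [primary = ''] = tag.trim().split(/[-_]/);
  return primary.toLowerCase();
}

// Same counting rule as the listing: split on whitespace, drop empty pieces.
function countWords(text: string): number {
  return text.split(/\s+/).filter((word) => word.length > 0).length;
}
```

With such a helper, the two form-data branches could append toWhisperLanguage(language) instead of special-casing pt-BR, while the Google branch would keep the full regional tag, since it sends languageCode as provided.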

When SHOULD you use this Node?

Use SPEECH whenever you need automatic conversion of audio to text:

Use cases

Voice Support: "Transcribe a customer's audio message to process the order automatically"
Sentiment Analysis: "Convert audio to text for analysis with the AI node"
Voice Commands: "Let users control the bot by audio (e.g. 'consultar saldo', that is, 'check balance')"
Medical Documentation: "Transcribe symptom reports to create a medical record"
Virtual Assistants: "Process voice commands in conversational applications"

When NOT to use SPEECH

Audio for storage: Use the MEDIA node if you only need to save or manage audio files
Speaker identification: SPEECH does not identify WHO is speaking, only WHAT is said
Audio translation: SPEECH only transcribes; use the AI node to translate the text afterwards

Detailed Parameters

audioUrl (string, conditional)

What it is: Public URL of the audio file to transcribe (the Evolution API usually provides it automatically).

Default: None (if omitted, audioBase64 must be provided)
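
The conditional relationship between audioUrl and audioBase64 can be sketched as a small resolver. The names SpeechParameters, AudioSource, and resolveAudioSource are hypothetical; the parameter names and the error message come from the internal code. The http(s) scheme check is an extra assumption added for illustration and is not performed by the original service.

```typescript
// Hypothetical parameter shape; field names follow the node's parameters.
interface SpeechParameters {
  audioUrl?: string;
  audioBase64?: string;
  language?: string;
  provider?: string;
  apiKey?: string;
}

type AudioSource =
  | { kind: 'base64'; data: string }
  | { kind: 'url'; url: string };

function resolveAudioSource(params: SpeechParameters): AudioSource {
  // audioBase64 wins when both are present, matching the node's own order
  if (params.audioBase64) {
    return { kind: 'base64', data: params.audioBase64 };
  }
  if (params.audioUrl) {
    // Assumption for this sketch: only http(s) URLs are accepted
    if (!/^https?:\/\//i.test(params.audioUrl)) {
      throw new Error(`audioUrl must be an http(s) URL: ${params.audioUrl}`);
    }
    return { kind: 'url', url: params.audioUrl };
  }
  throw new Error('Either audioUrl or audioBase64 is required for speech transcription');
}
```

Resolving the source before any download makes the "conditional" rule explicit: audioUrl is required only when audioBase64 is absent.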