- Ollama now supports multimodal models through its new engine, starting with vision models such as Llama 4 Scout and Gemma 3.
- Llama 4 Scout is a 109 billion parameter model capable of answering location-based questions about video frames.
- Gemma 3 can analyze multiple images at once and identify common elements, such as animals appearing in all images.
- Qwen 2.5 VL is used for document scanning and character recognition, including translating Chinese spring couplets to English.
- Ollama's new engine improves the reliability and accuracy of local inference, and lays the groundwork for future modalities such as speech, image generation, and video generation.
- Model modularity ensures each model is self-contained, simplifying integration for creators and developers.
- Accuracy improvements include correctly handling large images and preserving the positional information of image tokens during processing.
- Memory management improvements include image caching and optimizations that reduce memory usage during inference.
- Ollama collaborates with hardware manufacturers to optimize inference on various devices.
- Future goals include supporting longer context sizes, reasoning, tool calling with streaming responses, and enabling computer use.
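The vision workflow described above can be sketched as a request to Ollama's local REST API (`POST /api/chat` on the default port 11434), which accepts base64-encoded images in a message's `images` field. This is a minimal sketch: the model name, prompt, and placeholder image bytes below are illustrative, not taken from the source.

```python
import base64
import json

def build_chat_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for Ollama's POST /api/chat endpoint.

    Vision models accept base64-encoded images in a message's
    `images` list alongside the text prompt.
    """
    payload = {
        "model": model,  # e.g. a vision model pulled locally
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        "stream": False,  # ask for a single complete response
    }
    return json.dumps(payload)

# Placeholder bytes stand in for a real image file; in practice you would
# read a photo with open("photo.png", "rb").read() and POST the body to
# http://localhost:11434/api/chat with any HTTP client.
fake_image = bytes.fromhex("89504e470d0a1a0a")  # PNG magic bytes only
body = build_chat_request("gemma3", "What animal appears in this image?", fake_image)
```

Sending the same structure with several entries in the `images` list is how a model like Gemma 3 can be asked to compare multiple images in one prompt.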