Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products.
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance.
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns, which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools

 

Intel® Parallel Studio XE ›

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multicore and many-core systems.

Intel® Software Development Products

Explore all tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Intel® Xeon Phi™ Coprocessor code named “Knights Landing” - Application Readiness
By admin | Posted 09/15/2014 | 0 comments
As part of the application readiness efforts for future Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (code named Knights Landing), developers are interested in improving two key aspects of their workloads: vectorization/code generation and thread parallelism. This article mainly talks a...
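Both aspects can be exercised together with OpenMP 4.0 constructs that the Intel compilers already support. The minimal Fortran sketch below is illustrative only (the array names and sizes are not from the article); it pairs thread parallelism over the outer loop with explicit SIMD vectorization of the inner loop.

program readiness_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1024, m = 1024
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j

  allocate(a(n,m), b(n,m), c(n,m))
  a = 1.0d0
  b = 2.0d0

  !$omp parallel do                       ! thread parallelism across columns
  do j = 1, m
     !$omp simd                           ! explicit vectorization of the inner loop
     do i = 1, n
        c(i,j) = a(i,j) * b(i,j) + 0.5d0
     end do
  end do
  !$omp end parallel do

  print *, 'c(1,1) =', c(1,1), '  max threads =', omp_get_max_threads()
end program readiness_sketch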
Using Intel® VTune™ Amplifier XE to Tune Software on the Intel® Xeon® Processor E5 v3 Family
By Jackson Marusarz (Intel) | Posted 11/03/2014 | 0 comments
Download this guide (see Article Attachments, below) to learn how to identify performance issues on software running on the Intel® Xeon® Processor E5 v3 Family (based on Intel® Microarchitecture Codename Haswell). The guide explains the General Exploration Analysis viewpoint available in Intel® V...
Sierpiński Carpet in OpenCL 2.0
By Robert Ioffe (Intel) | Posted 10/29/2014 | 0 comments
We demonstrate how to create a Sierpinski carpet in OpenCL 2.0. Prerequisites: a laptop or a workstation with a 5th Generation Intel® Core™ processor; OpenCL™ Drivers and Runtimes for Intel® Architecture; ...
Diagnostic 15523: Loop was not vectorized: cannot compute loop iteration count before executing the loop.
By Devorah H. (Intel) | Posted 10/29/2014 | 0 comments
Product Version: Intel(R) Visual Fortran Compiler XE 15.0.0.070. Cause: the vectorization report generated when using the Visual Fortran Compiler's optimization options (-O3 -Qopt-report:2) states that the loop was not vectorized because the loop iteration count cannot be computed. Example: An example be...
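The article's own example is truncated above; as a stand-in, here is a hypothetical loop of the shape that typically draws this remark, because its trip count depends on values computed inside the loop:

subroutine damp(x, n, tol)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: tol
  real,    intent(inout) :: x(n)
  real :: err

  err = huge(err)
  do while (err > tol)          ! iteration count cannot be computed before executing the loop
     x   = 0.9 * x
     err = maxval(abs(x))
  end do
end subroutine damp

Where the algorithm allows, restructuring such logic into a countable DO loop (with the exit test hoisted or converted into a fixed iteration bound) gives the vectorizer a computable trip count.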
Subscribe to Intel Developer Zone Articles
Introduction to OpenMP* on YouTube
By Mike Pearce (Intel) | Posted on 12/03/14 | 0 comments
Tim Mattson (Intel) has authored an extensive series of excellent videos as an introduction to OpenMP*. Not only does he walk through a series of programming exercises in C, he also starts with a background introduction on parallel programming. Check out the series: https://www.youtube.com/watc...
Benefits of Intel® Cache Monitoring Technology in the Intel® Xeon® Processor E5 v3 Family
By Khang Nguyen (Intel) | Posted on 09/08/14 | 0 comments
Introduction: The number of cores is increasing with the introduction of new processors. As more cores are added, the number of diverse workloads that can potentially run simultaneously is also increasing. Workloads can be single-threaded or multi-threaded applications, and they can run in nativ...
Web Resources about Intel® Transactional Synchronization Extensions
By Roman Dementiev (Intel) | Posted on 07/28/14 | 3 comments
Short URL for this page: www.intel.com/software/tsx In this blog I list useful technical resources related to Intel® Transactional Synchronization Extensions (Intel TSX). I will try to keep the list up-to-date as new material becomes available (subscribe to this page below to get update notifica...
Additional AVX-512 instructions
By James Reinders (Intel) | Posted on 07/17/14 | 1 comment
Additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512): The Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. As I discussed in my first blog about Intel® AVX-...
Subscribe to Intel Developer Zone Blogs
Intel® Parallel Studio XE SP1 & Intel® Cluster Studio XE SP1
By kathy-farrel (Intel) | 0 replies
Intel® Parallel Studio XE SP1 & Intel® Cluster Studio XE SP1 - What's New - Webinar: Tuesday, September 17, 9 a.m. PDT. Please join us for a technical presentation on the new features found in the recently released Intel® Parallel Studio XE 2013 SP1 and Intel® Cluster Studio XE SP1. This release includes support for compilers and performance analysis on Intel® Xeon Phi™ on Windows*. The technical presentation will briefly cover new features for both C++ and Fortran on Linux*, Windows*, and OS X* operating systems, as well as the error checking and performance profiling tools. Learn how to efficiently boost your application performance! Not too late - Register Now. Learn about Upcoming Webinars
Multi-Threading
By Mayur B. | 0 replies
Hello everyone, I want to solve a sparse matrix (for solving linear equations) in minimum time. Right now I am using the "pardiso" function from the Intel MKL library (version 10.3), but this function takes too long. Is there any other function available in the latest version that meets the minimum-time requirement? Could you please help me? Thanks in advance. Mayur
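One thing worth checking before looking for a different routine is whether MKL is actually allowed to use all of your cores, since PARDISO is threaded. A minimal sketch using MKL's service routines follows; the thread count of 4 is purely illustrative, and linking against MKL is assumed.

program mkl_threads_check
  implicit none
  integer, external :: mkl_get_max_threads   ! MKL service function
  call mkl_set_num_threads(4)                ! illustrative; setting MKL_NUM_THREADS also works
  print *, 'MKL may use up to', mkl_get_max_threads(), 'threads'
end program mkl_threads_check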
My new invention: Scalable distributed sequential lock
By aminer10 | 2 replies
Hello, Scalable Distributed Sequential Lock version 1.01. Author: Amine Moulay Ramdane. Email: aminer@videotron.ca. Description: This scalable distributed sequential lock was invented by Amine Moulay Ramdane. It combines the characteristics of a distributed reader-writer lock with those of a sequential lock, so it is a clever hybrid reader-writer lock that is more powerful than Dmitry Vyukov's distributed reader-writer mutex, because Vyukov's distributed reader-writer lock becomes slower and slower on the writer side as more cores are added, since it transfers too many cache lines between cores on the writer side. My scalable distributed sequential lock eliminates this weakness of Vyukov's distributed reader-writer mutex, so writer throughput becomes much faster, and it also eliminates the weakness of the seqlock (sequential lock), that is, "live...
interlocked or not interlocked?
By Rudolf M. | 1 reply
I'm using InterlockedCompareExchange to set a variable to my id (something like "while(0 != InterlockedCompareExchange(&var, myId, 0)) ::Sleep(100);"). Now no other thread will change this variable until it becomes 0 again. After using it, I could do an "InterlockedExchange(&var, 0);" or simply "var = 0;". I'm not sure, but I think this doesn't change much. Which one is the better solution? Which one is faster? Or is one even wrong? I thought the second one could be the faster one when I don't expect to see a lot of threads trying to "take" this variable at the same time. Is that correct?
OpenMP Block gives false results
By Jack S. | 1 reply
Hi all, I would appreciate your thoughts on where I might have gone wrong using OpenMP. I parallelized this code pretty straightforwardly, yet even with a single thread (i.e., call omp_set_num_threads(1)) I get wrong results. I have checked with Intel Inspector, and I do not have a race condition, yet the Inspector tool indicated (as a warning) that a thread might approach another thread's stack (I get this warning in other code I have, and that code runs well with OpenMP). I'm pretty sure this is not related to the problem. Thanks, Jack.
SUBROUTINE GR(NUMBER_D, RAD_D, RAD_CC, SPECT)
use TERM,only: DENSITY, TEMPERATURE, VISCOSITY, WATER_DENSITY, &
               PRESSURE, D_HOR, D_VER, D_TEMP, QQQ, UMU
use SATUR,only: FF, A1, A2, AAA, BBB, SAT
use DELTA,only: DDM, DT
use CONST,only: PI, G
IMPLICIT NONE
INTEGER,INTENT(IN) :: NUMBER_D
DOUBLE PRECISION,INTENT(IN) :: RAD_CC(NUMBER_D), SPECT(NUMBER_D)
DOUBLE PRECISION,INTENT(INOUT) :: RAD_D(NUMBER_D)
DOUBLE PRECISION :: R3, ...
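Without the full routine it is hard to diagnose, but two patterns account for most wrong OpenMP answers: scratch variables (or module variables pulled in by the USE lines) written inside the loop without a PRIVATE clause, and private copies that are read before being set, which is the variant that still bites with a single thread. A made-up sketch of the first pattern (the names are illustrative, not Jack's):

subroutine volume_sketch(n, rad, vol)
  implicit none
  integer, intent(in)           :: n
  double precision, intent(in)  :: rad(n)
  double precision, intent(out) :: vol(n)
  double precision, parameter   :: pi = 3.141592653589793d0
  double precision :: r3        ! scratch value recomputed every iteration
  integer :: k

  !$omp parallel do private(r3)   ! omitting PRIVATE(r3) lets threads overwrite each other
  do k = 1, n
     r3     = rad(k)**3
     vol(k) = 4.0d0 / 3.0d0 * pi * r3
  end do
  !$omp end parallel do
end subroutine volume_sketch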
OpenMP 4.0 task depend too limited would TBB be better?
By Nicholas B. | 0 replies
Hello, I have been looking at task depend in OpenMP 4.0, but it looks like it is too limited for what I want to do. To do what I want, it would need to accept a vector subscript in the array section of the depend clause. My code would look something like this:
type cell_type
  ...
contains
  procedure :: process
end type cell_type
type(cell_type), dimension(n) :: cells
type edge_type
  integer, dimension(:), allocatable :: icells
  ...
contains
  procedure :: process
end type edge_type
type(edge_type), dimension(m) :: edges
! a bit like a c++ std::vector<std::vector<int>>
edges(1)%icells = [1, 5, 7, 8, 100] ! edge 1 depends on cells 1, 5, 7, 8 and 100
edges(2)%icells = [1, 2, 4]         ! edge 2 depends on cells 1, 2 and 4
...
do i=1,n
  !$omp task depend(out:cells(i))
  call cells(i)%process()
  !$omp end task
end do
do j=1,m
  ! next line not allowed
  !$omp task depend(out:edges(j)) depend(in:cells(edges(j)%icells))
  call edges(j)%process(cells)
  !$omp end task
end d...
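OpenMP 4.0 does disallow vector subscripts in a depend list, and Intel® TBB's flow graph can express such irregular per-edge dependencies directly. If staying in Fortran/OpenMP, one coarse but legal fallback is to over-approximate the dependencies and separate the two phases. A hypothetical fragment continuing the snippet above (it assumes an enclosing !$omp parallel / !$omp single region):

! Finish every cell task before any edge task starts, trading task-level
! overlap for correctness when per-edge dependences cannot be expressed.
do i = 1, n
   !$omp task depend(out:cells(i)) firstprivate(i)
   call cells(i)%process()
   !$omp end task
end do
!$omp taskwait                     ! over-approximates the per-edge dependences
do j = 1, m
   !$omp task firstprivate(j)      ! edge tasks no longer need depend clauses
   call edges(j)%process(cells)
   !$omp end task
end do
!$omp taskwait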
Nested OMP on Xeon Phi using OMP4
By james B. | 3 replies
Xeon Phi has 60 cores and 4 threads per core. I am writing an experiment that will have 1 master thread on each core, and each of these will spawn 4 slave threads. Looking at the manual https://software.intel.com/en-us/node/512835 it seems that I want to set the environment variables: MIC_OMP_NESTED=TRUE, MIC_OMP_PROC_BIND="spread, close", MIC_OMP_NUM_THREADS=60. Is this correct? I've tested this and it doesn't die... Is there a way I can get the runtime to spit out affinity debug info about where it is actually placing things, so I can be certain? Cheers, James
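With the Intel OpenMP runtime, setting KMP_AFFINITY=verbose on the coprocessor side makes it print where each thread is bound, and OMP_DISPLAY_ENV=true (OpenMP 4.0) echoes the ICVs it actually picked up. For a quick in-program check, here is a small Fortran sketch using only standard OpenMP API calls; the team sizes are illustrative and not a statement about the right settings for this experiment.

program check_nesting
  use omp_lib
  implicit none

  call omp_set_nested(.true.)                ! or OMP_NESTED=TRUE in the environment

  !$omp parallel num_threads(60)             ! one "master" per core (illustrative)
  !$omp parallel num_threads(4)              ! four workers spawned by each master
  !$omp critical
  print '(a,i2,a,i3,a,i2)', ' level ', omp_get_level(),            &
        '  outer thread ', omp_get_ancestor_thread_num(1),          &
        '  inner thread ', omp_get_thread_num()
  !$omp end critical
  !$omp end parallel
  !$omp end parallel
end program check_nesting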
Slowdown with OpenMP
By Matt S. | 8 replies
I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether it be on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual, as it initializes arrays starting at 0 (and some even negative). For example,
real*8 :: gx(0:Nx)
real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1)
where Nx is, let's say, 512. Would that possibly have anything to do with the ubiquitous slowdown with OpenMP? Also, any ideas on reducing "pow" overhead in the following snippet would be greatly appreciated:
do k = 1, 5
  hgck = foo_c(k)
  hgpk = foo_p(k)
  do j = 1, 100
    vx = vx + hgck * ux(x, t, foo(j) + hgpk)
  end do
end do
where ux is a function defined by
function ux(x,t,xi)
impl...
Subscribe to Forums
