Several driving forces have recently brought about significant advances in the field of configurable computing. They have also enabled parallel processing within a single field-programmable gate array (FPGA) chip. The ever-increasing complexity of application algorithms and the supercomputing crisis have made this new parallel-processing approach more important and pertinent. Its cost-effectiveness provides system designers with the greatest flexibility while imposing many challenges to current hardware and software codesign methodologies. This paper explores practical hardware and software design and implementation issues for FPGA-based configurable multiprocessors, based on the authors' first-hand experience with a shared-memory implementation of parallel LU factorization for sparse block-diagonal-bordered (BDB) matrices. We also propose a new dynamic load balancing strategy for parallel LU factorization on our system. Performance results are included to prove the viability of this new multiprocessor design approach.